Now that we know a little more about how computers represent things (especially integers), there are a lot of ways we can use that to work with those values in our code.

We have already seen the sign flag, `SF`

that we can use with the `js`

and `jns`

instructions.

But all the processor is doing: copy the highest bit from the calculation and call it `SF`

. So, `SF`

is useful and convenient, but we're just inspecting one bit from the result and using it to make a decision.

We can do the same thing with other bits, if that's useful.

For example, when we test for even/odd integers, this is the typical C code:

if ( n%2 == 0 ) { printf("even"); } else { printf("odd"); }

We *could* implement that directly in assembly: `idiv`

, and look at the remainder.

That would mean using `idiv`

(recall: it divides `%rdx:%rax`

by its argument, which **must** be a register, and puts the remainder in `%rdx`

). This will compute `rdx = rdi%2`

:

mov $2, %r8 mov $0, %rdx mov %rdi, %rax idiv %r8

The problem: doing the division is *very* expensive: it takes 39–103 cycles to do a 64-bit `idiv`

on Haswell [Agner Fog's Instruction tables, p. 248].

We can do much better. We really just need to know if the least-significant bit is set: odd numbers have a 1 in the least-significant bit; even numbers don't.

e.g. \(6_{10} = 00000110_2\) and \(-1_{10} = 11111111_2\).

The `and`

instruction does bit-wise AND between its two operands: bit-by-bit, do an AND operation on each position, and that's the result.

The number 1 has **only** the least significant bit set, so if we AND it with anything, we're left with only the least-significant bit:

00101101 11101010 AND 00000001 AND 00000001 = 00000001 = 00000000

If we do that and get zero, it was even.

We can use that to do the same thing: this code also does `rdx = rdi%2`

:

mov %rdi, %rdx and $1, %rdx

That does the same thing as the `idiv`

solution and is about 24 times faster (with a quick test on one computer).

The lesson: computers are good at manipulating bits. If you can phrase your operation directly as a bit manipulation, it will probably be fast.

Also, not all instructions are equal: the processor has a lot more work to do for `idiv`

than `and`

. They both look like one line of assembly code, but there's more to know about how the processor handles them.

Should we change our C code in this case?

No. No no no no. tl;dr no.

The compiler's optimizer is quite capable of transforming the code that everybody understands and can read at a glance (`n%2==0`

) into fast assembly code (and possibly faster that what we would write by hand).

Write beautiful code and let the compiler do its job.

But there are some more things we can do in assembly, which might sometimes change the way we approach things in other languages. Abstractions leak.

There are often cases where you have to store a lot of yes/no or on/off or similar Boolean values. e.g. you have millions of records, and each might be displayed

or not, each might be active

or not, or any other boolean properties you can imagine.

Several bits like that can be stored as a bit field that can represent many bits in a single integer.

I'm going to switch to C.

The basic operations we need are bit-by-bit AND, OR, NOT, and maybe XOR. The equivalents in assembly and C (and many other programming languages):

Operation | Assembly | C |
---|---|---|

bitwise-AND | `and $1, %rdi` | `n = n & 1;` |

bitwise-OR | `or $1, %rdi` | `n = n | 1;` |

bitwise-XOR | `xor $1, %rdi` | `n = n ^ 1;` |

bitwise-NOT | `not %rdi` | `n = ~n;` |

Note: be careful not to confuse the bitwise operators (`&`

, `|`

, `~`

) with the boolean operators (`&&`

, `||`

, `!`

).

The bitwise operators work on integers, and the boolean operators work on boolean values (`true`

and `false`

) and do different operations, but because C is so permissive, you can legally use either of them in a `if`

or `while`

condition.

What we're going to do: store several single-bit (boolean) values in an integer (probably unsigned, but the bits don't care).

Each bit position can hold one of our fields with 1 = on/yes/true and 0 = off/no/false.

I'll define some compile-time constants for the type (to make it clear which integers are bit fields vs actual integers), and integers with a single bit set for each field we want to store.

#define BIT_FIELD uint8_t #define TALL 0x01 // 00000001 #define SHARP 0x02 // 00000010 #define LOUD 0x04 // 00000100 #define COOL 0x08 // 00001000 #define SHY 0x10 // 00010000

i.e. each record can be tall

or not, sharp

or not, loud

or not, …

A record with all fields off

is easy enough:

BIT_FIELD record = 0;

We can turn whichever bits we want on

by ORing together the constants:

BIT_FIELD record = SHARP | LOUD;

That will be a value with (only) the sharp

and loud

bits set.

We can check to see if bits are set with `&`

and one of the constants: we'll get a zero if the bit is zero, and something else otherwise.

If we look at the value

and check to see if the `SHARP | LOUD`

`LOUD`

bit is set, the bitwise-AND will be (expressed as binary and hexadecimal):

0b00000110 0x06 AND 0b00000100 0x04 = 0b00000100 0x04

That implies logic like this to check if a bit is set (recalling that the C `if`

tests for non-zero values):

if ( record & LOUD ) { printf("is loud\n"); } else { printf("is not loud\n"); }

(If you put an integer in a C `if`

condition like this, there's an implicit

comparison.)`!=0`

If we want to take a bitfield value and set (to 1) a specific bit:

// set the LOUD bit record |= LOUD;

To clear (to 0) a bit:

// clear the TALL bit record &= ~LOUD;

record &= ~LOUD;

That last example might deserve more explanation:

LOUD == 0x04 == 0b00000100 ~LOUD == 0b11111011

If we bitwise-AND that with anything, the result will be unchanged, except the `LOUD`

bit will be cleared to 0.

Remember how I said before: the `cmp`

is really just `sub`

, but it throws away the results (and sets the status bits)? That works because comparing differences to zero is useful.

\[ a-b < 0 \quad\Leftrightarrow\quad a < b \\ a-b > 0 \quad\Leftrightarrow\quad a > b \\ a-b = 0 \quad\Leftrightarrow\quad a = b \]

… and the status bits are set for positive/negative/zero after arithmetic operations.

It turns out that the `test`

instruction just does `and`

but throws away the result.

We have always used `test`

like this:

test %rdi, %rdi

… and bitwise-AND-ing anything with itself doesn't change the value. So, the result was comparing the original value (with zero because of the way we look at the status flags).

But we can use `test`

with other operands. If we want to check a bit, we can AND with a single-bit constant, just like in C.

We can define assembly-time constants in GNU Assembler code like this:

.equ TALL, 0x01 .equ SHARP, 0x02 .equ LOUD, 0x04 .equ COOL, 0x08 .equ SHY, 0x10

The equivalent of the `is_loud`

function in assembly:

is_loud: test $LOUD, %rdi jz print_not_loud print_loud: ... ret print_not_loud: ... ret

The equivalent of `record = SHARP | LOUD`

(using `%rbx`

as the variable):

mov $LOUD, %rbx or $SHARP, %rbx

(The compiler can do that at compile-time. We're going to do it at runtime here.)

To clear a bit (`data &= ~LOUD`

):

mov $LOUD, %rax not %rax and %rax, %rbx

To set a bit (`data |= LOUD`

):

or $LOUD, %rbx

(Again, the compiler can do `~LOUD`

at compile-time.)

We observed when first looking at binary: if you add a 0 to the right of a base-10 number, you multiply by 10, and for an unsigned binary (base-2) number, you multiply by 2.

\[ 00001101_2 = 13_{10} \\ 00011010_2 = 26_{10} \]

And in the opposite direction: if you drop a 0, you are dividing by the base.

If there's a 1 in the least-significant position of an unsigned binary value, dropping it divides by 2, rounding down.

\[ 00011011_2 = 27_{10}\\ 00001101_2 = 13_{10} \]

For signed values, it's the same story if we shift everything to the left and fill in a 0:

\[\begin{align*} 11111011_2 &= -5_{10}\\ 11110110_2 &= -10_{10} \end{align*}\]

When shifting to the right, we need to sign-extend to fill in the bits for signed values:

\[ 11111011_2 = -5_{10}\\ \boldsymbol{1}1111101_2 = -3_{10} \]

\[ 11111010_2 = -6_{10}\\ \boldsymbol{1}1111101_2 = -3_{10} \]

If we were thinking of these as unsigned values, filling in the left-most bit with 0 would have got the correct result: 251 and 250 would both shift one place right to 125.

In general, we can use that shift operation to multiply and divide by powers of two: shifting left \(n\) positions (filling with 0s) is equivalent to multiplying by \(2^n\).

And shifting right (filling with 0s for unsigned, and sign-extending for signed) is equivalent to dividing by \(2^n\) and rounding down.

In assembly, the `shr`

(shift right) and `shl`

(shift left) shift **unsigned** values. The `sar`

(shift-arithmetic right) and `sal`

(shift-arithmetic left) instructions are intended to shift **signed** values.

They take two operands and can shift by as many places as you like. This multiples `rdi`

by \(8=2^3\).

shl $3, %rdi

The shift instructions set the `CF`

flag to the *first* bit shifted out.

Or in table form:

Instr | Action |
---|---|

`shl $2, %rdi` | shift left, filling 0s |

`sal $2, %rdi` | shift left, filling 0s |

`shr $2, %rdi` | shift right, filling 0s |

`sar $2, %rdi` | shift right, filling sign bit |

Note: `shl`

and `sal`

are effectively the same instruction.

This only helps if we're multiplying/dividng by powers of two, but that's common and might be faster. e.g. this multiplies by 33 (\(=2^5+1\)):

mul33: mov %rdi, %rax sal $5, %rax # rax = 32 * n add %rdi, %rax # rax += n ret

This code also multiplies by 33:

mul33: mov %rdi, %rax imul $33, %rax ret

Which is faster? Traditionally, we would assume multiplication is slow and shifting+adding is fast. It's unclear if that's true in modern processors (see lab 5).

If you're in anything other than assembly, my advice is to write readable code.

uint64_t mul33(uint64_t n) { return 33*n; }

Let the compiler decide what implementation of that logic is faster.

An example where thinking about bitwise operations leads to a surprising algorithm…

The problem: count the number of 1 bits in a binary value. e.g. in a bitfield value, how many of the bits are set? This is known as the value's population count.

What I would have done: one iteration for each bit:

uint8_t popcount1(uint64_t n) { uint8_t count = 0; while (n > 0) { if (n % 2 == 1) { // 1 in the 1s place count += 1; } n >>= 1; // shift right 1 } return count; }

But look what happens if we subtract one from any unsigned integer:

001101001000 - 000000000001 = 001101000111

On the right, we had to borrow until we got a 1: it changes to 0 and 1s fill in to the right. In other words, subtracting one changes all the right-most 0s to 1s, and the first 1 to a 0.

Then if we do a bitwise-AND of \(n\) and \(n-1\):

n = 001101001000 n-1 = 001101000111 n & (n-1) = 001101000000

So, `n & (n-1)`

is `n`

but with the lowest-set bit cleared.

That leads to an algorithm (credited to Brian Kernighan) that also counts the set bits, but with one iteration per **set** bit, and no `if`

:

uint8_t popcount2(uint64_t n) { uint8_t count = 0; while (n > 0) { count += 1; n &= n - 1; // clear the lowest set bit } return count; }

There are other algorithms for population count, which are also fun.

It turns out, it's a common enough operation (and/or easy enough to build a circuit for) that there's an x86-64 instruction for it (after ≈2008), and a way to ask for it (or a fallback implementation) in C:

popcount: popcnt %rdi, %rax

uint8_t popcount3(uint64_t n) { return (uint8_t)__builtin_popcountll(n); }

The lesson, again: thinking in bits can lead to surprising algorithms. Knowing your processor helps too.