Leftovers

Leftovers

A few quick topics, because

  1. they're relevant,
  2. they're interesting, [citation needed]
  3. we have time. [citation needed]

Endianness

We have been ignoring a detail of how our values are stored in memory.

The 32-bit integer 12345678₁₆ is definitely 4 bytes: 0x12, 0x34, 0x56, 0x78, but are they stored the way you'd expect?

So far, there's no reason to know the answer: we simply treat the 4 bytes as a single unsplittable value and the processor does the arithmetic.

Endianness

But we can explore: we can store that value, create a pointer to it, and look at the bytes.

uint32_t n = 0x12345678;
uint8_t *p = (uint8_t*)&n;  // get a pointer to bytes
for(unsigned i=0; i<4; i++ ) {
    printf("*(p+%d) == 0x%02x\n", i, *(p+i));
}
*(p+0) == 0x78
*(p+1) == 0x56
*(p+2) == 0x34
*(p+3) == 0x12

Endianness

That's likely not the order you'd expect: the byte representing the lowest powers of two is stored first, and the highest-order byte is last.

i.e. the bits start at the little end of the value. This computer is little endian.

Endianness

That story is true for every computer I have physical access to, but on an emulated MIPS machine, that same code gives this output:

*(p+0) == 0x12
*(p+1) == 0x34
*(p+2) == 0x56
*(p+3) == 0x78

This is a big endian architecture. Either endianness is reasonable: it's simply a choice made by the architecture.

Endianness

Usually you can ignore the endianness of your architecture: if you do a calculation (in assembly or a programming language) on a certain data type, the processor knows how to deal with it.

Endianness only becomes visible if you're dealing with multi-byte values as individual bytes, generally if you're storing or transmitting them.
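That's why portable code that stores or transmits integers usually picks a byte order and builds the bytes with shifts, so the host's endianness never matters. A sketch (the function names are mine):

```c
#include <stdint.h>

/* Write n into buf in big-endian order, regardless of host endianness. */
void put_u32_be(uint8_t *buf, uint32_t n) {
    buf[0] = (uint8_t)(n >> 24);  /* highest-order byte first */
    buf[1] = (uint8_t)(n >> 16);
    buf[2] = (uint8_t)(n >> 8);
    buf[3] = (uint8_t)(n);
}

/* Read a big-endian 32-bit value back out of buf. */
uint32_t get_u32_be(const uint8_t *buf) {
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
         | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

Because the shifts operate on the value, not on its in-memory bytes, the same code produces the same byte sequence on a little- or big-endian machine.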

Endianness

For example, being told that some bytes in a file represent a 32-bit unsigned integer isn't a complete description you can use to interpret those bytes.

Being given a 32-bit big-endian unsigned integer is something you can deal with.

Endianness

For example, Unicode characters can be encoded as UTF-32 in two different ways: UTF-32LE (little-endian integers) and UTF-32BE (big-endian integers).

Again, UTF-32 encoded text isn't a complete description but UTF-32LE encoded text is.

On the other hand, UTF-8 clearly defines the order of each byte, so endianness isn't relevant to it.

Endianness

There's a Unicode character zero width no-break space: a space that takes up no width and doesn't allow a line break there. It seems completely useless.

But it's character number FEFF₁₆ (U+FEFF) and it's common to include it as the first character in a file/stream to make the endianness detectable. In this context, it's called the byte-order mark.
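A sketch of that detection for UTF-32, assuming we get to peek at the first four bytes of the stream (the function name is mine):

```c
#include <stdint.h>

/* Inspect the first 4 bytes of a UTF-32 stream for a byte-order mark
   (U+FEFF). Returns 1 for big endian, -1 for little endian, 0 if no BOM. */
int utf32_bom(const uint8_t *b) {
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xfe && b[3] == 0xff)
        return 1;   /* 00 00 FE FF: big-endian U+FEFF */
    if (b[0] == 0xff && b[1] == 0xfe && b[2] == 0x00 && b[3] == 0x00)
        return -1;  /* FF FE 00 00: little-endian U+FEFF */
    return 0;
}
```

The trick works because the byte-reversed interpretation, FFFE₁₆, is deliberately not a valid character.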

Endianness

There's a similar question of how SIMD registers are arranged: if I load eight 32-bit integers into %ymm0, which integer goes first?

I've never had to worry about it, but if you find yourself shuffling values around, maybe you do.

Alignment

When storing values in memory, there can be a speed penalty for some operations if a value is not aligned.

Alignment requirements can vary but generally, a value is aligned if its memory address is divisible by its size. e.g. an int64_t is aligned if a pointer to it is divisible by 8.

Alignment

This is why there were multiple instructions for SIMD moves: some that might be slightly faster for aligned values, and some that might be slower but more general.

  • movapd: Move Aligned Packed Double Precision Floating-Point Values.
  • movupd: Move Unaligned Packed Double Precision Floating-Point Values.
  • movdqa: Move Aligned Packed Integer Values.
  • movdqu: Move Unaligned Packed Integer Values.

Alignment

Even as assembly programmers, we probably don't have to worry about alignment much.

It's not clear that there's much speed difference between aligned and unaligned values on modern architectures. We can probably just assume everything is unaligned and move on.

Alignment

But compilers do worry about alignment: it's worth it for them to arrange the world to get any speed benefit possible.

The place this is the most visible is how structs and classes (or similar things in other programming languages) are laid out in memory.

C and C++ make a couple of promises about the fields: every field/property is aligned, and they are stored in the order they are defined.

Padding

Consider this struct:

typedef struct  {
    uint32_t a;
    uint8_t  b;
    uint32_t c;
    uint8_t  d;
} stuff;

What if we look at the bytes that represent it?

stuff s = {.a=0x12345678, .b=0xff, .c=0x01010101, .d=0x44};
uint8_t* p = (uint8_t*)&s;
for (unsigned i = 0; i < sizeof(stuff); i++) {
    printf("*(p+%d) == 0x%02x\n", i, *(p + i));
}

Padding

Six padding bytes are added (*(p+5) to *(p+7) and *(p+13) to *(p+15)) so the uint32_t fields are aligned.

*(p+0) == 0x78
*(p+1) == 0x56
*(p+2) == 0x34
*(p+3) == 0x12
*(p+4) == 0xff
*(p+5) == 0x00
*(p+6) == 0x00
*(p+7) == 0x00
*(p+8) == 0x01
*(p+9) == 0x01
*(p+10) == 0x01
*(p+11) == 0x01
*(p+12) == 0x44
*(p+13) == 0x00
*(p+14) == 0x00
*(p+15) == 0x00

Padding

Or, that struct is actually the same as this:

typedef struct  {
    uint32_t a;
    uint8_t  b;
    uint8_t  padding1;
    uint8_t  padding2;
    uint8_t  padding3;
    uint32_t c;
    uint8_t  d;
    uint8_t  padding4;
    uint8_t  padding5;
    uint8_t  padding6;
} stuff2;

… except this version makes the padding bytes explicit.

Padding

Rust makes no promise about the order of fields in a struct in memory: they must be aligned but it's free to rearrange them.

struct Stuff {
    a: u32,
    b: u8,
    c: u32,
    d: u8,
}

It could put .b and .d together so less padding is required.

Padding

let s = Stuff { a: 0x12345678, b: 0xff, c: 0x01010101, d: 0x44 };
let p = &s as *const _ as *const u8;
for i in 0..(mem::size_of::<Stuff>() as isize) {
    println!("*p.offset({}) == 0x{:02x}",
             i, unsafe {*p.offset(i)});
}
*p.offset(0) == 0x78
*p.offset(1) == 0x56
*p.offset(2) == 0x34
*p.offset(3) == 0x12
*p.offset(4) == 0x01
*p.offset(5) == 0x01
*p.offset(6) == 0x01
*p.offset(7) == 0x01
*p.offset(8) == 0xff
*p.offset(9) == 0x44
*p.offset(10) == 0x00
*p.offset(11) == 0x00

Padding

The Rust compiler (with the version I used, with those compilation options, on my computer, on that day) put .b after .c to minimize padding bytes.

You can ask for a C-compatible memory layout if needed. This would be memory-equivalent to the C struct a few slides ago.

#[repr(C)]
struct Stuff {
    a: u32,
    b: u8,
    c: u32,
    d: u8,
}

System Calls

As much as writing assembly reveals how the system really works, we haven't really ever interacted with anything other than registers and memory.

How does a program write to a file, or read network traffic, or how is it possible to read the clock? How does malloc get us our memory?

System Calls

The syscall instruction is the key: it's a way to ask the operating system to take over and do something for us.

It's conceptually like a function call, but has a different calling convention. It isn't just a fancy jump like call: our process is paused while the OS kernel does whatever we requested, then the OS resumes our process.

System Calls

The first difference: we don't make a syscall to anything in particular. The syscall instruction has no operands. The system call number has to be put in %rax before the syscall. That's how the OS knows what operation we want done. See the Linux Syscall Table or Linux kernel syscall tables.

Other arguments are put into registers differently than for call, but it's the same general idea.

System Calls

Let's print from assembly: that will be the write syscall, which is number 1. From the table, I see the arguments:

  • %rdi: the integer file descriptor (standard output is fd 1).
  • %rsi: pointer to the start of the string.
  • %rdx: the number of bytes to write.

System Calls

Let's get the string sorted out first:

.data
text:
    .ascii "some text to print\n"
text_len:
    .quad 19

Then the arguments and the syscall (in .text):

mov $1, %rax              # == write
mov $1, %rdi              # == stdout's file descriptor
lea text(%rip), %rsi      # pointer to first byte
mov text_len(%rip), %rdx  # number of bytes
syscall

System Calls

Or we can ask the OS to create a directory with the mkdir syscall (number 83). We need the directory name as a null-terminated string:

.data
dirname:
    .ascii "created_dir\0"

And fire the syscall:

mov $83, %rax             # == mkdir
lea dirname(%rip), %rdi   # pointer to directory name
mov $0755, %rsi           # mode (leading 0 = octal literal)
syscall

System Calls

Or exit the process, syscall 60:

mov $60, %rax  # == exit
mov $0, %rdi   # exit code (0 = success)
syscall

System Calls

One syscall of note: brk which is used to change the data segment size. Basically: a way to tell the OS how much memory we need. Increasing the break allocates memory; decreasing it deallocates it.

That's what malloc and free have been using on our behalf this whole time.

System Calls

We can infer what malloc and free must be doing: keeping a table of each allocation so they know what memory can be re-used or deallocated after a free.

They need to call brk as necessary to request memory from, or return memory to the OS. The brk syscall could fail: that's when malloc would return a null pointer indicating we didn't get the memory.

System Calls

Until now, our assembly code relied on helpers.c to make syscalls for us, including exit. This is the first truly-standalone assembly program we have seen.

Tail Calls

One of the earlier compiler optimizations we saw was tail call optimization, where a function immediately returns the result of another function call (possibly a recursive call, but possibly a call to some other function).

int f1(???) {
    ???
    return f2(???);
}

Tail Calls

Consider what a tail call must look like in assembly. If we immediately return a return value, it will be something like:

outside_code:
    call f1  # return position 1 ↓
    # ???
f1:
    # ???
    call f2  # return position 2 ↓
    # return value is already in %rax, so just...
    ret  # to position 1
f2:
    # ???
    ret  # to position 2

Tail Calls

As f2 is running, the stack must look like this:

[figure: the stack during a tail call]

There can be nothing else from f1 on the stack since it's about to ret.

As f2 is about to return, it must pop all of its stuff. Then its ret pops return position 2. Then f1's ret pops return position 1.

Tail Calls

The observation that lets tail call optimization work: returning from f1 has to: (1) get the return value in %rax, (2) clean up the stack, (3) jump back to the instruction after it was called (return position 1).

Tail Calls

Transforming the code to this would get all of that done.

outside_code:
    call f1  # return position 1 ↓
    # ???
f1:
    # ???
    jmp f2  # instead of call
    # we never return to here
f2:
    # ???
    ret  # actually to position 1

Tail Calls

With that code, in the middle of f2, the stack must look like this:

[figure: the stack during a tail call, after the optimization]

So, the ret in f2 will actually return to return position 1: where f1 was called from.

Tail Calls

For a tail recursive function, the same transformation would change this:

recursive_func:
    # ???
    j?? base_case
recursive_case:    
    # ???
    call recursive_func
    ret
base_case:
    # ???
    ret

Tail Calls

…to this:

recursive_func:
    # ???
    j?? base_case
recursive_case:    
    # ???
    jmp recursive_func
base_case:
    # ???
    ret

i.e. exactly the assembly code for a while loop, with no recursive call.