A few quick topics, because
We have been ignoring a detail of how our values are stored in memory.
The 32-bit integer 1234567816 is definitely 4 bytes: 0x12
, 0x34
, 0x56
, 0x78
, but are they stored the way you'd expect?
So far, there's no reason to know the answer: we simply treat the 4 bytes as a single unsplittable value and the processor does the arithmetic.
But we can explore: we can store that value, create a pointer to it, and look at the bytes.
uint32_t n = 0x12345678; uint8_t *p = (uint8_t*)&n; // get a pointer to bytes for(unsigned i=0; i<4; i++ ) { printf("*(p+%d) == 0x%02x\n", i, *(p+i)); }
*(p+0) == 0x78 *(p+1) == 0x56 *(p+2) == 0x34 *(p+3) == 0x12
That's likely not the order you'd expect: the byte representing the lowest-powers-of-two is stored first, and the highest-order bits are last.
i.e. the bits start at the little
end of the value. This computer is little endian.
That story is true for every computer I have physical access to, but on an emulated MIPS machine, that same code gives this output:
*(p+0) == 0x12 *(p+1) == 0x34 *(p+2) == 0x56 *(p+3) == 0x78
This is a big endian architecture. Either endianness is reasonable and it's just a choice of the architecture.
Usually you can ignore the endianness of your architecture: if you do a calculation (in assembly or a programming language) on a certain data type, the processor knows how to deal with it.
Endianness only becomes visible if you're dealing with multi-byte values as individual bytes, generally if you're storing or transmitting them.
For example, being told that some bytes in a file represent a 32-bit unsigned integer
isn't a complete description you can use to interpret those bytes.
Being given a 32-bit big-endian unsigned integer
is something you can deal with.
For example, Unicode characters can be encoded as UTF-32 in two different ways: UTF-32LE (little-endian integers) and UTF-32BE (big-endian integers).
Again, UTF-32 encoded text
isn't a complete description but UTF-32LE encoded text
is.
On the other hand, UTF-8 clearly defines the order of each byte, so endianness isn't relevant to it.
There's a character in Unicode zero width no-break space
, i.e. a space that takes up no space and doesn't allow line breaks there. Seems completely useless.
But it's character number FEFF16 and it's common to include it as the first character in a file/stream to make the endianness detectable. In this context, it's called the byte-order mark.
There's a similar question of how SIMD registers are arranged: if I load eight 32-bit integers into %ymm0
, which integer goes first
?
I've never had to worry about it, but if you find yourself shuffling values around, maybe you do.
When storing values in memory, there can be a speed penalty for some operations if it is not aligned.
Alignment requirements can vary but generally, a value is aligned
if its memory address is divisible by its size. e.g. a int64_t
is aligned if a pointer to it is divisible by 8.
This is why there were multiple instructions for SIMD moves: some that might be slightly faster for aligned values, and some that might be slower but more general.
Even as assembly programmers, we probably don't have to worry about alignment much.
It's not clear that there's much speed different for aligned/unaligned values in modern architectures. We can probably just assume everything is unaligned and move on.
But compilers do worry about alignment: it's worth it for them to arrange the world to get any speed benefit possible.
The place this is the most visible is how struct
s and classes (or similar things in other programming languages) are laid out in memory.
C and C++ make a couple of promises about the fields: every field/property is aligned, and they are stored in the order they are defined.
Consider this struct:
typedef struct { uint32_t a; uint8_t b; uint32_t c; uint8_t d; } stuff;
What if we look at the bytes that represent it?
stuff s = {.a=0x12345678, .b=0xff, .c=0x01010101, .d=0x44}; uint8_t* p = (uint8_t*)&s; for (unsigned i = 0; i < sizeof(stuff); i++) { printf("*(p+%d) == 0x%02x\n", i, *(p + i)); }
Six padding bytes are added (*(p+5)
to *(p+7)
and *(p+13)
to *(p+15)
) so the uint32_t
fields are aligned.
*(p+0) == 0x78 *(p+1) == 0x56 *(p+2) == 0x34 *(p+3) == 0x12 *(p+4) == 0xff *(p+5) == 0x00 *(p+6) == 0x00 *(p+7) == 0x00 *(p+8) == 0x01 *(p+9) == 0x01 *(p+10) == 0x01 *(p+11) == 0x01 *(p+12) == 0x44 *(p+13) == 0x00 *(p+14) == 0x00 *(p+15) == 0x00
Or, that struct is actually the same as this:
typedef struct { uint32_t a; uint8_t b; uint8_t padding1; uint8_t padding2; uint8_t padding3; uint32_t c; uint8_t d; uint8_t padding4; uint8_t padding5; uint8_t padding6; } stuff2;
… except this version makes the padding bytes explicit.
Rust makes no promise about the order of fields in a struct in memory: they must be aligned but it's free to rearrange them.
struct Stuff { a: u32, b: u8, c: u32, d: u8, }
It could put .b
and .d
together so less padding is required.
let s = Stuff { a: 0x12345678, b: 0xff, c: 0x01010101, d: 0x44 }; let p = &s as *const _ as *const u8; for i in 0..(mem::size_of::<Stuff>() as isize) { println!("*p.offset({}) == 0x{:02x}", i, unsafe {*p.offset(i)}); }
*p.offset(0) == 0x78 *p.offset(1) == 0x56 *p.offset(2) == 0x34 *p.offset(3) == 0x12 *p.offset(4) == 0x01 *p.offset(5) == 0x01 *p.offset(6) == 0x01 *p.offset(7) == 0x01 *p.offset(8) == 0xff *p.offset(9) == 0x44 *p.offset(10) == 0x00 *p.offset(11) == 0x00
The Rust compiler (with the version I used, with those compilation options, on my computer, on that day) put .b
after .c
to minimize padding bytes.
You can ask for a C-compatible memory layout if needed. This would be memory-equivalent to the C struct
a few slides ago.
#[repr(C)] struct Stuff { a: u32, b: u8, c: u32, d: u8, }
As much as writing assembly reveals how the system really
works, we haven't really ever interacted with anything other than registers and memory.
How does a program write to a file, or read network traffic, or how is it possible to read the clock? How does malloc
get us our memory?
The syscall
instruction is the key: it's a way to ask the operating system to take over and do something for us.
It's conceptually like a function call, but has a different calling convention. It isn't just a fancy jump like call
: our process is paused while the OS kernel does whatever we requested, then the OS resumes our process.
The first difference: we don't make a syscall to anything in particular. The syscall
instruction has no operands. The system call number has to be put in %rax
before the syscall
. That's how the OS knows what operation we want done. See the Linux Syscall Table or Linux kernel syscall tables.
Other arguments are put into registers differently than for call
, but it's the same general idea.
Let's print from assembly: that will be the write
syscall, which is number 1. From the table, I see the arguments:
%rdi
: the integer file descriptor(standard output is fd 1).
%rsi
: pointer to the start of the string.%rdx
: the number of bytes to write.Let's get the string sorted out first:
.data text: .ascii "some text to print\n" text_len: .quad 19
Then the arguments and the syscall (in .text
):
mov $1, %rax # == write mov $1, %rdi # == stdout's file descriptor lea text(%rip), %rsi # pointer to first byte mov text_len(%rip), %rdx # number of bytes syscall
Or we can ask the OS to create a directory with the mkdir
syscall (number 83). We need the directory name as a null-terminated string:
.data dirname: .ascii "created_dir\0"
And fire the syscall:
mov $83, %rax # == mkdir lea dirname(%rip), %rdi # pointer to directory name mov $0755, %rsi # mode (leading 0 = octal literal) syscall
Or exit the process, syscall 60:
mov $60, %rax # == exit mov $0, %rdi # exit code (0 = success) syscall
One syscall of note: brk
which is used to change the data segment size
. Basically: a way to tell the OS how much memory we need. Increasing the break allocates memory; decreasing it deallocates it.
That's what malloc
and free
have been using on our behalf this whole time.
We can infer what malloc
and free
must be doing: keeping a table of each allocation so they know what memory can be re-used or deallocated after a free
.
They need to call brk
as necessary to request memory from, or return memory to the OS. The brk
syscall could fail: that's when malloc
would return a null pointer indicating we didn't get the memory.
In our assembly code before now, we were relying on helpers.c
to make syscalls for us, including exit. This is the first truly-standalone assembly program we have seen.
One of the earlier compiler optimizations we saw was tail call optimization where a function immediately returns the result of another function (possibly a recursive call, but it could also be another function).
int f1(???) { ??? return f2(???); }
Consider what a tail call must look like in assembly. If we immediately return a return value, it will be something like:
outside_code: call f1 # return position 1 ↓ # ??? f1: # ??? call f2 # return position 2 ↓ # return value is already in %rax, so just... ret # to position 1 f2: # ??? ret # to position 2
As f2
is running, the stack must look like this:
There can be nothing else from f1
on the stack since it's about to ret
.
As f2
is a about to return, it must pop all of its stuff. Then its ret
pops return position 2
. Then f1
's ret
pops return position 1
.
The observation that lets tail call optimization work: returning from f1
has to: (1) get the return value in %rax
, (2) clean up the stack, (3) jump back to the instruction after it was called (return position 1
).
Transforming the code to this would get all of that done.
outside_code: call f1 # return position 1 ↓ # ??? f1: # ??? jmp f2 # instead of call # we never return to here f2: # ??? ret # actually to position 1
With that code, in the middle of f2
, the stack must look like this:
So, the ret
in f2
will actually return to return position 1
: where f1
was called from.
For a tail recursive function, the same transformation would change this:
recursive_func: # ??? j?? base_case recursive_case: # ??? call recursive_func ret base_case: # ??? ret
…to this:
recursive_func: # ??? j?? base_case recursive_case: # ??? jmp recursive_func base_case: # ??? ret
i.e. exactly the assembly code for a while
loop, with no recursive call.