We have taken one specific tour of computer systems. But why?
A lot has been said about making code efficient: fewer instructions, less time, etc.
Sometimes efficiency of your code just doesn't matter: if your code doesn't run very often/long, or spends almost all of its time waiting for network, or only makes a few calls to some library functions (Pandas, Node.js, etc) you have no control over, or …, then it's not worth the effort.
But sometimes efficiency does matter.
Every instruction has a cost: in time, electricity, battery life, CO2 emissions, cloud provider bills, etc.
Write good code. Sometimes understanding the system is what lets you do that. (Sometimes it's a better algorithm, different library choice, different programming language choice, … .)
Some topics I wish we had time for but are going to fall off the end of the semester: memory alignment and struct padding; system calls (the syscall instruction).
I don't feel too bad about missing any of those. Semesters have finite length.
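As a one-example taste of the alignment/padding topic, a sketch (the struct and its name are mine, not from the notes): the compiler inserts unused bytes between fields so that each field sits at an address its type requires.

```rust
use std::mem::size_of;

// Fields are stored at "aligned" addresses: the u8 below is followed
// by 3 bytes of padding so the u32 can start at a 4-byte boundary.
#[repr(C)]
struct Padded {
    a: u8,  // 1 byte, then 3 bytes of padding
    b: u32, // 4 bytes
}

fn main() {
    println!("{}", size_of::<Padded>()); // 8, not 5
}
```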
A lot of what we have said about assembly and the underlying hardware applies to other programming languages as well.
Notable differences: most other languages don't have the concept of raw pointers (i.e. where you can add to the pointer and explore elsewhere in memory), or manual free/delete, because they're too hard to get right.
Overall interactions with the processor should be similar to C/C++: same machine instructions, pipelines, branch predictor, etc.
This is particularly true when your code is compiled to machine code. Typical in Rust, Fortran, Go. The details of the language and the compiler's optimizer are going to affect performance, but the processor is fundamentally the same.
Many languages are typically compiled to bytecode: a low-level representation of your logic, but not machine code that runs on any actual processor.
e.g. Java to JVM bytecode; Scala to JVM bytecode; Python to Python bytecode; but possibly also C to LLVM bytecode.
That bytecode is then executed (≈interpreted) by a virtual machine that can run that bytecode quickly.
VMs typically have some performance penalty over native machine code, but it can be surprisingly little. Often because…
A VM can also just-in-time (JIT) compile the bytecode to machine code at runtime. If all goes well, a JIT can get basically the same performance as ahead-of-time compiled machine code.
e.g. JavaScript in your web browser, Python in PyPy or Cinder, Java and the common JVMs, PHP and HHVM, the .NET CLR.
One language I want to say a few things about: Rust. Rust is designed to be fast and compiled to machine code.
We'd expect performance similar to C compiled under clang: both use the LLVM (Low-Level Virtual Machine) compiler framework. Both first compile to LLVM IR (Intermediate Representation, a bytecode), which is then optimized and compiled to machine code.
These are idiomatic ways to sum a vector in Rust:
pub fn vec_sum_method(values: &Vec<f32>) -> f32 { values.iter().sum() }
pub fn vec_sum_fold(values: &Vec<f32>) -> f32 { values.iter().fold(0.0, |a, b| a + b) }
The |a, b| a + b is an anonymous function: a function that takes two arguments (a, b) and returns a+b. The .fold() method expresses "use this function to combine values in an accumulator".
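To see the pieces in isolation, a tiny runnable sketch of .fold() with that anonymous function (the example values are mine):

```rust
fn main() {
    let values = vec![1.0f32, 2.0, 3.0, 4.0];
    // The closure |a, b| a + b combines the accumulator a with each element b.
    let total = values.iter().fold(0.0, |a, b| a + b);
    println!("{}", total); // 10
}
```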
Those get the same performance as (non-vectorized) C or C++. The compiler is capable of transforming these into the same loop we wrote by hand in assembly, or that the compiler wrote from our C for loop.
Every time I write for(i=0; i<len; i++) in C, I die a little inside. I don't care about loop counters or indexing arrays: I want to combine my array values with addition. That's exactly what values.iter().fold(0.0, |a, b| a + b) expresses.
If we don't care about the order of the additions, Rust can express "addition, but assume it's associative" with the fadd_fast intrinsic:
pub fn vec_sum_add_fast(values: &Vec<f32>) -> f32 {
    values.iter().copied().fold(0.0, |a, b| unsafe { fadd_fast(a, b) })
}
Or we can give explicit SIMD operations with the (not yet stabilized) std::simd module.
pub fn vec_sum_simd(values: &Vec<f32>) -> f32 {
    let chunks = values.iter().copied().array_chunks();
    let v: f32x8 = chunks.clone().map(f32x8::from_array).sum();
    let tail: f32 = chunks.into_remainder().unwrap().sum();
    v.as_array().iter().sum::<f32>() + tail
}
If we convince Rust to auto-vectorize, we get the same performance as hand-written SIMD assembly, or vectorclass, or C with -funsafe-math-optimizations.
RUSTFLAGS='-C target-cpu=haswell' cargo run --release
Looking at the way they're compiled, we'd expect the same performance as C or assembly.
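If the nightly-only std::simd isn't available, the same chunked-accumulator idea can be sketched in stable Rust (the function name vec_sum_chunked is mine, not from the notes): keep eight partial sums that the compiler can hold in one SIMD register, then reduce at the end.

```rust
// Sum in 8-wide chunks: the inner loop over a [f32; 8] accumulator
// is the kind of code the auto-vectorizer can turn into SIMD adds.
pub fn vec_sum_chunked(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let chunks = values.chunks_exact(8);
    let tail: f32 = chunks.remainder().iter().sum();
    for chunk in chunks {
        for i in 0..8 {
            acc[i] += chunk[i];
        }
    }
    acc.iter().sum::<f32>() + tail
}

fn main() {
    let values: Vec<f32> = (0..100).map(|i| i as f32).collect();
    println!("{}", vec_sum_chunked(&values)); // 0+1+…+99 = 4950
}
```

Like fadd_fast, this changes the order of the additions, so results can differ slightly from a strict left-to-right sum of floats.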
Or do it in multiple threads with Rayon's ParallelIterator:
pub fn vec_sum_parallel(values: &Vec<f32>) -> f32 {
    values
        .par_iter()
        .copied()
        .reduce(|| 0.0, |a, b| unsafe { fadd_fast(a, b) })
}
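Rayon is an external crate; the same split-and-reduce idea can be sketched with only the standard library's scoped threads (vec_sum_threads and its n_threads parameter are my names, and Rayon's work-stealing scheduler does this chunking far more cleverly):

```rust
use std::thread;

// Split the slice into one chunk per thread, sum each chunk in
// parallel, then combine the partial sums. n_threads must be > 0.
pub fn vec_sum_threads(values: &[f32], n_threads: usize) -> f32 {
    let chunk_len = ((values.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<f32>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let values: Vec<f32> = (0..100).map(|i| i as f32).collect();
    println!("{}", vec_sum_threads(&values, 4)); // 4950
}
```

For a sum this small the thread-spawning overhead swamps any speedup; the pattern only pays off on large inputs.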
Or run some benchmarks with Criterion.
RUSTFLAGS='-C target-cpu=haswell' cargo bench
It runs your code several times to check for variability in runtime and produces an HTML report.
We have been focussing on x86-64 processors.
They have a lot of historical baggage, but are very common in computers. There is also a massive collection of tools and documentation available, which is nice (and why I chose them for this course).
All of the concepts of this course apply to other modern processors. Details differ, but the ideas are the same.
The other processor architecture that you're likely to encounter at the moment is ARM. There are a wide variety of ARM processors: the design is licensed to many companies.
Your phone's or tablet's processor is likely ARM. So are Apple's M1 and M2 processors. Also, NVIDIA Grace, AWS Graviton, Raspberry Pis.
Like x86, ARM isn't new and has gone through many changes. The 64-bit ARM architecture is often referred to as AArch64.
Generally, modern ARM processors have: 31 general-purpose registers plus a stack pointer; pipelined and superscalar execution; L1, L2, L3 caches; 128-bit wide SIMD instructions (Neon).
Even if you don't know ARM assembly, you can probably read it, at least a little. Maybe with some practice.
So, what did we learn?
Probably: don't write assembly to solve problems.
But: you have to know some assembly and how the system works to use it to its fullest.
Unless: you have to write assembly to solve that particular problem in a very specific way that can't be expressed any other way.
Computers are more complex than they seem.
For example, we act as if memory really is uniform "random access" memory, ….
… but none of those are true if you look closely. Abstractions leak.
I propose the most important things from this course to remember a year from now: