We have taken one specific tour of computer systems. What else is out there?
Some topics I wish we had time for are going to fall off the end of the semester, among them memory alignment and padding, and the syscall instruction.

I don't feel too bad about missing any of those. Semesters have finite length.
A lot of what we have said about assembly and C/C++ is true in other languages as well.
Notable differences: most other languages don't have the concept of raw pointers (i.e. where you can add to the pointer and explore elsewhere in memory), or manual free/delete, because they're too hard to get right.
Overall interactions with the processor should be similar to C/C++: same machine instructions, pipelines, branch predictor, etc.
This is particularly true when your code is compiled to machine code. Typical in Rust, Fortran, Go. The details of the language and the compiler's optimizer are going to affect performance, but the processor is fundamentally the same.
Many languages are typically compiled to bytecode: a low-level representation of your logic, but not machine code that runs on any actual processor.
e.g. Java to JVM bytecode; Scala to JVM bytecode; Python to Python bytecode; but possibly also C to LLVM bytecode.
That bytecode is then executed (≈interpreted) by a virtual machine that can run that bytecode quickly.
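To make "bytecode executed by a virtual machine" concrete, here is a toy sketch of a stack-based bytecode VM. The three-instruction set is invented for illustration; real VMs like the JVM or CPython's interpreter work on the same principle, just with far richer instruction sets.

```rust
// Toy stack-machine VM: each Op is one "bytecode" instruction.
enum Op {
    Push(i64), // push a constant onto the stack
    Add,       // pop two values, push their sum
    Mul,       // pop two values, push their product
}

// The "virtual machine": a loop that dispatches on each instruction.
fn run(program: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    for op in program {
        match op {
            Op::Push(v) => stack.push(*v),
            Op::Add => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a + b);
            }
            Op::Mul => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a * b);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    // Bytecode for (2 + 3) * 4
    let program = [Op::Push(2), Op::Push(3), Op::Add, Op::Push(4), Op::Mul];
    println!("{}", run(&program)); // prints 20
}
```

The dispatch loop is exactly why interpreting bytecode costs something: every "instruction" of your program becomes several real machine instructions of match-and-branch overhead, which is the overhead a JIT removes.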
VMs typically have some performance penalty over native machine code, but often surprisingly little. Often because…
A VM can also just-in-time (JIT) compile the bytecode to machine code at runtime. If all goes well, a JIT can get basically the same performance as ahead-of-time compiled machine code.
e.g. JavaScript in your web browser, Python in PyPy or Cinder, Java and the common JVMs, PHP and HHVM, the .NET CLR.
One language I want to say a few things about: Rust. Rust is designed to be fast and compiled to machine code.
We'd expect performance similar to C compiled under clang: both use the LLVM (Low-Level Virtual Machine) compiler framework. Both first compile to LLVM IR (Intermediate Representation, a bytecode), which is then optimized and compiled to machine code.
These are idiomatic ways to sum a vector in Rust:
pub fn vec_sum_fold(values: &Vec<f32>) -> f32 {
    values.iter().fold(0.0, |a, b| a + b)
}

pub fn vec_sum_method(values: &Vec<f32>) -> f32 {
    values.iter().sum()
}
The |a, b| a + b is an anonymous function: a function that takes two arguments (a and b) and returns a+b. The .fold() method expresses "use this function to combine values into an accumulator".
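A small self-contained sketch of fold in action, showing it produces the same answer as .sum():

```rust
fn main() {
    let values = vec![1.0f32, 2.0, 3.0, 4.0];

    // fold: start the accumulator at 0.0, then combine each element into it.
    let total = values.iter().fold(0.0, |acc, x| acc + x);
    println!("{}", total); // prints 10

    // sum() expresses the same reduction more directly.
    let total2: f32 = values.iter().sum();
    assert_eq!(total, total2);
}
```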
Those get the same performance as (non-vectorized) C or C++. The compiler is capable of transforming these into the same loop we wrote by hand in assembly, or that the compiler wrote from our C for loop.
Every time I write for(i=0; i<len; i++) in C, I die a little inside. I don't care about loop counters or indexing arrays: I want to combine my array values with addition. That's exactly what values.iter().fold(0.0, |a, b| a + b) expresses.
If we don't care about the order of the additions, Rust can express "addition, but assume it's associative" with the fadd_fast intrinsic:
// fadd_fast is a nightly-only intrinsic: it needs #![feature(core_intrinsics)]
// and use std::intrinsics::fadd_fast;
pub fn vec_sum_add_fast(values: &Vec<f32>) -> f32 {
    values.iter().copied().fold(0.0, |a, b| unsafe { fadd_fast(a, b) })
}
If we convince Rust to auto-vectorize, we get the same performance as hand-written SIMD assembly, or vectorclass, or C with -funsafe-math-optimizations.
RUSTFLAGS='-C target-cpu=haswell' cargo run --release
Looking at the way they're compiled, we'd expect the same performance as C or assembly.
Or do it in multiple threads with Rayon's ParallelIterator:
// needs use rayon::prelude::*; for par_iter(), plus the nightly fadd_fast intrinsic
pub fn vec_sum_parallel(values: &Vec<f32>) -> f32 {
    values.par_iter().copied().reduce(|| 0.0, |a, b| unsafe { fadd_fast(a, b) })
}
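Rayon handles the chunking and combining for you. As a sketch of the same idea using only the standard library (the function parallel_sum and its structure are mine, not Rayon's): split the slice into chunks, sum each chunk on its own thread, then combine the partial sums.

```rust
use std::thread;

// Sketch of a parallel reduction: one thread per chunk, then combine.
fn parallel_sum(values: &[f32], n_threads: usize) -> f32 {
    // Round up so every element lands in some chunk; guard against size 0.
    let chunk_size = ((values.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        // Spawn a scoped thread per chunk; each returns its partial sum.
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<f32>()))
            .collect();
        // Combine the partial sums on the calling thread.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let values: Vec<f32> = (0..1000).map(|i| i as f32).collect();
    println!("{}", parallel_sum(&values, 4)); // prints 499500
}
```

Note the design point this makes visible: combining partial sums reorders the additions, which is fine for our purposes but is exactly the "assume addition is associative" license that fadd_fast grants.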
Or run some benchmarks with Criterion.
RUSTFLAGS='-C target-cpu=haswell' cargo bench
It runs your code several times to check for variability in runtime and produces an HTML report.
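Here is a crude sketch of the core thing Criterion automates: timing the same code several times and looking at the spread, not a single number. (Criterion adds warmup, statistics, outlier detection, and the report on top.)

```rust
use std::time::Instant;

fn main() {
    let values: Vec<f32> = (0..1_000_000).map(|i| (i % 10) as f32).collect();

    // Time the same computation several times; runtimes will vary from
    // trial to trial (caches warming up, the OS scheduler, etc.).
    for trial in 0..5 {
        let start = Instant::now();
        let total: f32 = values.iter().sum();
        println!("trial {}: sum = {}, took {:?}", trial, total, start.elapsed());
    }
}
```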
We have been focusing on x86-64 processors.
They have a lot of historical baggage, but are very common in computers. There is also a massive collection of tools and documentation available, which is nice.
All of the concepts of this course apply to other modern processors. Details differ, but the ideas are the same.
The other processor architecture that you're likely to encounter at the moment is ARM. There are a wide variety of ARM processors: the design is licensed to many companies.
Your phone's or tablet's processor is likely ARM. So are Apple's M1 and M2 processors. Also, NVIDIA Grace, AWS Graviton, Raspberry Pis.
Like x86, ARM isn't new and has gone through many changes. The 64-bit ARM architecture is often referred to as AArch64.
Generally, modern ARM processors have: 31 general-purpose registers plus a stack pointer; pipelined and superscalar execution; L1, L2, and L3 caches; and 128-bit-wide SIMD instructions (Neon).
Even if you don't know ARM assembly, you can probably read it, at least a little. Maybe with some practice.
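For example, a sum-the-array loop might look like this in AArch64 assembly (an illustrative, untested sketch; the register assignments are mine):

```asm
// x0 = pointer to the floats, x1 = element count, s0 = running sum
loop:
    ldr  s1, [x0], #4     // load one float, post-increment the pointer by 4
    fadd s0, s0, s1       // accumulate into the running sum
    subs x1, x1, #1       // decrement the counter, setting the flags
    b.ne loop             // branch back while the counter is nonzero
```

The shape is the same as the x86 loop we wrote: load, accumulate, decrement, conditional branch. Only the spellings differ.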
So, what did we learn?
Probably: don't write assembly to solve problems.
But: you have to know some assembly and how the system works to use it to its fullest.
Unless: you have to write assembly to solve that particular problem in a very specific way that can't be expressed any other way.
Computers are more complex than they seem. Even "random access" memory is only simple… until you look closely.
I propose the most important things from this course to remember a year from now: