Conclusion

We have taken one specific tour of computer systems. But why?

Conclusion

A lot has been said about making code efficient: fewer instructions, less time, etc.

Sometimes efficiency of your code just doesn't matter: if your code doesn't run very often/​long, or spends almost all of its time waiting for network, or only makes a few calls to some library functions (Pandas, Node.js, etc) you have no control over, or …, then it's not worth the effort.

Conclusion

But sometimes efficiency does matter.

Every instruction has a cost: in time, electricity, battery life, CO2 emissions, cloud provider bills, etc.

Write good code. Sometimes understanding the system is what lets you do that. (Sometimes it's a better algorithm, different library choice, different programming language choice, … .)

Not Covered

Some topics I wish we had time for, but that are going to fall off the end of the semester, each with a one-sentence summary if you're interested:

  • Endianness: within a multi-byte value, is the low-order or high-order byte stored first?
  • Alignment: if a value is at a memory address divisible by its size, it's aligned.
  • Padding: C structs and C++ objects have their elements aligned, so there can be some unused bytes of padding. (A small sketch of all three follows this list.)
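
A minimal sketch of all three in Rust (written here for illustration, not taken from the lecture):

use std::mem::{align_of, size_of};

// #[repr(C)] lays the struct out the way a C compiler would.
#[repr(C)]
struct Mixed {
    a: u8,  // 1 byte, then 3 bytes of padding so that `b` is 4-byte aligned
    b: u32, // 4 bytes
}

fn main() {
    // Endianness: which byte of 0x11223344 goes first in memory?
    let x: u32 = 0x11223344;
    println!("{:02x?}", x.to_le_bytes()); // little-endian: [44, 33, 22, 11] (what x86-64 stores)
    println!("{:02x?}", x.to_be_bytes()); // big-endian:    [11, 22, 33, 44]

    // Alignment and padding: 5 bytes of fields, but an 8-byte struct.
    println!("{}", align_of::<u32>());  // 4
    println!("{}", size_of::<Mixed>()); // 8, because of the padding
}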

Not Covered

  • System calls: you can ask the operating system to do something for you (the syscall instruction); there's a small example after this list.
  • CISC vs RISC: it's in textbooks, but the distinction isn't super meaningful for modern processors.
  • Meltdown & Spectre: security vulnerabilities that take advantage of speculative and out-of-order execution to leak information.
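
A minimal sketch of a raw system call in Rust (Linux x86-64 only, and invented here for illustration: normally the standard library makes these calls for you):

use std::arch::asm;

fn main() {
    let msg = b"hello from a raw syscall\n";
    let written: usize;
    unsafe {
        asm!(
            "syscall",
            inout("rax") 1usize => written, // syscall number 1 = write; return value comes back in rax
            in("rdi") 1usize,               // file descriptor 1 = stdout
            in("rsi") msg.as_ptr(),         // pointer to the bytes to write
            in("rdx") msg.len(),            // how many bytes
            out("rcx") _,                   // rcx and r11 are clobbered by the syscall instruction
            out("r11") _,
        );
    }
    assert_eq!(written, msg.len());
}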

Not Covered

I don't feel too bad about missing any of those. Semesters have finite length.

Other Languages

A lot of what we have said about assembly and C/C++ is true in other languages as well.

Notable differences: most other languages don't have raw pointers (i.e. pointers you can do arithmetic on to explore elsewhere in memory) or manual free/delete, because those are too hard to get right.

Other Languages

Overall interactions with the processor should be similar to C/C++: same machine instructions, pipelines, branch predictor, etc.

This is particularly true when your code is compiled to machine code, as is typical for Rust, Fortran, and Go. The details of the language and the compiler's optimizer are going to affect performance, but the processor is fundamentally the same.

Other Languages

Many languages are typically compiled to bytecode: a low-level representation of your logic, but not machine code that runs on any actual processor.

e.g. Java to JVM bytecode; Scala to JVM bytecode; Python to Python bytecode; but possibly also C to LLVM bytecode.

Other Languages

That bytecode is then executed (≈interpreted) by a virtual machine that can run that bytecode quickly.
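
To make "executed by a virtual machine" concrete, here's a toy stack-machine interpreter in Rust (invented for illustration; real bytecode formats and VMs are far more elaborate). The per-instruction decode-and-dispatch in the loop is the kind of overhead an interpreter adds:

// One "bytecode" instruction for a toy stack machine.
enum Op {
    Push(f32),
    Add,
    Print,
}

// The "virtual machine": fetch each op, decide what it is, do it.
fn run(program: &[Op]) {
    let mut stack: Vec<f32> = Vec::new();
    for op in program {
        match op {
            Op::Push(x) => stack.push(*x),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Print => println!("{}", stack.last().unwrap()),
        }
    }
}

fn main() {
    // the equivalent of "print(1.5 + 2.5)"
    run(&[Op::Push(1.5), Op::Push(2.5), Op::Add, Op::Print]);
}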

VMs typically have some performance penalty over native machine code, but it can be surprisingly little. Often because…

Other Languages

A VM can also just-in-time (JIT) compile the bytecode to machine code at runtime. If all goes well, a JIT can get basically the same performance as ahead-of-time compiled machine code.

e.g. JavaScript in your web browser, Python in PyPy or Cinder, Java and the common JVMs, PHP and HHVM, the .NET CLR.

Rust

One language I want to say a few things about: Rust. Rust is designed to be fast and compiled to machine code.

We'd expect performance similar to C compiled under clang: both use the LLVM (Low-Level Virtual Machine) compiler framework. Both first compile to LLVM IR (Intermediate Representation, a bytecode), which is then optimized and compiled to machine code.
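
If you want to look at that intermediate representation yourself, both compilers can emit it (the file names here are just placeholders):

rustc -O --emit=llvm-ir sum.rs    # writes sum.ll, the LLVM IR for the Rust code
clang -O2 -S -emit-llvm sum.c     # writes sum.ll, the LLVM IR for the C code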

Rust

These are idiomatic ways to sum a vector in Rust:

pub fn vec_sum_method(values: &Vec<f32>) -> f32 {
    values.iter().sum()
}
pub fn vec_sum_fold(values: &Vec<f32>) -> f32 {
    values.iter().fold(0.0, |a, b| a + b)
}

The |a, b| a + b is an anonymous function: a function that takes two arguments (a, b) and returns a+b. The .fold() method expresses "use this function to combine the values into an accumulator."
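
For comparison, the fold is doing the same work as this explicit loop (a hand-written sketch, included here only to spell out the semantics):

pub fn vec_sum_loop(values: &Vec<f32>) -> f32 {
    let mut acc: f32 = 0.0;   // the accumulator, starting at fold's initial value
    for b in values.iter() {
        acc = acc + *b;       // the |a, b| a + b closure, applied once per element
    }
    acc
}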

Rust

Those get the same performance as (non-vectorized) C or C++. The compiler is capable of transforming these into the same loop we wrote by hand in assembly or the compiler wrote from our C for loop.

Every time I write for(i=0; i<len; i++) in C, I die a little inside. I don't care about loop counters or indexing arrays: I want to combine my array values with addition. That's exactly what this expresses.

values.iter().fold(0.0, |a, b| a + b)

Rust

If we don't care about the order of the additions, Rust can express "add these, but assume addition is associative" with the fadd_fast intrinsic:

// needs nightly Rust (the core_intrinsics feature) and use std::intrinsics::fadd_fast;
pub fn vec_sum_add_fast(values: &Vec<f32>) -> f32 {
    values
        .iter()
        .copied()
        .fold(0.0, |a, b| unsafe { fadd_fast(a, b) })
}

Rust

Or we can write explicit SIMD operations with the (not yet stabilized) std::simd module:

// needs nightly Rust (the portable_simd and iter_array_chunks features) and use std::simd::f32x8;
pub fn vec_sum_simd(values: &Vec<f32>) -> f32 {
    let chunks = values.iter().copied().array_chunks();         // groups of 8 elements
    let v: f32x8 = chunks.clone().map(f32x8::from_array).sum(); // 8 running sums, one per SIMD lane
    let tail: f32 = chunks.into_remainder().unwrap().sum();     // the < 8 leftover elements
    v.as_array().iter().sum::<f32>() + tail                     // horizontal sum of the 8 lanes
}

Rust

If we convince Rust to auto-vectorize, we get the same performance as hand-written SIMD assembly, or vectorclass, or C with -funsafe-math-optimizations.

RUSTFLAGS='-C target-cpu=haswell' cargo run --release

Looking at the way they're compiled, we'd expect the same performance as C or assembly.
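
One way to look for yourself, if you're curious (godbolt.org also works):

RUSTFLAGS='-C target-cpu=haswell' cargo rustc --release -- --emit=asm
# the generated .s files end up under target/release/deps/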

Rust

Or do it in multiple threads with Rayon's ParallelIterator:

// needs the rayon crate (use rayon::prelude::*;) plus fadd_fast as above
pub fn vec_sum_parallel(values: &Vec<f32>) -> f32 {
    values
        .par_iter()   // like .iter(), but the work is split across threads
        .copied()
        .reduce(|| 0.0, |a, b| unsafe { fadd_fast(a, b) })
}

Rust

Or run some benchmarks with Criterion.

RUSTFLAGS='-C target-cpu=haswell' cargo bench

It runs your code several times to check for variability in runtime and produces an HTML report.
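
A minimal benchmark sketch (assuming the functions above live in a library crate named, say, vecsum, with criterion listed under [dev-dependencies]):

// benches/sum.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_sums(c: &mut Criterion) {
    let values: Vec<f32> = (0..1_000_000).map(|i| i as f32).collect();
    c.bench_function("vec_sum_fold", |b| b.iter(|| vecsum::vec_sum_fold(&values)));
    c.bench_function("vec_sum_simd", |b| b.iter(|| vecsum::vec_sum_simd(&values)));
}

criterion_group!(benches, bench_sums);
criterion_main!(benches);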

Other Processors

We have been focussing on x86-64 processors.

They have a lot of historical baggage, but are very common in computers. There is also a massive collection of tools and documentation available, which is nice (and why I chose them for this course).

All of the concepts of this course apply to other modern processors. Details differ, but the ideas are the same.

Other Processors

The other processor architecture that you're likely to encounter at the moment is ARM. There are a wide variety of ARM processors: the design is licensed to many companies.

Your phone's or tablet's processor is likely ARM. So are Apple's M1 and M2 processors. Also, NVIDIA Grace, AWS Graviton, Raspberry Pis.

Other Processors

Like x86, ARM isn't new and has gone through many changes. The 64-bit ARM architecture is often referred to as AArch64.

Generally, modern ARM processors have 31 general-purpose registers plus a stack pointer; pipelined, superscalar execution; L1, L2, and L3 caches; and 128-bit-wide SIMD instructions (Neon).

Other Processors

Even if you don't know ARM assembly, you can probably read it, at least a little. Maybe with some practice.

Conclusion

So, what did we learn?

Conclusion

Probably: don't write assembly to solve problems.

But: you have to know some assembly and how the system works to use it to its fullest.

Unless: you have to write assembly to solve that particular problem in a very specific way that can't be expressed any other way.

Conclusion

Computers are more complex than they seem.

  • Processors make it look like they execute one instruction at a time.
  • The way you arrange your data in memory doesn't seem like it matters: it's random access memory.
  • Floating point numbers look like real numbers.
  • Your data doesn't look like it's made of bits.

… but none of those are true if you look closely. Abstractions leak.

Conclusion

I propose these as the most important things from this course to remember a year from now:

  • Compilers are doing their best to write assembly for us. Their best is almost always good enough.
  • Processor cache and memory locality matter.
  • Predictable/​unpredictable branches affect performance by a lot.
  • With the right operations and data layout, SIMD instructions are a huge speed gain.
  • godbolt.org.