Let's start at the low level: computers do things.
What do they do?
By now, we have an implicit definition of a program: a description of some work that a computer can do for us.
A programming language lets a person create these descriptions: it's a way for a person to tell a computer what they want done.
What does a compiler (or other language implementation) do? Short and incomplete answer: it turns your code into something that can run on a processor. Let's assume for now that it's machine code (or close enough: assembly code).
A programming language is designed to be some combination of (1) easy for a person to read/write, and (2) possible to translate to efficient machine code.
C is (said to be) a low-level language: what the programmer expresses is very similar to what the processor is going to do.
Unfortunately, everybody is lying to you…
Let's look at some C:
int a = 1;
int b = 2;
int c = a + b;
return c;
What we imagine happens (in pseudo-assembly):
mov a, 1     ; create variable a
mov b, 2     ; create variable b
add c, a, b  ; add into c
push c       ; put return value in place
ret          ; return
Those instructions are sent to the processor and execute in the order we wrote them. Obvious, but wrong.
Assertion: you can't really understand how your program runs without knowing how the processor runs it, at least at some level.
But the usual description of how a processor runs is from the 1970s. In particular, you (implicitly) learn that instructions execute one at a time, in the order they are written (sequential execution), and that memory is one big uniform array that is equally fast to access anywhere (flat memory).
This is also the model of a processor captured in C: sequential execution and flat memory are underlying assumptions of the way the language looks.
Those statements haven't been true for consumer processors since the 1980s. (e.g. both false for i486 and ARM3, released in 1989.)
At best, the abstraction presented by C is very leaky. At worst it's very misleading.
I'm going to need a modern processor to talk about: an Intel Core i7 6700K, a 4-core Skylake architecture (6th gen Core).
Other modern desktop, laptop, and mobile processors will be similar, but I don't want to spend time on 10 examples.
Sources for facts and/or further reading: Agner Fog's The microarchitecture of Intel, AMD and VIA CPUs, an optimization guide for assembly programmers and compiler makers.
Modern processors are pipelined: instructions are executed in several stages. Each instruction takes several processor cycles to execute.
e.g. a 4-stage pipeline would execute instructions as in the diagram below: one instruction completes every cycle, but 4 are in motion at a time. Wikipedia: Pipeline
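My rendering of the standard diagram, with hypothetical stages fetch (F), decode (D), execute (E), write-back (W):

    cycle:  1   2   3   4   5   6   7
    i1:     F   D   E   W
    i2:         F   D   E   W
    i3:             F   D   E   W
    i4:                 F   D   E   W

Once the pipeline fills, one instruction completes every cycle, with four in flight.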
That means that for code like this…
int a, b, c, d;
a = 2 - 1;
b = 1 + 1;
c = a + b;
d = 8 / 2;
… it seems like the values of a and b are not yet computed when the c = line starts to execute. How do we get the correct value in c?
Skylake: pipelines are 14–19 steps [†]. An instruction's result isn't going to be completely done for a long time.
Either the pipeline must stall (do nothing for a few cycles waiting for results) or maybe execute instructions out-of-order (e.g. move on to the d = line).
Processors do both: programmers/compilers are not responsible for knowing the details of the pipeline.
If there are upcoming instructions independent of the ones in the pipeline, they will be started. If not, stall.
Skylake: can look up to 224 instructions away for something to execute out of order. [‡ §11.1]
It gets worse. Consider this code:
int calculate_something(int a, int b) {
    int result;
    if ( a < b ) {
        result = a + b;
    } else {
        result = a - b;
    }
    return result;
}
As the condition (a < b) is being evaluated, what should go into the pipeline next? An addition or a subtraction operation? Neither? Both?
Whenever there's a branch (i.e. a jump, goto, if, bottom-of-loop, etc.), the processor will do branch prediction and guess what's going to happen.
Code will be speculatively put into the pipeline. If the prediction was correct, then yay.
If the prediction was wrong, that execution must be discarded, the pipeline flushed, and the correct branch started. Penalty: number of cycles ≈ pipeline depth.
If a branch at the bottom of a loop is mis-predicted often, the cost might be greater than everything happening in the loop.
Modern processors do lots of magic to avoid a mis-prediction.
In some cases, a programmer can make the predictor's job easier with consistent patterns of branches. (e.g. It might actually be worth it to sort a list before processing, if that will make more consistent branches. There's a sketch of this after the next point.)
Skylake: can detect patterns at least 29 branches long and predict that they'll continue. [‡ §3.8]
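To make the sorting idea concrete, here's a minimal sketch (my example, not from these notes) that sums the large elements of a random array. On many machines it runs several times faster if the qsort line is uncommented, because the data[i] >= 128 branch becomes perfectly predictable:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        const int n = 1000000;
        int *data = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++)
            data[i] = rand() % 256;

        /* Uncomment to make the branch below predictable: */
        /* qsort(data, n, sizeof(int), cmp_int); */

        long total = 0;
        clock_t st = clock();
        for (int rep = 0; rep < 100; rep++)
            for (int i = 0; i < n; i++)
                if (data[i] >= 128)   /* ~50/50 and unpredictable on random data */
                    total += data[i];
        clock_t en = clock();

        printf("total=%ld  time=%.2f s\n", total, (double)(en - st) / CLOCKS_PER_SEC);
        free(data);
        return 0;
    }

Compile with gcc -O2; at higher optimization levels the compiler may remove the branch entirely, which hides the effect.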
Modern processor cores also have more than one execution unit, so more than one pipeline that can be doing things concurrently.
This is where hyperthreading comes from: one processor core could try to run two separate threads to make the best use of its pipelines.
Skylake: there are 8 execution units that do specialized tasks. They could all be executing instructions all the time, but four is more realistic. [†, ‡ §11.9]
Multiple execution units mean even more instructions can be in-motion at once, making pipeline stalls or branch mispredictions even weirder.
Skylake: up to 180 instructions can be moving at one time [*].
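One place a programmer can see the pipelines and execution units: a sum with a single accumulator is a chain of dependent additions, so each one waits for the previous. A minimal sketch (my example, with hypothetical function names) of breaking the chain with independent accumulators:

    /* One accumulator: every add depends on the previous one,
     * so the adds can't overlap in the pipeline. */
    double sum1(const double *a, int n) {
        double t = 0;
        for (int i = 0; i < n; i++)
            t += a[i];
        return t;
    }

    /* Four independent accumulators: several adds can be in
     * flight at once. (Assumes n is a multiple of 4, to keep
     * the sketch short.) */
    double sum4(const double *a, int n) {
        double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
        for (int i = 0; i < n; i += 4) {
            t0 += a[i];
            t1 += a[i + 1];
            t2 += a[i + 2];
            t3 += a[i + 3];
        }
        return (t0 + t1) + (t2 + t3);
    }

Reordering floating-point additions can change the result slightly, which is why a compiler will only make this transformation itself under flags like -funsafe-math-optimizations (seen in the compile command later).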
Another trick modern processors do: SIMD instructions (Single Instruction, Multiple Data). (You may know them by the Intel branding: MMX, SSE, AVX.)
The idea: you often want to do the same calculation on every element of an array. These instructions do the same operation on many values at once.
e.g. an array of many objects in 3D space that you want to project onto the 2D screen.
Skylake: supports AVX2 instructions, which have 256-bit wide SIMD paths. So, if you're operating on double values, you can work on 4 of them at a time; for 16-bit integers, 16 at a time.
If you use the CPU's serial-calculation instructions instead, the same calculations run 4+ times slower.
Good news: C compilers might notice that you have done SIMD-like things, and automatically transform your code for you: example of SSE instruction generation in C.
Bad news: only with -O3 or -ftree-vectorize, and it might not figure out complex code.
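For instance, a loop like this minimal sketch (my example; the flags are GCC's) is usually simple enough for the auto-vectorizer:

    /* gcc -O3 (or -O2 -ftree-vectorize) can typically turn this loop into
     * SIMD instructions; add -fopt-info-vec to have GCC report what it did. */
    void add_arrays(float *restrict a, const float *restrict b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];   /* the same operation on many adjacent values */
    }

The restrict qualifiers promise the compiler that the arrays don't overlap, which it generally needs to know before it can vectorize.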
The SIMD instructions rely on all of the values being adjacent in memory. If you have an array of struct or class values, then they won't be.
struct scene_object {
    float x, y, z;
    char *label;
};
struct scene_object objs[10000];
An array of struct or class values is arranged field-by-field in memory: x y z label, x y z label, and so on for each element.
If you operate on every objs[i].x, then the values you need aren't adjacent, so SIMD instructions can't be used. An array of structs/objects is something to be suspicious of if performance matters.
A concrete array of structs vs struct of arrays example: suppose we have a big struct that we need an array of:
struct xy {
    double x, y;
    double other, stuff, because, that, happens;
};
Or we could have a struct containing arrays.
struct xy_arr {
    double *restrict x, *restrict y;
    double *other, *stuff, *because, *that, *happens;
};
We initialize the instances we need:
struct xy *s_xy = malloc(N * sizeof(struct xy));
struct xy_arr xy_a;
xy_a.x = malloc(N * sizeof(double));
xy_a.y = malloc(N * sizeof(double));
Then we calculate \(\sum (x + 2y)\) over those values.
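A minimal sketch of the two summation loops (my reconstruction; the complete code linked below may differ in details):

    /* Array of structs: the x values are strided in memory,
     * which makes the loop hard to vectorize. */
    double total_aos = 0;
    for (int i = 0; i < N; i++)
        total_aos += s_xy[i].x + 2 * s_xy[i].y;

    /* Struct of arrays: x and y are each contiguous,
     * so the compiler can use SIMD instructions. */
    double total_soa = 0;
    for (int i = 0; i < N; i++)
        total_soa += xy_a.x[i] + 2 * xy_a.y[i];

Compile: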
gcc -O3 -funsafe-math-optimizations -march=native ...
Speedup with struct of arrays: 2–3 times, depending on the architecture. Complete code. Compiler explorer on the code.
All processors you're likely to encounter have multiple cores. That is, multiple independent processing units which behave like processors on their own. (i.e. each has its own execution units, pipelines, SIMD instructions, etc.)
That includes mobile/embedded processors (often 4–8 cores, e.g. Snapdragon), laptop/desktop (often 4–16 cores), and server (often 16–96, e.g. Xeon, Epyc).
Our example Skylake: 4 cores.
Also remember hyperthreading: each core can run two threads simultaneously.
The implication: single-threaded code is not fast code.
From the programming language perspective, it would be very nice if the language made concurrent code easy and safe.
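For a sense of scale, here is a minimal pthreads sketch (my example; the names are hypothetical) that splits a sum across 4 threads. Note how much boilerplate C makes us write for "add up an array":

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 10000000

    static int data[N];   /* zero-initialized; imagine real values here */

    struct chunk { int start, end; long sum; };

    /* Each thread sums its own slice of the array. */
    static void *sum_chunk(void *arg) {
        struct chunk *c = arg;
        c->sum = 0;
        for (int i = c->start; i < c->end; i++)
            c->sum += data[i];
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        struct chunk chunks[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            chunks[t].start = t * (N / NTHREADS);
            chunks[t].end = (t + 1) * (N / NTHREADS);
            pthread_create(&threads[t], NULL, sum_chunk, &chunks[t]);
        }
        long total = 0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(threads[t], NULL);   /* wait, then combine results */
            total += chunks[t].sum;
        }
        printf("total=%ld\n", total);
        return 0;
    }

Compile with gcc -O2 -pthread.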
By now, the sequential execution assumption is clearly false. Let's look at flat memory.
Your computer has some memory (RAM): maybe 8–32 GB of it. All of your variables live there. You can access it as you like (thus Random Access Memory). But…
Suppose we have a big array (that we think of as 2D)…
const int size = 20000;
int *arr = (int*)malloc(size*size*sizeof(int));
… do something to initialize the values and sum…
total = 0;
st = clock();
for(i=0; i<size; i++) {
    for(j=0; j<size; j++) {
        total += arr[i*size + j];  // sum by row
        // total += arr[j*size + i];  // sum by column
    }
}
en = clock();
Code as-is: 0.14 s. With the other array indexing: 2.6 s.
[Want to try it? Download the code and compile with gcc -O2.]
Why does it matter so much (18× speed difference) whether we iterate by-row vs by-column?
Main memory is much slower than the processor: it takes dozens of cycles to fetch a value from RAM. We don't want the processor to wait that long.
Skylake: 4.00 GHz × 14.06 ns = 56 cycle memory latency [¶ with DDR4-2133].
To deal with this, processors have small amounts of faster memory that hold recently-used data from main memory: memory cache.
But capacity is limited: it will be holding a small amount of memory (a few pages) near what you have just accessed.
Implication: accessing memory close to what has been accessed recently will be many times faster than accessing something far away.
If you're going to traverse data, do it sequentially.
A (classic) linked list is an insane data structure to use in a modern memory architecture.
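To see why: every node is a separate allocation, so each step of a traversal can jump to a distant cache line (a minimal sketch, my example):

    struct node {
        int value;
        struct node *next;   /* points wherever malloc happened to put the next node */
    };

    long sum_list(const struct node *head) {
        long total = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            total += p->value;   /* each step can be a cache miss */
        return total;
    }

An array holding the same values would be read sequentially, which is exactly what the cache is good at.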
Modern CPUs have several levels of slower/larger/cheaper cache, possibly including separate caches for instructions and data. Our Skylake example: 32 kB L1 data + 32 kB L1 instruction cache per core, 256 kB L2 per core, and 8 MB L3 shared by all cores. [†]
With multiple levels of cache, the penalty for accessing less-recently-used data might be less than a trip to memory, but still worse than hitting L1 cache.
The more local you can be in your memory accesses, the better.
One row in our 2D array example used 20000 ints × 4 bytes/int = 78 kB. Jumping to the next row was likely always an L1 cache miss.
The whole array was 1.5 GB, so we went to memory rarely when iterating across rows, and usually/often when iterating columns.
The lesson: it really matters how you access memory in your code.
If you're iterating through data, you need to know how it's arranged in memory.
Also, look back at the struct example: more unrelated data in one array will mean less cache locality.
The potential speedups over the dumbest single-threaded C code, roughly, from the numbers above: ×4 from SIMD, ×4 from keeping the execution units busy, ×4 from using every core, and ×20 from cache-friendly memory access.
× 1280 between getting all of them wrong and right. That's a definite exaggeration for any real code, but knowing and using your processor is important.
Why is this in CMPT 383? Because programming languages can…
Consider this equivalent C and Python code that adds one to each array element:
int arr[1000];
// ... fill in array
for( int i=0; i<1000; i++ ) {
    arr[i] += 1;
}
import numpy
arr = numpy.empty(1000, dtype=int)
# ... fill in array
arr += 1
The C code expresses exactly how to calculate, but not what's actually going to happen.
Some things the processor does are out of any programmer's hands, because they're complicated and incompatible between processors: pipelining, out-of-order execution, branch prediction, and speculative execution.
But a programmer could write code that is better or worse for these.
Some things are under control of the programmer, and should be used effectively: memory layout and access patterns, SIMD-friendly calculations, and multiple threads across multiple cores.
Programming languages should help. Maybe C is not a low-level language.