The story we have been telling ourselves about the processor so far (and what languages like C make us think the processor is doing):
Only the first is true.
Recall our overly-simplified look at a processor. We had an ALU (and surrounding circuitry):
The results of whatever calculation have to be waiting at their destination (register inputs, write to memory, etc) at the moment they are read.
The processor needs a clock to synchronize the circuit. The clock speed has to be slow enough for the calculations to propagate through the circuit by the time they are needed (≈start of the next cycle).
On the other hand, we would like the clock to be as fast as possible so we can get a lot of work done per second.
One possibility: only have really simple instructions. Then we could have a fast clock. Yes, but it's nice having instructions that do interesting things, and there's some limit to how simple
instructions can be.
Another possibility: make the circuit physically smaller. Then signals would propagate faster through them. Again, there's a physical limit here.
Another idea: if the ALU (or some other logic circuit) takes too long, break it up into multiple steps, and let the steps happen across multiple cycles.
We can put registers between each part of the circuit to hold the intermediate results. Then in each cycle, part of the calculation can happen; we load those registers with the result of that stage; those values become the input to the next stage of the calculation.
Something like this…
If we arranged the circuit like that, our cycle time could be ≈1/4 as long.
At any moment, there would be four calculations in-progress and we would still get one calculation done in each cycle.
Basically, the instructions are following each other through the circuit like an assembly line or conveyor belt.
Sounds great. We'll call it a pipeline. That imagined architecture is a four-stage pipeline
.
Generally, the pipeline would include more than the ALU: the instruction fetch and decode can be (multiple) stages in the pipeline. So could writing the result to registers/memory.
Our instructions would march through the processor like this (time increasing to the right):
That would be great if every instruction was independent:
mov $3, %rax
add %rbx, %rcx
imul $12, %rdx
inc %r10
Those four operations could go through our imagined pipeline with no problem.
But a lot of our code is more like this, where one instruction depends on the results of the instruction before it (in %rax
here):
mov $3, %rax
add %rax, %rcx
imul $12, %rdx
inc %r10
Now what? We can't start the add
until the mov
is done. If we put it in the pipeline before then, wouldn't the old
value of %rax
still be there?
Ignoring that problem momentarily…
Modern processors have pipelines (or are pipelined
). Haswell CPUs have pipelines with 14–19 stages. Other modern processors have similar pipeline lengths.
That allows faster clock speeds and lets more instructions run per second.
Let's return to this example code where %rax
is used immediately after being changed, and imagine a four-stage pipeline.
mov $3, %rax
add %rax, %rcx
imul $12, %rdx
inc %r10
This is a hazard: a case where the pipeline can't correctly do the next instruction because of something in-progress isn't done. This is a data hazard because we're waiting on data (the new %rax
value).
One solution that a processor could implement: if an instruction needs a result that's mid-calculation (like the add
there), just pause.
Waiting to put instructions in the pipeline is a stall, or putting a bubble into the pipeline.
Imagining we only need to delay the add
one cycle, it would execute something like this:
A stall generally doesn't have to be the full length of the pipeline: some of the stages could be done even if nearby instructions depend on each other.
e.g. fetching and decoding the instruction can be done before worrying about operands; it might be possible for a result to be sent into the next instruction as input before it's actually written to a register.
Stalling the pipeline lets instructions execute correctly, but it would be nice to get more instructions out of the pipeline as quickly as possible.
Another option (that processors actually do): out of order execution. Processors will look ahead for instructions that don't depend on pending calculations and start those.
They will actually look ahead hundreds of instructions for work they can do. [192 instructions? Agner Fog's microarchitecture guide, §10.1]
Can we see a data stall happening? I think so, yes.
Let's sum an array of float
the usual way.
#define DATA_T float
DATA_T array_sum_1(DATA_T* arr, uint64_t n) {
DATA_T total = 0;
for (uint64_t i = 0; i < n; i++) {
total += arr[i];
}
return total;
}
≈0.40 ns/element.
I'm going to unroll the loop a step (and assume length is divisible by two):
DATA_T array_sum_2(DATA_T* arr, uint64_t n) {
DATA_T total = 0;
for (uint64_t i = 0; i < n - 1; i += 2) {
total += arr[i];
total += arr[i + 1];
}
return total;
}
Still ≈0.40 ns/element.
The same, but using two accumulators:
DATA_T array_sum_3(DATA_T* arr, uint64_t n) {
DATA_T total1 = 0, total2 = 0;
for (uint64_t i = 0; i < n - 1; i += 2) {
total1 += arr[i];
total2 += arr[i + 1];
}
return total1 + total2;
}
Now ≈0.22 ns/element.
I can't prove that's because of a data stall, but I can't think of any other explanation.
I think the original code spends a lot of time waiting for the total
to be updated so the next
operation can happen.+=
The effect disappears with int64_t
arrays: maybe the pipeline for integer operations is shorter or better optimized for adjacent operations?
This loop-unrolled version is also faster: it makes half as many read/writes to (the register holding) total
:
DATA_T array_sum_4(DATA_T* arr, uint64_t n) {
DATA_T total = 0;
for (uint64_t i = 0; i < n - 1; i += 2) {
total += arr[i] + arr[i + 1];
}
return total;
}
≈0.22 ns/element.
for (uint64_t i = 0; i < n; i++) {
total += arr[i];
}
The effect decreases if there's a non-tiny amount of work happening in the loop. If there's enough happening that several instructions are necessary, then the running times converge.
Please understand this as a lecture demo, not programming advice. There are better ways to deal with this problem: understanding why the optimizer isn't doing a critical optimization and expressing your intent properly in code or at the command line.
The right optimization flags to the compiler can get all of these running in ≈0.1 ns/element, but that's a floating point demo.
Remember our first-ever conditional branch?
max:
cmp %rsi, %rdi
jl a_is_less
mov %rdi, %rax
ret
a_is_less:
mov %rsi, %rax
ret
What instruction should go in the pipeline after the jl
(mov %rdi
or mov %rsi
)?
The answer: we don't know and neither does the processor.
That's a branch hazard or a control hazard: after a conditional branch, we can't know what instructions should be put into the pipeline until the branch is decided.
The same obvious option as with a data hazard: stall.
Again, would make execution correct, but we have lots of conditional branches and pipelines are long.
And once again, modern processors actually do something more clever: guess.
The processor's branch predictor does makes a prediction. It guesses which path will be taken (jump or no jump) and speculatively puts those instructions into the pipeline.
If it was right, great. No penalty for the branch. (I'm imagining the first instruction is a correctly-predicted conditional branch.)
If it guesses wrong, it's a branch misprediction. Then the post-branch instructions that were put in the pipeline must be discarded and the correct instructions must be started.
In a diagram, a misprediction would be something like:
A branch misprediction is generally worse than a stall for a data hazard: the instructions that were started have to be discarded and other instructions must be started at the start of the pipeline.
Branch mispredictions are also much more likely to affect running time of real code and more likely to affect our lives as a programmer.
Predictions are based on past behaviour at that branch: a conditional branch with consistent behaviour will be predicted well and basically costs nothing.
e.g. the conditional jump at the end of a loop will be taken almost every time (except the last iteration) so it's predictable and should be fast.
But what if a branch is unpredictable? Let's add up only negative values from an array:
DATA_T sum_negative(DATA_T* arr, uint64_t n) {
DATA_T total = 0;
for (uint64_t i = 0; i < n; i++) {
if ( arr[i] < 0 ) {
total += arr[i];
}
}
return total;
}
Recall the array sum example: ≈0.40 ns/element to add up every element. We know that comparing to zero is easy (for integers, but it's also true for floating point), so adding up just negatives should be overall less work
than summing the whole array.
For an array of values that are randomly half positive and half negative, this takes ≈2.9 ns/element. Less arithmetic, but several times longer execution.
Maybe comparisons are actually just really slow? Same code, but on an array of all-negative values takes ≈0.40 ns/element. Exactly the same code that took ≈2.9 ns/element, but same time as the loop without the if
.
An unpredictable branch is a huge cost.
How good is the branch predictor in modern processors?
Shockingly good.
We already saw this code taking very different times to run: 0.40 ns/element when the if
body is always executed; 2.9 when it's random and there are a lot of mispredictions.
DATA_T sum_negative(DATA_T* arr, uint64_t n) {
DATA_T total = 0;
for (uint64_t i = 0; i < n; i++) {
if ( arr[i] < 0 ) {
total += arr[i];
}
}
return total;
}
What if we run our array through this, making every other element negative: still half negative, but with a pattern?
for (uint64_t i = 0; i < LENGTH; i++) {
if ( i % 2 == 0 ) {
arr[i] = fabs(arr[i]);
} else {
arr[i] = -fabs(arr[i]);
}
}
Then sum_negative
takes ≈0.30 ns/element.
The predictor figured out the pattern of jump, no jump, jump, no jump, …
and started predicting it.
Apparently, Haswell CPUs can predict repetitive patterns of jumps of at least 29 branches
[Agner Fog's microarchitecture guide, §3.8]
That's pretty good for a lump of silicon and electrons.
So if possible, arrange your code so that branches are predictable.
That's probably easy for the branches around loop conditions: most loops run many times so the branch predictor will quickly figure out to predict continue the loop
.
For an if
/else
, it's less clear. If performance really matters, maybe it's something to consider.
The most reliable way to avoid unpredictable branches: don't have branches.
Maybe the problem can be done with fewer if
/else
tests than you think. Maybe there are tools to help…
A problem to experiment with: given an array of integers (and its length), count the number of even values. In C:
uint64_t count_even_c(int64_t* array, uint64_t n) {
uint64_t count = 0;
for (uint64_t i = 0; i < n; i++) {
if (array[i] % 2 == 0) {
count += 1;
}
}
return count;
}
I'm going to switch to assembly so we can see exactly what's going on.
Registers I'm using:
%rdi
: array pointer%rsi
: length%rcx
: loop counter%rax
: accumulator (the count)%rdx
: temporaryThe loop body of my count_even_jump
, a direct translation of that logic (using
for mod 2):a & 1
testq $1, (%rdi, %rcx, 8)
jnz cej_not_even
inc %rax
cej_not_even:
inc %rcx
Running time: ≈2.6 ns/element (on a random array where the branch is unpredictable).
For this specific problem, I can make the observation that a & 1
will be 0 for even numbers and 1 for odd numbers. I could use that to count the odd numbers:
mov (%rdi, %rcx, 8), %rdx
and $1, %rdx
add %rdx, %rax
And at the last moment, turn count
into length-count
to get the even numbers:
neg %rax
add %rsi, %rax
ret
Running time: ≈0.2 ns/element: a ∼15× speedup.
But not every problem has a convenient modulus trick to get the right result. There's a useful tool that we can use to conditional behaviour without conditional branches.
The conditional move instructions do a move operation (e.g. copy a value from one register to another), but only if some condition is true (e.g. the Z
flag is set).
The conditions we can check are the same as for the conditional branches, using the same status flags as the conditional branches.
The cmovne
instruction is conditional move if not equal
(the conditional move counterpart of jne
). So this assembly:
cmp %rsi, %rdi
cmovne %r10, %r11
… is roughly equivalent to this C:
if ( rdi != rsi ) { r11 = r10; }
No jump instruction needed for the if
.
That first conditional branch we did was to find the max of two integers. This is an equivalent function with a conditional move:
max:
mov %rdi, %rax
cmp %rdi, %rsi
cmovg %rsi, %rax
ret
i.e. unconditionally copy the first argument to %rax
. If the second argument was larger, copy it to %rax
(overwriting the previous value).
Back to the count evens
problem… The conditional move instructions can't take an immediate argument (only a register or memory value), so I'll set a constant 1 register:
mov $1, %r11
Then the body of the loop can become:
mov $0, %rdx
testq $1, (%rdi, %rcx, 8)
cmovnz %r11, %rdx
add %rdx, %rax
i.e. get %rdx
to be 0 or 1, whichever we'd like to add to the total; add it to the total.
mov $0, %rdx
testq $1, (%rdi, %rcx, 8)
cmovnz %r11, %rdx
add %rdx, %rax
Running time: ≈0.3 ns/element.
Close to the
solution, but keeps the a & 1
conditional
behaviour, so is a more general technique.
This is another case where the C compiler might be able to save us and convert an if
/else
into a conditional move or other better implementation.
Whether or not this will be automatic is going to depend on the complexity of the code and the compiler.
For counting even numbers: the C code several slides back takes ≈0.2 ns/element if compiled with gcc -O2
, about the same as our best assembly code.
With gcc -O3
, it takes ≈0.1 ns/element. Once again, the compiler is a better assembly programmer than us.
With clang -O3
, it's closer to ≈0.07 ns/element.
A summary of the count even numbers
running times, in picoseconds per element with two significant figures, and looking at different array contents:
Method | Random | All Even | All Odd |
---|---|---|---|
Conditional branch | 2600 | 250 | 220 |
& and + | 190 | 190 | 190 |
Conditional move | 290 | 290 | 290 |
C with gcc -O2 | 190 | 190 | 190 |
C with gcc -O3 | 95 | 95 | 95 |
C with clang -O3 | 75 | 75 | 75 |
In our simple CPU sketch, there was an ALU: operands came in and a single result came out. We have been describing the pipelined ALU as if there's one of them in a processor.
There's more complexity here too…
Modern CPUs have multiple execution units. Each is a pipelined ALU-like circuit, and each can be working on an instruction separately. A processor with multiple execution units is superscalar.
The execution units can be specialized: each will be able to do only certain types of operations.
Haswell has eight execution units: see the block diagram (the EUs
).
There are several execution units that can do most integer operations, but only one for integer multiplication; several for floating point, one that can do lea
operations; etc.
This is why the pipeline length is a range: different execution units have different pipelines.
Because of the multiple execution units, processors can complete more than one instruction per cycle (per core: this is all still within one CPU core).
The programmer doesn't have to do anything: the control/decode logic in the processor will do its best to choose instructions that will keep the various EUs busy.
Some EUs will almost certainly be idle at any given moment, but the processor can still be working on multiple instructions at the same time.
e.g. in all of our code so far, any floating point EUs aren't getting used.
Haswell: the typical throughput of the whole design is four instructions per clock
or fewer. [Agner Fog's microarchitecture guide §10.9]
We can now at least guess why this loop (with probably 5 instructions: mov
, add
, inc
, cmp
, jl
) can be done in 2 cycles per iteration.
for (uint32_t i = 0; i < length; i++) {
total += data[indices[i]];
}
Multiple execution units were probably working on different instructions concurrently; instructions were being executed out of order to keep execution units busy. All we saw was iterations happening in <5 cycles.
I think the operations in that loop would be categorized:
indices[i]
: load datadata[…]
: load data (part of the add
)total +=
: integeri++
: integeri < length
: integerfor
: branchMy experience: I know my processor is superscalar, but it's very hard to directly observe what's happening, other than things getting done >1 instruction per cycle.
In theory, there are code optimizations that can be done for the superscalar architectures (to make better use of more execution units simultaneously), but it's a quite niche task.
I suspect some of the examples in lab with weird outcomes (or that I didn't use in labs because they were too weird) were because of interactions between the specific instructions and the multiple execution units.
e.g. some instructions can appear to take ≈zero time because they are relatively independant of other nearby instructions and there is an execution unit to handle them sitting idle.
Minor detail: the execution units are actually working on microoperations
(μOPs) not instructions as we or the compiler write them. Instructions are decoded into (one or more) microoperations before they hit the execution units.
e.g. I suspect the idiv
instruction takes many μOPs and add
takes few.
That detail probably won't affect us, but you might see the term.
Superscalar processors allow a trick you might have heard of: simultaneous multithreading (SMT) or hyperthreading.
The goal: have a single CPU core work on multiple threads concurrently.
Very briefly: a thread is execution happening within a program. Each thread has its own collection of registers (including the instruction pointer and instruction register: can be running at a different place than other threads) and can get work done mostly independently of other threads. A process could have one thread. It could have many.
There are so many execution units in modern processors, that some are probably sitting idle at any moment.
If everything is perfect, SMT/hyperthreading can run two threads at full speed on a single CPU core. e.g. one thread that's doing nothing but integer operations, and another that's doing nothing but floating point.
It might get only one thread worth of work done, if both threads are waiting for the same EUs.
In practice, hyperthreaded work gets somewhere between 100% and 200% speed per core, depending on the instructions the threads are trying to run.
This is yet another factor that makes the performance of real code difficult to predict/measure: it's hard to be sure what fraction of the processor (core) we're actually working with.
Or maybe processor designers are starting to decide simultaneous multithreading isn't always worth the tradeoff of speed/complexity/power consumption.
Some modern processors have multiples cores that are not identical, i.e. are heterogeneous.
In theory, these could be completely different (e.g. a CPU and a GPU on the same chip), but you're likely to see it directly as cores that implement the same instruction set, but with different performance.
This is done because it's possible to design slower cores that use less power. Then if there's not much computation happening, it can be sent to the slower cores and save power/battery/heat.
This might involve a slower clock speed, fewer execution units, less cache, no hyperthreading, etc. Some examples you might encounter:
As a programmer, high- and low-performance cores aren't usually your problem. The OS can assign threads as appropriate.
But, you might need to keep in mind when considering performance.
e.g. my 16 core processor can run 24 threads: 8 hyperthreaded P-cores × 2 threads + 8 non-hyperthreaded E-cores. If I start 24 threads in a program, some are going to be on the slower cores (but may moved between cores by the OS).
I can't assume that each thread is going to get the same amount of work done.
Modern CPUs are amazingly complex systems.
Multiple instructions are in progress in two ways: the multi-stage execution of the pipeline and the concurrent execution of the superscalar architecture.
They aren't even necessarily executed in the order we gave them. It might share capacity with another thread without us knowing.
One of the problems: programming languages present us with an abstraction that looks like one operation at a time, one after the other, as we the (very smart) programmer arranged them.
Much of that is also true in assembly, where we imagine talking directly to the processor. The processor is happy to try to outsmart us.
In an average line of code, we can ignore all of this: the compiler will do what it does, and the processor will do its thing. Write beautiful, readable, idomatic code, and trust the system.
But sometimes details matter, and we have to know what's happening behind the scenes. Cache (memory locality) and the branch predictor (fewer or predictable branches) are the most common details to leak through.