Processor Tricks

Processor Tricks

The story we have been telling ourselves about the processor so far (and what languages like C make us think the processor is doing):

  • It executes instructions,
  • …one at a time,
  • …and ≤1 gets done each cycle.

Only the first is true.

The Pipeline

Recall our overly-simplified look at a processor. We had an ALU (and surrounding circuitry):

ALU with a register

The results of whatever calculation is happening have to be waiting at their destination (register inputs, a memory write, etc.) at the moment they are read.

The Pipeline

The processor needs a clock to synchronize the circuit. The clock speed has to be slow enough for the calculations to propagate through the circuit by the time they are needed (≈start of the next cycle).

On the other hand, we would like the clock to be as fast as possible so we can get a lot of work done per second.

The Pipeline

One possibility: only have really simple instructions. Then we could have a fast clock. Yes, but it's nice having instructions that do interesting things, and there's some limit to how simple instructions can be.

Another possibility: make the circuit physically smaller. Then signals would propagate through it faster. Again, there's a physical limit here.

The Pipeline

Another idea: if the ALU (or some other logic circuit) takes too long, break it up into multiple steps, and let the steps happen across multiple cycles.

We can put registers between each part of the circuit to hold the intermediate results. Then in each cycle, part of the calculation can happen; we load those registers with the result of that stage; those values become the input to the next stage of the calculation.

Something like this…

The Pipeline

pipelined ALU

The Pipeline

If we arranged the circuit like that, our cycle time could be ≈1/4 as long.

At any moment, there would be four calculations in-progress and we would still get one calculation done in each cycle.

Basically, the instructions are following each other through the circuit like an assembly line or conveyor belt.

The Pipeline

Sounds great. We'll call it a pipeline. That imagined architecture is a four-stage pipeline.

Generally, the pipeline would include more than the ALU: the instruction fetch and decode can be (multiple) stages in the pipeline. So could writing the result to registers/​memory.

The Pipeline

Our instructions would march through the processor like this (time increasing to the right):

CPU pipeline
Wikimedia Commons File:Pipeline,_4_stage.svg

The Pipeline

That would be great if every instruction was independent:

mov $3, %rax
add %rbx, %rcx
imul $12, %rdx
inc %r10

Those four operations could go through our imagined pipeline with no problem.

The Pipeline

But a lot of our code is more like this, where one instruction depends on the results of the instruction before it (in %rax here):

mov $3, %rax
add %rax, %rcx
imul $12, %rdx
inc %r10

Now what? We can't start the add until the mov is done. If we put it in the pipeline before then, wouldn't the old value of %rax still be there?

The Pipeline

Ignoring that problem momentarily…

Modern processors have pipelines (or are pipelined). Haswell CPUs have pipelines with 14–19 stages. Other modern processors have similar pipeline lengths.

That allows faster clock speeds and lets more instructions run per second.

Data Hazards

Let's return to this example code where %rax is used immediately after being changed, and imagine a four-stage pipeline.

mov $3, %rax
add %rax, %rcx
imul $12, %rdx
inc %r10

This is a hazard: a case where the pipeline can't correctly do the next instruction because something in-progress isn't done. This is a data hazard because we're waiting on data (the new %rax value).

Data Hazards

One solution that a processor could implement: if an instruction needs a result that's mid-calculation (like the add there), just pause.

Waiting to put instructions in the pipeline is a stall, or putting a bubble into the pipeline.

Data Hazards

Imagining we only need to delay the add one cycle, it would execute something like this:

pipeline stall

Data Hazards

A stall generally doesn't have to be the full length of the pipeline: some of the stages could be done even if nearby instructions depend on each other.

e.g. fetching and decoding the instruction can be done before worrying about operands; it might be possible for a result to be sent into the next instruction as input before it's actually written to a register.

Data Hazards

Stalling the pipeline lets instructions execute correctly, but it would be nice to get more instructions out of the pipeline as quickly as possible.

Data Hazards

Another option (that processors actually do): out of order execution. Processors will look ahead for instructions that don't depend on pending calculations and start those.

They will actually look ahead hundreds of instructions for work they can do. [192 instructions? Agner Fog's microarchitecture guide, §10.1]

Data Hazards

Can we see a data stall happening? I think so, yes.

Let's sum an array of float the usual way.

#define DATA_T float
DATA_T array_sum_1(DATA_T* arr, uint64_t n) {
    DATA_T total = 0;
    for (uint64_t i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

≈0.40 ns/element.
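
(The timing harness isn't shown here. A minimal sketch of how per-element times like this could be measured, assuming a POSIX system with clock_gettime; the LENGTH constant and the random fill are my assumptions, not the original benchmark.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DATA_T float
#define LENGTH 10000000ULL      /* assumed array size, big enough to time */

DATA_T array_sum_1(DATA_T* arr, uint64_t n);   /* the function above */

int main(void) {
    DATA_T* arr = malloc(LENGTH * sizeof(DATA_T));
    for (uint64_t i = 0; i < LENGTH; i++) {
        arr[i] = (DATA_T)rand() / RAND_MAX - 0.5f;   /* arbitrary values */
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    DATA_T total = array_sum_1(arr, LENGTH);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("total=%g  %.3f ns/element\n", total, ns / LENGTH);
    free(arr);
    return 0;
}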

Data Hazards

I'm going to unroll the loop a step (and assume length is divisible by two):

DATA_T array_sum_2(DATA_T* arr, uint64_t n) {
    DATA_T total = 0; 
    for (uint64_t i = 0; i < n - 1; i += 2) {
        total += arr[i];
        total += arr[i + 1];
    }
    return total;
}

Still ≈0.40 ns/element.

Data Hazards

The same, but using two accumulators:

DATA_T array_sum_3(DATA_T* arr, uint64_t n) {
    DATA_T total1 = 0, total2 = 0;
    for (uint64_t i = 0; i < n - 1; i += 2) {
        total1 += arr[i];
        total2 += arr[i + 1];
    }
    return total1 + total2;
}

Now ≈0.22 ns/element.

Data Hazards

I can't prove that's because of a data stall, but I can't think of any other explanation.

I think the original code spends a lot of time waiting for the total to be updated so the next += operation can happen.

The effect disappears with int64_t arrays: maybe the pipeline for integer operations is shorter or better optimized for adjacent operations?

Data Hazards

This loop-unrolled version is also faster: it makes half as many read/​writes to (the register holding) total:

DATA_T array_sum_4(DATA_T* arr, uint64_t n) {
    DATA_T total = 0;
    for (uint64_t i = 0; i < n - 1; i += 2) {
        total += arr[i] + arr[i + 1];
    }
    return total;
}

≈0.22 ns/element.

Data Hazards

for (uint64_t i = 0; i < n; i++) {
    total += arr[i];
}

The effect decreases if there's a non-tiny amount of work happening in the loop. If there's enough happening that several instructions are necessary, then the running times converge.
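
For example (a hypothetical heavier loop body, not the one actually measured): with a few extra operations per element, there is independent work the processor can do while the pending += on total finishes, so the single-accumulator version is no longer limited by that dependency.

for (uint64_t i = 0; i < n; i++) {
    /* the multiply and scaling on arr[i] don't depend on the previous
       total, so that work can proceed while the += is still pending */
    total += arr[i] * arr[i] * 0.5f + arr[i];
}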

Data Hazards

Please understand this as a lecture demo, not programming advice. There are better ways to deal with this problem: understanding why the optimizer isn't doing a critical optimization and expressing your intent properly in code or at the command line.

The right optimization flags to the compiler can get all of these running in ≈0.1 ns/element, but that's a floating point demo.

Branch Hazards

Remember our first-ever conditional branch?

max:
    cmp %rsi, %rdi
    jl a_is_less
    mov %rdi, %rax
    ret
a_is_less:
    mov %rsi, %rax
    ret

What instruction should go in the pipeline after the jl (mov %rdi or mov %rsi)?

The answer: we don't know and neither does the processor.

Branch Hazards

That's a branch hazard or a control hazard: after a conditional branch, we can't know what instructions should be put into the pipeline until the branch is decided.

The same obvious option as with a data hazard: stall.

Again, that would make execution correct, but we have lots of conditional branches and pipelines are long.

Branch Hazards

And once again, modern processors actually do something more clever: guess.

The processor's branch predictor makes a prediction. It guesses which path will be taken (jump or no jump) and speculatively puts those instructions into the pipeline.

Branch Hazards

If it was right, great. No penalty for the branch. (I'm imagining the green instruction is a correctly-predicted conditional branch.)

CPU pipeline

Branch Hazards

If it guesses wrong, it's a branch misprediction. Then the post-branch instructions that were put in the pipeline must be discarded and the correct instructions must be started.

Branch Hazards

In a diagram, a misprediction would be something like:

pipeline misprediction

Branch Hazards

A branch misprediction is generally worse than a stall for a data hazard: the instructions that were started have to be discarded and other instructions must be started at the start of the pipeline.

Branch mispredictions are also much more likely to affect the running time of real code, and more likely to affect our lives as programmers.

Branch Hazards

Predictions are based on past behaviour at that branch: a conditional branch with consistent behaviour will be predicted well and basically costs nothing.

e.g. the conditional jump at the end of a loop will be taken almost every time (except the last iteration) so it's predictable and should be fast.

Branch Hazards

But what if a branch is unpredictable? Let's add up only negative values from an array:

DATA_T sum_negative(DATA_T* arr, uint64_t n) {
    DATA_T total = 0;
    for (uint64_t i = 0; i < n; i++) {
        if ( arr[i] < 0 ) {
            total += arr[i];
        }
    }
    return total;
}

Branch Hazards

Recall the array sum example: ≈0.40 ns/element to add up every element.

For an array of values that are randomly half positive and half negative, this takes ≈2.9 ns/element. Less arithmetic, but several times longer execution.

Branch Hazards

Maybe comparisons are just really slow? No: the same code on an array of all-negative values takes ≈0.40 ns/element. Exactly the code that took ≈2.9 ns/element on random data now runs in the same time as the loop without the if.

An unpredictable branch is a huge cost.

Branch Predictor

How good is the branch predictor in modern processors?

Shockingly good.

Branch Predictor

We already saw this code taking very different times to run: 0.40 ns/element when the if body is always executed; 2.9 when it's random and there are a lot of mispredictions.

DATA_T sum_negative(DATA_T* arr, uint64_t n) {
    DATA_T total = 0;
    for (uint64_t i = 0; i < n; i++) {
        if ( arr[i] < 0 ) {
            total += arr[i];
        }
    }
    return total;
}

Branch Predictor

What if we run our array through this, making every other element negative: still half negative, but with a pattern?

for (uint64_t i = 0; i < LENGTH; i++) {
    if ( i % 2 == 0 ) {
        arr[i] = fabs(arr[i]);
    } else {
        arr[i] = -fabs(arr[i]);
    }
}

Then sum_negative takes ≈0.30 ns/element.

Branch Predictor

The predictor figured out the pattern of jump, no jump, jump, no jump, … and started predicting it.

Apparently, Haswell CPUs can predict repetitive patterns of jumps of at least 29 branches [Agner Fog's microarchitecture guide, §3.8]

That's pretty good for a lump of silicon and electrons.

Branch Predictor

So if possible, arrange your code so that branches are predictable.

That's probably easy for the branches around loop conditions: most loops run many times, so the branch predictor will quickly learn to predict that the loop continues.

For an if/else, it's less clear. If performance really matters, maybe it's something to consider.
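
One classic trick (my example, not from these slides): if the data can be reordered, sorting it makes a branch like arr[i] < 0 almost perfectly predictable, since all the negative values end up first. Whether the sort pays for itself depends on how many times the array will be scanned afterwards.

#include <stdint.h>
#include <stdlib.h>

/* qsort comparison function: ascending order of floats */
static int cmp_float(const void* a, const void* b) {
    float x = *(const float*)a, y = *(const float*)b;
    return (x > y) - (x < y);
}

float sum_negative_sorted(float* arr, uint64_t n) {
    /* after sorting, every negative element comes before every
       non-negative one, so the branch below is taken for one long run
       and then never: easy for the branch predictor */
    qsort(arr, n, sizeof(float), cmp_float);
    float total = 0;
    for (uint64_t i = 0; i < n; i++) {
        if (arr[i] < 0) {
            total += arr[i];
        }
    }
    return total;
}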

Avoiding Branches

The most reliable way to avoid unpredictable branches: don't have branches.

Maybe the problem can be done with fewer if/else tests than you think. Maybe there are tools to help…

Avoiding Branches

A problem to experiment with: given an array of integers (and its length), count the number of even values. In C:

uint64_t count_even_c(int64_t* array, uint64_t n) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < n; i++) {
        if (array[i] % 2 == 0) {
            count += 1;
        }
    }
    return count;
}

Avoiding Branches

I'm going to switch to assembly so we can see exactly what's going on.

Registers I'm using:

  • %rdi: array pointer
  • %rsi: length
  • %rcx: loop counter
  • %rax: accumulator (the count)
  • %rdx: temporary

Avoiding Branches

The loop body of my count_even_jump, a direct translation of that logic (using a & 1 for mod 2):

    testq $1, (%rdi, %rcx, 8)
    jnz cej_not_even
    inc %rax
cej_not_even:
    inc %rcx

Running time: ≈2.6 ns/element (on a random array where the branch is unpredictable).

Avoiding Branches

For this specific problem, I can make the observation that a & 1 will be 0 for even numbers and 1 for odd numbers. I could use that to count the odd numbers:

mov (%rdi, %rcx, 8), %rdx
and $1, %rdx
add %rdx, %rax

And at the last moment, turn count into length-count to get the even numbers:

neg %rax
add %rsi, %rax
ret

Running time: ≈0.2 ns/element.
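
The same trick back in C (my translation of the assembly above, not code from the slides): count the odd values branch-free, then subtract from the length at the end.

#include <stdint.h>

uint64_t count_even_branchfree(int64_t* array, uint64_t n) {
    uint64_t odd = 0;
    for (uint64_t i = 0; i < n; i++) {
        /* 1 for odd values, 0 for even (two's complement, as on x86-64):
           no branch needed */
        odd += array[i] & 1;
    }
    return n - odd;     /* evens = length - odds */
}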

Conditional Moves

But not every problem has a convenient modulus trick to get the right result. There's a useful tool that we can use to get conditional behaviour without conditional branches.

The conditional move instructions do a move operation (e.g. copy a value from one register to another), but only if some condition is true (e.g. the Z flag is set).

The conditions we can check are the same as for the conditional branches, using the same status flags as the conditional branches.

Conditional Moves

The cmovne instruction is conditional move if not equal (the conditional move counterpart of jne). So this assembly:

cmp %rsi, %rdi
cmovne %r10, %r11

… is roughly equivalent to this C:

if ( rdi != rsi ) { r11 = r10; }

No jump instruction needed for the if.

Conditional Moves

That first conditional branch we did was to find the max of two integers. This is an equivalent function with a conditional move:

max:
    mov %rdi, %rax
    cmp %rdi, %rsi
    cmovg %rsi, %rax
    ret

i.e. unconditionally copy the first argument to %rax. If the second argument was larger, copy it to %rax (overwriting the previous value).
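
For comparison, a C version of the same max logic (my sketch, not from the slides); compilers will often, though not always, compile a simple ternary like this to a cmov rather than a conditional branch.

#include <stdint.h>

int64_t max_c(int64_t a, int64_t b) {
    /* simple operands and no side effects: a good candidate for the
       compiler to implement with cmovg instead of a jump */
    return (b > a) ? b : a;
}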

Conditional Moves

Back to the count evens problem… The conditional move instructions can't take an immediate argument (only a register or memory value), so I'll put a constant 1 into a register:

mov $1, %r11

Then the body of the loop can become:

mov $0, %rdx
testq $1, (%rdi, %rcx, 8)
cmovnz %r11, %rdx
add %rdx, %rax

i.e. get %rdx to be 0 or 1, whichever we'd like to add to the total; add it to the total.

Conditional Moves

mov $0, %rdx
testq $1, (%rdi, %rcx, 8)
cmovnz %r11, %rdx
add %rdx, %rax

Running time: ≈0.3 ns/element.

Close to the a & 1 solution, but keeps the conditional behaviour, so is a more general technique.
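
A related C form (my sketch): use the comparison result directly as a 0-or-1 value. Compilers typically implement this with a flag-setting instruction (like sete) or a conditional move rather than a jump, so it behaves much like the hand-written cmov loop.

#include <stdint.h>

uint64_t count_even_nobranch(int64_t* array, uint64_t n) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < n; i++) {
        /* the == produces 0 or 1 directly; no if statement needed */
        count += (array[i] & 1) == 0;
    }
    return count;
}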

Conditional Moves

This is another case where the C compiler might be able to save us and convert an if/else into a conditional move or other better implementation.

Whether or not this will be automatic is going to depend on the complexity of the code and the compiler.

Conditional Moves

For counting even numbers: the C code several slides back takes ≈0.2 ns/element if compiled with gcc -O2, about the same as our best assembly code.

With gcc -O3, it takes ≈0.1 ns/element. Once again, the compiler is a better assembly programmer than us.

With clang -O3, it's closer to ≈0.07 ns/element.

Conditional Moves

A summary of the count even numbers running times, in picoseconds per element with two significant figures, and looking at different array contents:

Method               Random   All Even   All Odd
Conditional branch     2600        250       220
& and +                 190        190       190
Conditional move        290        290       290
C with gcc -O2          190        190       190
C with gcc -O3           95         95        95
C with clang -O3         75         75        75

Superscalar Processors

In our simple CPU sketch, there was an ALU: operands came in and a single result came out. We have been describing the pipelined ALU as if there's one of them in a processor.

There's more complexity here too…

Superscalar Processors

Modern CPUs have multiple execution units. Each is a pipelined ALU-like circuit, and each can be working on an instruction separately. A processor with multiple execution units is superscalar.

The execution units can be specialized: each will be able to do only certain types of operations.

Superscalar Processors

Haswell has eight execution units: see the block diagram (the EUs).

There are several execution units that can do most integer operations, but only one for integer multiplication; several for floating point, one that can do lea operations; etc.

This is why the pipeline length is a range: different execution units have different pipelines.

Superscalar Processors

Because of the multiple execution units, processors can complete more than one instruction per cycle (per core: this is all still within one CPU core).

The programmer doesn't have to do anything: the control/​decode logic in the processor will do its best to choose instructions that will keep the various EUs busy.

Some EUs will almost certainly be idle at any given moment, but the processor can still be working on multiple instructions at the same time.

Superscalar Processors

e.g. in all of our code so far, the floating point EUs aren't getting used.

Haswell: the typical throughput of the whole design is four instructions per clock or fewer. [Agner Fog's microarchitecture guide §10.9]

Superscalar Processors

We can now at least guess why this loop (with probably 5 instructions: mov, add, inc, cmp, jl) can be done in 2 cycles per iteration.

for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}

Multiple execution units were probably working on different instructions concurrently; instructions were being executed out of order to keep execution units busy. All we saw was iterations happening in <5 cycles.

Superscalar Processors

I think the operations in that loop would be categorized:

  • indices[i]: load data
  • data[…]: load data (part of the add)
  • total +=: integer
  • i++: integer
  • i < length: integer
  • for: branch

Superscalar Processors

My experience: I know my processor is superscalar, but it's very hard to directly observe what's happening, other than things getting done >1 instruction per cycle.

In theory, there are code optimizations that can be done for superscalar architectures, but it's quite a niche task.

Superscalar Processors

I suspect some of the examples in lab with weird outcomes (or that I didn't use in labs because they were too weird) were because of interactions between the specific instructions and the multiple execution units.

e.g. imul taking ≈0 time could be because there was an execution unit sitting idle that could do it. But, I have no proof of that.

Superscalar Processors

Minor detail: the execution units are actually working on microoperations (μOPs) not instructions as we/​the compiler write them. Instructions are decoded into (one or more) microoperations before they hit the execution units.

e.g. I suspect the idiv instruction takes many μOPs and add takes few.

That detail probably won't affect us, but you might see the term.

Hyperthreading

Superscalar processors allow a trick you might have heard of: simultaneous multithreading or hyperthreading.

The goal: have a single CPU core work on multiple threads concurrently.

Hyperthreading

Very briefly: a thread is execution happening within a program. Each thread has its own collection of registers (including the instruction pointer) and can get work done mostly independently of other threads. A process could have one thread. It could have many.

Hyperthreading

There are so many execution units in modern processors that some are probably sitting idle at any moment.

A single CPU core can work on multiple (two) threads, trying to use instructions from each to keep the execution units busy. It would require a little more complexity in the control circuit, but apparently it's possible.

Hyperthreading

If everything is perfect, hyperthreading can run two threads at full speed on a single CPU core. e.g. one thread that's doing nothing but integer operations, and another that's doing nothing but floating point.

It might get only one thread's worth of work done, if both threads are waiting for the same EUs.

Hyperthreading

In practice, hyperthreaded work gets somewhere between 100% and 200% speed per core, depending on the instructions the threads are trying to run.

This is yet another factor that makes the performance of real code difficult to predict/​measure: it's hard to be sure what fraction of the processor (core) we're actually working with.

Heterogeneous Cores

Some modern processors have multiple cores that are not identical, i.e. are heterogeneous.

In theory, these could be completely different (e.g. a CPU and a GPU on the same chip), but you're likely to see it directly as cores that implement the same instruction set, but with different performance.

Heterogeneous Cores

This is done because it's possible to design slower cores that use less power. Then if there's not much computation happening, it can be sent to the slower cores to save battery/heat.

This might involve a slower clock speed, fewer execution units, less cache, no hyperthreading, etc. Some examples you might encounter:

  • Intel 12th gen and later (P-cores vs E-cores),
  • Apple Silicon (high-performance vs high-efficiency),
  • most phone processors.

Heterogeneous Cores

As a programmer, you usually don't have to worry about which threads end up on high- or low-performance cores: the OS can assign threads as appropriate.

But you might need to keep it in mind when considering performance.

Heterogeneous Cores

e.g. my 16-core processor can run 24 threads (8 hyperthreaded P-cores × 2 threads + 8 non-hyperthreaded E-cores). If I start 24 threads in a program, some are going to be on the slower cores (but may be moved between cores by the OS).

I can't assume that each thread is going to get the same amount of work done.
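
A small aside (my example, assuming Linux/POSIX): a program can at least ask how many logical processors the OS sees, though the count doesn't distinguish fast cores, slow cores, or hyperthreads.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* counts every hardware thread the OS will schedule on: each
       hyperthread of each P-core plus each E-core, all as one number */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld logical processors\n", n);
    return 0;
}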

Summary

Modern CPUs are amazingly complex systems.

Multiple instructions are in progress in two ways: the multi-stage execution of the pipeline and the concurrent execution of the superscalar architecture.

They aren't even necessarily executed in the order we gave them. The processor might even share its capacity with another thread without us knowing.

Summary

One of the problems: programming languages present us with an abstraction that looks like one operation at a time, one after the other, as we the (very smart) programmer arranged them.

Much of that is also true in assembly, where we imagine talking directly to the processor. The processor is happy to try to outsmart us.

Summary

In an average line of code, we can ignore all of this: the compiler will do what it does, and the processor will do its thing. Write beautiful, readable, idiomatic code and trust the system.

But sometimes details matter, and we have to know what's happening behind the scenes. Cache (memory locality) and the branch predictor (fewer or predictable branches) are the most common details to leak through.