We have seen a lot of C (and C++) through this course.
It was mostly doing two things: giving us a way to call assembly code, and acting as a basis to compare the performance of our assembly code.
Usually the compiler proved itself to be a better assembly programmer than us. The C was almost always faster than (or similar to) our hand-written assembly. The only two exceptions (I think, on most processors):
In the hailstone example, performance was close to the conditional-branch version. A slightly improved branch predictor, or an improved pipeline with a lower misprediction penalty, would negate the cmov benefit in this case.
For the floating point code, it would be nice if we could express "assume + is associative: I don't care about the exact rounding".
We have seen that C can usually get the same performance as hand-written assembly (or better). How can we make sure it always does?
There are many ways to write C with bad performance: bad algorithms (e.g. exponential Fibonacci algorithm, bubble sort), unpredictable branches, bad memory access (poor locality = cache misses), bad data layout (e.g. array of structs/objects), …
Now that we know those pitfalls, we can avoid them (or at least make an informed decision about avoiding/ignoring them).
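As a sketch of the data-layout point (the names here are my own, for illustration): scanning one field of an array of structs drags the other fields through the cache, while a struct-of-arrays layout keeps the scanned values contiguous.

```cpp
#include <cstddef>

// Array-of-structs: summing x also pulls every unused y into cache,
// since x and y are interleaved in memory.
struct PointAoS { float x, y; };
float sum_x_aos(const PointAoS* pts, size_t n) {
    float total = 0;
    for (size_t i = 0; i < n; i++) total += pts[i].x;
    return total;
}

// Struct-of-arrays: all the x values are contiguous, so every byte
// fetched into cache is actually used (and SIMD loads are easy).
struct PointsSoA { float* x; float* y; };
float sum_x_soa(const PointsSoA& pts, size_t n) {
    float total = 0;
    for (size_t i = 0; i < n; i++) total += pts.x[i];
    return total;
}
```

Both functions compute the same thing; only the memory layout (and therefore the cache behaviour) differs.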
Almost always, having an idea what assembly the compiler will write for us is more important than hand-writing that assembly.
Of course, this is often fairly straightforward. We have a pretty good idea that this C:
int64_t f(int64_t num) {
    return 32 * num;
}
… will turn into assembly like this:
f:
    mov %rdi, %rax
    sal $5, %rax
    ret
… and won't worry too much about it when writing C.
What if we're not sure?
e.g. do C++ vector operations get turned into SIMD instructions or not?
int32_t sum_vector(std::vector<int32_t> vec) {
    int32_t total = 0;
    for (uint64_t i = 0; i < vec.size(); i++) {
        total += vec[i];
    }
    return total;
}
Of course we could look. Compiling with -S
will get us the assembly output. Machine-generated code is never beautiful but somewhere in there, I see:
.L6:
    vpaddd (%rax), %ymm1, %ymm1
    addq $32, %rax
    cmpq %rdx, %rax
    jne .L6
The vpaddd
instruction is add packed 32-bit integers
. Yes, it's a SIMD operation.
If we change the data type to float
, the work is done by addss
, a scalar operation.
Looking at gcc -S
output directly is annoying. Luckily other people have been annoyed by it and helped us.
The Compiler Explorer at godbolt.org will run our code through a compiler and show us the assembly output, helpfully annotated.
Let's have a look at the sum_vector
code in godbolt.org.
General notes:
Under Output…, you can select AT&T/Intel assembly syntax.
In the case of our sum_vector code, we can find the vpaddd instruction in the output.
On the compilation:
-O1 creates much more readable assembly that's closer to what I wrote in C++.
clang -O3 (or -O2) gets a much more aggressive compilation: a partially-unrolled loop with several accumulators to (presumably) avoid data stalls. It doesn't make a speed difference on my processor, but maybe on some?
Other notes:
Add new… gets another compiler window for side-by-side comparison.
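A hypothetical sketch of what "several accumulators" means (the function name is mine, not the compiler's output): splitting the sum across independent chains so consecutive additions don't have to wait on each other.

```cpp
#include <cstdint>
#include <cstddef>

// Two independent accumulators: t0 and t1 form separate dependency
// chains, so their additions can overlap in the pipeline instead of
// each add waiting on the previous one.
int32_t sum_two_accumulators(const int32_t* a, size_t n) {
    int32_t t0 = 0, t1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        t0 += a[i];
        t1 += a[i + 1];
    }
    for (; i < n; i++) t0 += a[i];  // leftover element if n is odd
    return t0 + t1;
}
```

The compiler does this kind of transformation itself at higher optimization levels; writing it by hand is rarely necessary.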
The Compiler Explorer is an extremely valuable tool (and we thank Matt Godbolt for all of his work on it).
If you take nothing else from this course aside from being able to read and understand its output, then it's not a total loss.
Of course, knowing what assembly output you want requires knowing a lot of what we have discussed.
If we don't like the assembly the compiler is producing, what can we do about it?
First: pay attention to compiler optimization (more later).
If you really want to control the assembly output, it's possible to write inline assembly in C and C++. It should be considered a last-resort, but it can be done.
When inserting raw assembly code, we have to give the compiler enough information to integrate it with the code it wrote (and the registers it's using, etc.).
Besides the assembly itself, we need to say which C variables we're writing and reading, and which registers we're going to overwrite.
__asm__(
    assembly_code
    : output_operands
    : input_operands
    : registers_used
);
Let's try it. Create a few C variables and do some calculations in assembly…
uint64_t a, b, c, d;
a = 100; b = 200; c = 3; d = 4;
__asm__(
    "mov $5000, %%rax\n"
    "mov %%rax, %0\n"
    "add %2, %0\n"
    "mov %3, %1\n"
    : "+r"(c), "+r"(d)  // %0 and %1 in the assembly
    : "r"(a), "r"(b)    // %2 and %3 in the assembly
    : "rax"             // %rax is clobbered
);
assert(a == 100 && b == 200 && c == 5100 && d == 200);
Inline assembly is almost certainly a bad idea: it makes your C code non-portable (i.e. won't work on ARM); has all of the danger of assembly, plus worse interactions with the compiler; is almost certainly achievable some other way.
We saw the vectorclass library: it let us express ourselves in C++, but in a way that very carefully directs the assembly output.
Usually we can just rely on the compiler and optimizer to do its thing, but if you really care about every clock cycle, libraries like that might be worth it.
If we look at the implementation of vectorclass, in some sense it's very simple.
e.g. the code to implement +
on two Vec8f
is just an operator overload (a C++ operator+
definition) that makes one call, to _mm256_add_ps
.
What's that?
It's a compiler intrinsic: a function that has some special meaning to the compiler. Typically, they mean use this assembly instruction, or some fallback if it's not available
.
In this case, the _mm256_add_ps intrinsic adds float32 vectors. The corresponding… instruction is VADDPS.
So, it's a way to get single-precision SIMD addition.
And godbolt confirms that _mm256_add_ps
gets us exactly the vaddps
instruction.
In this case, without -march=haswell
, compilation fails: for this intrinsic, it's that instruction or nothing.
Earlier in the course, we met __builtin_popcountll
, an intrinsic for counting the number of set bits.
uint8_t popcount3(uint64_t n) {
    return (uint8_t)__builtin_popcountll(n);
}
I said this was an intrinsic that requested the popcntq
instruction, and I wasn't lying .
If we remove the -march=haswell when compiling, we get call __popcountdi2 for some other implementation provided by the compiler.
But compiler intrinsics and inline assembly are still not the tool you need for most code.
Mostly, the compiler is smart enough to turn your code into something fast. How smart is it, and how can we help it?
The optimizer will do its best to transform what you wrote into something fast (or if you ask, small).
We have been accessing compiler optimization all semester by adding switches like -O1 to our compiler command line. Each of the levels of optimization is a collection of specific optimizations that the compiler can apply to our code.
For example, there's an optimization flag -finline-functions
that is enabled as part of -O2
(and therefore also -O3
).
We could either select one of the packs of optimizations:
gcc -O1 file.c
Or turn on those optimizations individually:
gcc -fauto-inc-dec -fbranch-count-reg … file.c
The optimizations make compilation take more time, but at least the lower levels are generally worth it.
Roughly,
-O0: no optimization (the default).
-O1: basic optimizations that don't take that much more compile time.
-O2: most optimizations.
-O3: aggressively optimize, even if it greatly increases compile time or executable size.
-Og: optimize for debugging, so assembly behaviour maps closely to the C code. ≈ -O1
-Os: optimize for a smaller executable. ≈ -O2
Historically, -O3
could sometimes cause worse performance because it's so aggressive with its choices. That's less true now (but some of you saw it on Lab 9.)
The traditional advice was to use -O2 for release builds because it was fast but safe.
I have been suggesting -O3
because it turns on auto-vectorization (later).
Typically, Clang is a little more aggressive about what it does in the lower optimization levels.
Everything we're about to see is true in optimizing compilers in general, with a few differences about what happens at -O1
vs -O2
vs -O3
.
Let's start with some basics. Before this course, you probably thought variables lived in memory, maybe on the stack.
int64_t variable_usage(int64_t x, int64_t y) {
    int64_t a = x + 54941 * y;
    int64_t b = -a * y;
    return b;
}
They might (and do with no optimization), or they might be kept in registers (and are with -O1).
But note in the -O1 compilation: the code doesn't actually calculate or store the variable I called a. It saved one instruction by calculating the negation of a, as if I wrote:
int64_t algebra(int64_t x, int64_t y) {
    int64_t negative_a = -54941 * y - x;
    int64_t b = negative_a * y;
    return b;
}
The compiler is also happy to evaluate constant expressions at compile time.
There's no need to get out a calculator when writing code. Write expressive code and let the compiler do the arithmetic.
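For instance (a small sketch of my own; the constant isn't from the course examples), the compiler folds constant arithmetic at compile time, so the generated code just loads the final number:

```cpp
#include <cstdint>

// The multiplications are done by the compiler, not at runtime:
// the generated assembly simply returns the constant 604800.
int64_t seconds_per_week() {
    return 7 * 24 * 60 * 60;
}
```

Writing `7 * 24 * 60 * 60` documents where the number came from, at zero runtime cost.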
The basic tricks we had to do manually in assembly will be done by the compiler automatically.
int64_t const_mult_1(int64_t a) {
    return a * 16;
}
int64_t const_mult_1a(int64_t a) {
    return a << 4;
}
Those compile to the same assembly.
And some more we didn't think of.
int64_t const_mult_2(int64_t a) {
    return a * 18;
}
That compiles in an unexpected way:
leaq (%rdi,%rdi,8), %rax
addq %rax, %rax
ret
The lea
instruction basically does simple math. The address
referred to by (%rdi,%rdi,8)
is rdi + rdi*8
, or 9 times %rdi
.
So the compiler got an integer addition and multiplication and put the result in a different register by slightly-abusing lea
. That's clever. We might have written this instead of that lea
:
mov %rdi, %rax
shl $3, %rdi
add %rdi, %rax
We can try to outsmart the compiler and pre-optimize like this:
int32_t const_mult_3(int32_t a) {
    return a * 65599; // 65599 == 2^16 + 2^6 - 2^0
}
int32_t const_mult_3a(int32_t a) {
    return (a << 16) + (a << 6) - a;
}
Nope. The compiler decides the multiplication is cheaper than the shifts and adds, and fixes our over-eager and less-readable code. [Example from Matt Godbolt @31:00.]
The compiler will also do things like moving expressions out of a loop if they don't need to be repeated (loop-invariant code motion). e.g. the y*z here:
void loop_invariant_1(int* a, unsigned n, int y, int z) {
    for (unsigned i = 0; i < n; i++) {
        a[i] = 6 * i + y * z;
    }
}
The multiplication is done outside the loop. [Example adapted from Wikipedia.]
The optimizer will sometimes replace a branch with a cmov
(infrequently, in my experience):
int cond_move_1(int x, int y) {
    int result;
    if (y <= 9) {
        result = x + 1;
    } else {
        result = x - 1;
    }
    return result;
}
In a simple case like this, we get a conditional move, but not if things get a little more complex.
We saw (but didn't really talk about) the cost of calling a function: we at least have the cost of pushing the return address (by call
) and popping it (by ret
). Maybe also preserving registers, etc.
So, maybe we don't want to do a function call for every tiny function we write. The compiler can inline simple functions: basically insert their code in place of the function call.
double logistic(double x) {
    return 1 / (1 + exp(-x));
}
void logistic_fill(double* arr, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        arr[i] = logistic((double)i);
    }
}
The logistic function is not called at -O2, but is at -O1.
But exp
is called: it's defined in a separate file (math.h
and friends), so can't be inlined (also, it might not be worth it, depending on its definition).
The lesson, I think, in these examples: write beautiful, expressive, clear, idiomatic code. The optimizer can figure out how to rearrange it to be fast.
We would really like to be able to use the SIMD functionality of our processors without writing assembly or having to use low-level tools like intrinsics (or vectorclass or similar).
We have seen how much performance can be gained with SIMD instructions. Keeping an eye on how the compiler uses them on our code can make a big difference.
Some SIMD things happen at -O2
, but mostly at -O3
.
void add_four_double(double* a, double* b, double* __restrict c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}
Compiling this generates a vaddpd
instruction only if we add the __restrict
qualifier to the third argument.
Saying __restrict
promises that there's no way to modify that value from anywhere else: no aliases. If there were aliases (e.g. c == a + 1
), then turning this into a single vaddpd
would be incorrect.
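To see why that would be incorrect, here's a sketch (a non-__restrict variant with test values of my own) where c == a + 1 makes each store feed the next load, so the sequential result differs from a four-at-once vector add:

```cpp
// Without __restrict the compiler must assume c might alias a or b,
// so it can't safely do all four additions as one vaddpd: each store
// through c might change a value a later line reads through a.
void add_four(const double* a, const double* b, double* c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}
```

With buf = {1, 1, 1, 1, 1}, b = {10, 10, 10, 10}, and c == buf + 1, each sum reads the value the previous line just wrote, so buf ends as {1, 11, 21, 31, 41}. A single vector load of a's four values up front would instead have produced {1, 11, 11, 11, 11}.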
We meet a limitation of the abstraction presented by the language: C wasn't designed to express vector-like thoughts, so it can take some work to convince the compiler it's okay. (Same for C++ and most other languages.)
We have seen that summing an array/vector:
DATA_T array_sum_1(DATA_T* array, uint64_t length) {
    DATA_T sum = 0;
    for (size_t i = 0; i < length; i++) {
        sum += array[i];
    }
    return sum;
}
… uses SIMD instructions for integer types but not floating point types, and not until -O3
.
That's going to be a fairly common pattern: you usually don't get SIMD stuff lower than -O3
and optimization options are more restricted for floating point operations.
It sure would be nice to be able to convince the compiler it's okay to rearrange floating point operations…
We saw before: the compiler won't rearrange/reorder/apply algebra to floating point operations because it can cause a different result. It's not allowed to do that.
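A classic illustration (a tiny example of my own) of why that reordering is forbidden by default: floating point addition isn't associative, so regrouping a sum changes the rounding.

```cpp
// The two groupings of the same three values round differently,
// so they produce different doubles.
bool grouping_matters() {
    double left  = (0.1 + 0.2) + 0.3;
    double right = 0.1 + (0.2 + 0.3);
    return left != right;
}
```

Since the results genuinely differ, a compiler that reassociated this sum would change the program's output, which it is not allowed to do by default.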
But we can tell it we don't mind: the -funsafe-math-optimizations
compiler flag tells the compiler we think it's okay if the rounding errors are a little different in FP calculations.
gcc -O3 -funsafe-math-optimizations …
This can get us vectorization and other optimizations around floating point values where it wouldn't be possible otherwise.
For example, if we add -funsafe-math-optimizations to our floating point array sum, we get a vectorized compilation without any messing around with assembly or vectorclass or anything.
But don't treat the unsafe math flag as a magic "go faster" command to the compiler.
You are asking for a change to the semantics of your code: that should be done with caution. Have you thought about the possibility for rounding errors and decided you're okay with it? If yes, then go ahead.
For example, there's an algorithm, Kahan summation, that sums values and is designed to reduce error. It essentially carries a second float through the summation that holds the error, and tries to include it in the sum as it goes.
It's algebraically equivalent to normal summation (the compensation value c it carries would always be exactly 0 if floats were real numbers).
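Here's a standard sketch of Kahan summation (my own implementation, not code from the course): c accumulates the rounding error and feeds it back into the sum.

```cpp
#include <cstddef>

// Compensated (Kahan) summation: c carries the running rounding error.
// Algebraically c is always 0 -- which is exactly the assumption
// -funsafe-math-optimizations would let the compiler exploit to
// "simplify" this back into a plain sum.
double kahan_sum(const double* a, size_t n) {
    double sum = 0.0;
    double c = 0.0;                // the compensation
    for (size_t i = 0; i < n; i++) {
        double y = a[i] - c;       // corrected next term
        double t = sum + y;        // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover what was just lost
        sum = t;
    }
    return sum;
}
```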
Because it's algebraically-equivalent to a normal summation, -funsafe-math-optimizations
would give the compiler permission to unwind the algorithm into an implementation with the normal error.
If I was using Kahan summation for increased accuracy, I would think that compilation was incorrect.
(But note it's still slower than the simple summation: the code is complex enough that the optimizer doesn't really apply all the tricks.)
In GCC (and only GCC as far as I know) it's possible to ask for this optimization on a single function with a function attribute:
__attribute__((optimize("-funsafe-math-optimizations")))
double do_float_stuff(double a, double b) {
    …
}
That at least lets you give this permission on a piece of code smaller than a file.
The C++20 unsequenced execution policy seems to allow you to express "do this in whatever order is fastest", but I'm not seeing that speedup (in my compiler, today, with the command line I used).
DATA_T vector_sum_4(vector<DATA_T> vec) {
    return std::reduce(
        std::execution::unseq,
        vec.cbegin(), vec.cend(),
        (DATA_T)0.0,
        [](const auto& x, const auto& y) { return x + y; }
    );
}
Update right before I have to leave for lecture: with gcc -O3 and std::execution::unseq (vector_sum_4), I see performance very close to the manual-SIMD vectorclass implementation (array_sum_6).
vector_sum_1 -96418.9 in 10.1912
vector_sum_2 -96418.9 in 10.2783
vector_sum_3 -96415.3 in 8.29111
vector_sum_4 -96405.6 in 7.41596
array_sum_5  -96418.9 in 10.1734
array_sum_6  -96415.2 in 7.221
array_sum_7  -96418.9 in 10.1876
Up to now, I would describe the compiler output we have seen as giving us good assembly implementations of what we expressed in C.
It does weirder stuff than you probably thought…
Remember Kernighan's popcount algorithm?
uint8_t popcount2(uint64_t n) {
    uint8_t count = 0;
    while (n > 0) {
        count += 1;
        n &= n - 1; // clear the lowest set bit
    }
    return count;
}
Now there's an instruction for that, but decades of code exists from when that wasn't true.
It turns out GCC recognizes that pattern and turns it into the popcnt
instruction. So does clang .
The compiler doesn't have to respect the big-O running time of our algorithm, only the result our function produces.
Consider this recursive function that calculates \(f b^p\) using the observation that \(f b^p = fb b^{p-1}\):
uint64_t power_1(uint64_t f, uint64_t b, uint64_t p) {
    if (p == 0) {
        // f * b^0 == f
        return f;
    } else {
        // f * b^p == (b*f) * b^(p-1)
        return power_1(b * f, b, p - 1);
    }
}
It happens to be tail recursive: the return value in the recursive case is just the recursive call (with no other calculation around it).
A tail recursive function can be converted into a loop (roughly: instead of calling the function, just jump back up to the top of the function). In this case, this is equivalent:
uint64_t power_2(uint64_t f, uint64_t b, uint64_t p) {
    while (p > 0) {
        f *= b;
        p -= 1;
    }
    return f;
}
Those compile to the same assembly with gcc -O2
, or with clang -O1
.
Again, the compiler doesn't care about our algorithm, only our result.
Want something weirder?
More math I know: \(a + b = (a+1) + (b-1)\). I thought this example would be the same as the previous one:
uint64_t add_1(uint64_t a, uint64_t b) {
    if (b == 0) {
        // a + 0 == a
        return a;
    } else {
        // a + b == (a+1) + (b-1)
        return add_1(a + 1, b - 1);
    }
}
… and it would compile with logic like this:
uint64_t add_2(uint64_t a, uint64_t b) {
    while (b != 0) {
        b -= 1;
        a += 1;
    }
    return a;
}
Nope. They both compile like this:
uint64_t add_3(uint64_t a, uint64_t b) {
    return a + b;
}
Those examples are interesting, and show how far the optimizer is willing to go to give you a good implementation, but they aren't typical.
In particular, the more complex the code gets, the less likely the optimizer is going to be able to recognize patterns and transform them.
General advice:
I'll leave you to ponder the compilation of this code:
bool is_space(char c) {
    return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}
That doesn't use an OR, or a branch, or a conditional move.
[Example from Matt Godbolt @45:00.]