We have seen a lot of C (and C++) through this course.
It was mostly doing two things: giving us a way to call assembly code, and acting as a basis to compare the performance of our assembly code.
Usually the compiler proved itself to be a better assembly programmer than us. The C was almost always faster than (or similar to) our hand-written assembly. The only two exceptions (I think, on most processors):
In the hailstone example, performance was close to the conditional-branch version. A slightly improved branch predictor, or an improved pipeline with a lower misprediction penalty, would negate the cmov benefit in this case.
For the floating point code, it would be nice if we could express "assume + is associative: I don't care about the exact rounding".
We have seen that C can usually get the same performance as hand-written assembly (or better). How can we make sure it always does?
There are many ways to write C with bad performance: bad algorithms (e.g. exponential Fibonacci algorithm, bubble sort), unpredictable branches, bad memory access (poor locality = cache misses), bad data layout (e.g. array of structs/objects), …
Now that we know those pitfalls, we can avoid them (or at least make an informed decision about avoiding/ignoring them).
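As a sketch of the data-layout point (the names here are my own, for illustration): scanning one field of an array of structs drags the other fields through the cache, while a struct-of-arrays layout keeps the scanned values contiguous.

```cpp
#include <cstddef>

// Array-of-structs: summing x also pulls every unused y into cache,
// since x and y are interleaved in memory.
struct PointAoS { float x, y; };
float sum_x_aos(const PointAoS* pts, size_t n) {
    float total = 0;
    for (size_t i = 0; i < n; i++) total += pts[i].x;
    return total;
}

// Struct-of-arrays: all the x values are contiguous, so every byte
// fetched into cache is actually used (and SIMD loads are easy).
struct PointsSoA { float* x; float* y; };
float sum_x_soa(const PointsSoA& pts, size_t n) {
    float total = 0;
    for (size_t i = 0; i < n; i++) total += pts.x[i];
    return total;
}
```

Both functions compute the same thing; only the memory layout (and therefore the cache behaviour) differs.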
Almost always, having an idea what assembly the compiler will write for us is more important than hand-writing that assembly.
Of course, this is often fairly straightforward. We have a pretty good idea that this C:
int64_t f(int64_t num) {
    return 32 * num;
}
… will turn into assembly like this:
f:
    mov %rdi, %rax
    sal $5, %rax
    ret
… and won't worry too much about it when writing C.
What if we're not sure?
e.g. do C++ vector operations get turned into SIMD instructions or not?
int32_t sum_vector(std::vector<int32_t> vec) {
    int32_t total = 0;
    for (uint64_t i = 0; i < vec.size(); i++) {
        total += vec[i];
    }
    return total;
}
Of course we could look. Compiling with -S
will get us the assembly output. Machine-generated code is never beautiful but somewhere in there, I see:
.L6:
    vpaddd (%rax), %ymm1, %ymm1
    addq $32, %rax
    cmpq %rdx, %rax
    jne .L6
The vpaddd
instruction is add packed 32-bit integers
. Yes, it's a SIMD operation.
If we change the data type to float
, the work is done by addss
, a scalar operation.
Looking at gcc -S
output directly is annoying. Luckily other people have been annoyed by it and helped us.
The Compiler Explorer at godbolt.org will run our code through a compiler and show us the assembly output, helpfully annotated.
Let's have a look at the sum_vector
code in godbolt.org.
General notes:
Under Output…, you can select AT&T/Intel assembly syntax.
In the case of our sum_vector code, we can find the vpaddd instruction in the output.
On the compilation:
-O1 creates much more readable assembly that's closer to what I wrote in C++.
clang -O3 (or -O2) gets a much more aggressive compilation: a partially-unrolled loop with several accumulators to (presumably) avoid data stalls. It doesn't make a speed difference on my processor, but maybe on some?
Other notes:
Add new… gets another compiler window for side-by-side comparison.
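A hypothetical sketch of what "several accumulators" means (the function name is mine, not the compiler's output): splitting the sum across independent chains so consecutive additions don't have to wait on each other.

```cpp
#include <cstdint>
#include <cstddef>

// Two independent accumulators: t0 and t1 form separate dependency
// chains, so their additions can overlap in the pipeline instead of
// each add waiting on the previous one.
int32_t sum_two_accumulators(const int32_t* a, size_t n) {
    int32_t t0 = 0, t1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        t0 += a[i];
        t1 += a[i + 1];
    }
    for (; i < n; i++) t0 += a[i];  // leftover element if n is odd
    return t0 + t1;
}
```

The compiler does this kind of transformation itself at higher optimization levels; writing it by hand is rarely necessary.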
The Compiler Explorer is an extremely valuable tool (and we thank Matt Godbolt for all of his work on it).
If you take nothing else from this course aside from being able to read and understand its output, then it's not a total loss.
Of course, knowing what assembly output you want requires knowing a lot of what we have discussed.
If we don't like the assembly the compiler is producing, what can we do about it?
First: pay attention to compiler optimization (more later).
If you really want to control the assembly output, it's possible to write inline assembly in C and C++. It should be considered a last-resort, but it can be done.
When inserting raw assembly code, we have to give the compiler enough information to integrate it with the code it wrote (and the registers it's using, etc.).
Besides the assembly itself, we need to say which C variables we're writing and reading, and which registers we're going to overwrite.
__asm__(
    assembly_code
    : output_operands
    : input_operands
    : registers_used
);
Let's try it. Create a few C variables and do some calculations in assembly…
uint64_t a, b, c, d;
a = 100; b = 200; c = 3; d = 4;
__asm__(
    "mov $5000, %%rax\n"
    "mov %%rax, %0\n"
    "add %2, %0\n"
    "mov %3, %1\n"
    : "+r"(c), "+r"(d)  // %0 and %1 in the assembly
    : "r"(a), "r"(b)    // %2 and %3 in the assembly
    : "rax"             // %rax is clobbered
);
assert(a == 100 && b == 200 && c == 5100 && d == 200);
Inline assembly is almost certainly a bad idea: it makes your C code non-portable (i.e. won't work on ARM); has all of the danger of assembly, plus worse interactions with the compiler; is almost certainly achievable some other way.
We saw the vectorclass library: it let us express ourselves in C++, but in a way that very carefully directs the assembly output.
Usually we can just rely on the compiler and optimizer to do its thing, but if you really care about every clock cycle, libraries like that might be worth it.
If we look at the implementation of vectorclass, in some sense it's very simple.
e.g. the code to implement +
on two Vec8f
is just an operator overload (a C++ operator+
definition) that makes one call, to _mm256_add_ps
.
What's that?
It's a compiler intrinsic: a function that has some special meaning to the compiler. Typically, they mean use this assembly instruction, or some fallback if it's not available
.
In this case, the _mm256_add_ps intrinsic adds float32 vectors. The corresponding… instruction is VADDPS.
So, it's a way to get single-precision SIMD addition.
And godbolt confirms that _mm256_add_ps
gets us exactly the vaddps
instruction.
In this case, without -march=haswell
, compilation fails: for this intrinsic, it's that instruction or nothing.
Earlier in the course, we met __builtin_popcountll
, an intrinsic for counting the number of set bits.
uint8_t popcount3(uint64_t n) {
    return (uint8_t)__builtin_popcountll(n);
}
I said this was an intrinsic that requested the popcntq
instruction, and I wasn't lying .
If we remove the -march=haswell when compiling, we get call __popcountdi2 for some other implementation provided by the compiler.
But compiler intrinsics and inline assembly are still not the tool you need for most code.
Mostly, the compiler is smart enough to turn your code into something fast. How smart is it, and how can we help it?
The optimizer will do its best to transform what you wrote into something fast (or if you ask, small).
We have been accessing compiler optimization all semester by adding switches like -O1 to our compiler command line. Each of the levels of optimization is a collection of specific optimizations that the compiler can apply to our code.
For example, there's an optimization flag -finline-functions
that is enabled as part of -O2
(and therefore also -O3
).
We could either select one of the packs of optimizations:
gcc -O1 file.c
Or turn on those optimizations individually:
gcc -fauto-inc-dec -fbranch-count-reg … file.c
The optimizations make compilation take more time, but at least the lower levels are generally worth it.
Roughly,
-O0: no optimization (the default).
-O1: basic optimizations that don't take that much more compile time.
-O2: most optimizations.
-O3: aggressively optimize, even if it greatly increases compile time or executable size.
-Og: optimize for debugging, so assembly behaviour maps closely to the C code. ≈ -O1
-Os: optimize for a smaller executable. ≈ -O2
Historically, -O3
could sometimes cause worse performance because it's so aggressive with its choices. That's less true now (but some of you saw it on Lab 9.)
The traditional advice was to use -O2 for release builds because it was fast but safe.
I have been suggesting -O3
because it turns on auto-vectorization (later).
Typically, Clang is a little more aggressive about what it does in the lower optimization levels.
Everything we're about to see is true in optimizing compilers in general, with a few differences about what happens at -O1
vs -O2
vs -O3
.
Let's start with some basics. Before this course, you probably thought variables lived in memory, maybe on the stack.
int64_t variable_usage(int64_t x, int64_t y) {
    int64_t a = x + 54941 * y;
    int64_t b = -a * y;
    return b;
}
They might (and do with no optimization), or they might be kept in registers (and are with -O1).
But note in the -O1 compilation: the code doesn't actually calculate or store the variable I called a. It saved one instruction by calculating the negation of a, as if I wrote:
int64_t algebra(int64_t x, int64_t y) {
    int64_t negative_a = -54941 * y - x;
    int64_t b = negative_a * y;
    return b;
}
The compiler is also happy to evaluate constant expressions at compile time.
There's no need to get out a calculator when writing code. Write expressive code and let the compiler do the arithmetic.
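For instance (a small sketch of my own; the constant isn't from the course examples), the compiler folds constant arithmetic at compile time, so the generated code just loads the final number:

```cpp
#include <cstdint>

// The multiplications are done by the compiler, not at runtime:
// the generated assembly simply returns the constant 604800.
int64_t seconds_per_week() {
    return 7 * 24 * 60 * 60;
}
```

Writing `7 * 24 * 60 * 60` documents where the number came from, at zero runtime cost.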
The basic tricks we had to do manually in assembly will be done by the compiler automatically.
int64_t const_mult_1(int64_t a) {
    return a * 16;
}
int64_t const_mult_1a(int64_t a) {
    return a << 4;
}
Those compile to the same assembly.
And some more we didn't think of.
int64_t const_mult_2(int64_t a) {
    return a * 18;
}
That compiles in an unexpected way:
leaq (%rdi,%rdi,8), %rax
addq %rax, %rax
ret
The lea
instruction basically does simple math. The address
referred to by (%rdi,%rdi,8)
is rdi + rdi*8
, or 9 times %rdi
.
So the compiler got an integer addition and multiplication and put the result in a different register by slightly-abusing lea
. That's clever. We might have written this instead of that lea
:
mov %rdi, %rax
shl $3, %rdi
add %rdi, %rax
We can try to outsmart the compiler and pre-optimize like this:
int32_t const_mult_3(int32_t a) {
    return a * 65599; // 65599 == 2^16 + 2^6 - 2^0
}
int32_t const_mult_3a(int32_t a) {
    return (a << 16) + (a << 6) - a;
}
Nope. The compiler decides the multiplication is cheaper than the shifts and adds, and fixes our over-eager and less-readable code. [Example from Matt Godbolt @31:00.]
The compiler will also do things like moving expressions out of a loop if they don't need to be repeated (loop-invariant code motion). e.g. the y*z here:
void loop_invariant_1(int* a, unsigned n, int y, int z) {
    for (unsigned i = 0; i < n; i++) {
        a[i] = 6 * i + y * z;
    }
}
The multiplication is done outside the loop. [Example adapted from Wikipedia.]
The optimizer will sometimes replace a branch with a cmov
(infrequently, in my experience):
int cond_move_1(int x, int y) {
    int result;
    if (y <= 9) {
        result = x + 1;
    } else {
        result = x - 1;
    }
    return result;
}
In a simple case like this, we get a conditional move, but not if things get a little more complex.
We saw (but didn't really talk about) the cost of calling a function: we at least have the cost of pushing the return address (by call
) and popping it (by ret
). Maybe also preserving registers, etc.
So, maybe we don't want to do a function call for every tiny function we write. The compiler can inline simple functions: basically insert their code in place of the function call.
double logistic(double x) {
    return 1 / (1 + exp(-x));
}
void logistic_fill(double* arr, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        arr[i] = logistic((double)i);
    }
}
The logistic function is not called at -O2, but is at -O1.
But exp
is called: it's defined in a separate file (math.h
and friends), so can't be inlined (also, it might not be worth it, depending on its definition).
The lesson, I think, in these examples: write beautiful, expressive, clear, idiomatic code. The optimizer can figure out how to rearrange it to be fast.
We would really like to be able to use the SIMD functionality of our processors without writing assembly or having to use low-level tools like intrinsics (or vectorclass or similar).
We have seen how much performance can be gained with SIMD instructions. Keeping an eye on how the compiler uses them on our code can make a big difference.
Some SIMD things happen at -O2
, but mostly at -O3
.
void add_four_double(double* a, double* b, double* __restrict c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}
Compiling this generates a vaddpd
instruction only if we add the __restrict
qualifier to the third argument.
Saying __restrict
promises that there's no way to modify that value from anywhere else: no aliases. If there were aliases (e.g. c == a + 1
), then turning this into a single vaddpd
would be incorrect.
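To see why that would be incorrect, here's a sketch (a non-__restrict variant with test values of my own) where c == a + 1 makes each store feed the next load, so the sequential result differs from a four-at-once vector add:

```cpp
// Without __restrict the compiler must assume c might alias a or b,
// so it can't safely do all four additions as one vaddpd: each store
// through c might change a value a later line reads through a.
void add_four(const double* a, const double* b, double* c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}
```

With buf = {1, 1, 1, 1, 1}, b = {10, 10, 10, 10}, and c == buf + 1, each sum reads the value the previous line just wrote, so buf ends as {1, 11, 21, 31, 41}. A single vector load of a's four values up front would instead have produced {1, 11, 11, 11, 11}.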
We meet a limitation of the abstraction presented by the language: C wasn't designed to express vector-like thoughts, so it can take some work to convince the compiler it's okay. (Same for C++ and most other languages.)
We have seen that summing an array/vector:
DATA_T array_sum_1(DATA_T* array, uint64_t length) {
    DATA_T sum = 0;
    for (size_t i = 0; i < length; i++) {
        sum += array[i];
    }
    return sum;
}
… uses SIMD instructions for integer types but not floating point types, and not until -O3
.
That's going to be a fairly common pattern: you usually don't get SIMD stuff lower than -O3
and optimization options are more restricted for floating point operations.
It sure would be nice to be able to convince the compiler it's okay to rearrange floating point operations…
We saw before: the compiler won't rearrange/reorder/apply algebra to floating point operations because it can cause a different result. It's not allowed to do that.
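A classic illustration (a tiny example of my own) of why that reordering is forbidden by default: floating point addition isn't associative, so regrouping a sum changes the rounding.

```cpp
// The two groupings of the same three values round differently,
// so they produce different doubles.
bool grouping_matters() {
    double left  = (0.1 + 0.2) + 0.3;
    double right = 0.1 + (0.2 + 0.3);
    return left != right;
}
```

Since the results genuinely differ, a compiler that reassociated this sum would change the program's output, which it is not allowed to do by default.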
But we can tell it we don't mind: the -funsafe-math-optimizations
compiler flag tells the compiler we think it's okay if the rounding errors are a little different in FP calculations.
gcc -O3 -funsafe-math-optimizations …
This can get us vectorization and other optimizations around floating point values where it wouldn't be possible otherwise.
For example, if we add -funsafe-math-optimizations to our floating point array sum, we get a vectorized compilation without any messing around with assembly or vectorclass or anything.
But don't treat the unsafe math flag as a magic "go faster" command to the compiler.
You are asking for a change to the semantics of your code: that should be done with caution. Have you thought about the possibility for rounding errors and decided you're okay with it? If yes, then go ahead.
For example, there's an algorithm, Kahan summation, that sums values and is designed to reduce error. It essentially carries a second float through the summation that holds the error, and tries to include it in the sum as it goes.
It's algebraically equivalent to normal summation (the compensation value c it carries would always be exactly 0 if floats were real numbers).
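Here's a standard sketch of Kahan summation (my own implementation, not code from the course): c accumulates the rounding error and feeds it back into the sum.

```cpp
#include <cstddef>

// Compensated (Kahan) summation: c carries the running rounding error.
// Algebraically c is always 0 -- which is exactly the assumption
// -funsafe-math-optimizations would let the compiler exploit to
// "simplify" this back into a plain sum.
double kahan_sum(const double* a, size_t n) {
    double sum = 0.0;
    double c = 0.0;                // the compensation
    for (size_t i = 0; i < n; i++) {
        double y = a[i] - c;       // corrected next term
        double t = sum + y;        // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover what was just lost
        sum = t;
    }
    return sum;
}
```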
Because it's algebraically-equivalent to a normal summation, -funsafe-math-optimizations
would give the compiler permission to unwind the algorithm into an implementation with the normal error.
If I was using Kahan summation for increased accuracy, I would think that compilation was incorrect.
(But note it's still slower than the simple summation: the code is complex enough that the optimizer doesn't really apply all the tricks.)
In GCC (and only GCC as far as I know) it's possible to ask for this optimization on a single function with a function attribute:
__attribute__((optimize("-funsafe-math-optimizations")))
double do_float_stuff(double a, double b) {
    …
}
That at least lets you give this permission on a piece of code smaller than a file.
The C++20 unsequenced execution policy seems to allow you to express "do this in whatever order is fastest", but I'm not seeing that speedup (in my compiler, today, with the command line I used).
DATA_T vector_sum_4(vector<DATA_T> vec) {
    return std::reduce(
        std::execution::unseq,
        vec.cbegin(), vec.cend(),
        (DATA_T)0.0,
        [](const auto& x, const auto& y) { return x + y; }
    );
}
Update right before I have to leave for lecture: with gcc -O3 and std::execution::unseq (vector_sum_4), I see performance very close to the manual-SIMD vectorclass implementation (array_sum_6).
vector_sum_1 -96418.9 in 10.1912
vector_sum_2 -96418.9 in 10.2783
vector_sum_3 -96415.3 in 8.29111
vector_sum_4 -96405.6 in 7.41596
array_sum_5  -96418.9 in 10.1734
array_sum_6  -96415.2 in 7.221
array_sum_7  -96418.9 in 10.1876
Up to now, I would describe the compiler output we have seen as giving us good assembly implementations of what we expressed in C.
It does weirder stuff than you probably thought…
Remember Kernighan's popcount algorithm?
uint8_t popcount2(uint64_t n) {
    uint8_t count = 0;
    while (n > 0) {
        count += 1;
        n &= n - 1; // clear the lowest set bit
    }
    return count;
}
Now there's an instruction for that, but decades of code exists from when that wasn't true.
It turns out GCC recognizes that pattern and turns it into the popcnt
instruction. So does clang .
The compiler doesn't have to respect the big-O running time of our algorithm, only the result our function produces.
Consider this recursive function that calculates \(f b^p\) using the observation that \(f b^p = fb b^{p-1}\):
uint64_t power_1(uint64_t f, uint64_t b, uint64_t p) {
    if (p == 0) {
        // f * b^0 == f
        return f;
    } else {
        // f * b^p == (b*f) * b^(p-1)
        return power_1(b * f, b, p - 1);
    }
}
It happens to be tail recursive: the return value in the recursive case is just the recursive call (with no other calculation around it).
A tail recursive function can be converted into a loop (roughly: instead of calling the function, just jump back up to the top of the function). In this case, this is equivalent:
uint64_t power_2(uint64_t f, uint64_t b, uint64_t p) {
    while (p > 0) {
        f *= b;
        p -= 1;
    }
    return f;
}
Those compile to the same assembly with gcc -O2
, or with clang -O1
.
Again, the compiler doesn't care about our algorithm, only our result.
Want something weirder?
More math I know: \(a + b = (a+1) + (b-1)\). I thought this example would be the same as the previous one:
uint64_t add_1(uint64_t a, uint64_t b) {
    if (b == 0) {
        // a + 0 == a
        return a;
    } else {
        // a + b == (a+1) + (b-1)
        return add_1(a + 1, b - 1);
    }
}
… and it would compile with logic like this:
uint64_t add_2(uint64_t a, uint64_t b) {
    while (b != 0) {
        b -= 1;
        a += 1;
    }
    return a;
}
Nope. They both compile like this:
uint64_t add_3(uint64_t a, uint64_t b) {
    return a + b;
}
Those examples are interesting, and show how far the optimizer is willing to go to give you a good implementation, but they aren't typical.
In particular, the more complex the code gets, the less likely the optimizer is going to be able to recognize patterns and transform them.
General advice:
I'll leave you to ponder the compilation of this code:
bool is_space(char c) {
    return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}
That doesn't use an OR, or a branch, or a conditional move.
[Example from Matt Godbolt @45:00.]