We have seen many ways processors let us get a lot done per clock cycle: pipelining, superscalar architectures, memory caching.
Those are mostly not the programmer's problem. We might want to arrange our code to take maximal advantage of them, but they happen automatically to all of our code.
There are other ways to get more done per clock cycle, but they are the programmer's problem. Let's talk about an important one…
There are many times you want to do the same operation on many values.
Consider the dot product from the lab exercises: there are many "multiply these two things and add to the total" operations, one for each pair of elements across the arrays.
It turns out this is something processors can do: single instruction, multiple data or SIMD operations.
The idea: give the processor vectors of values (∼small fixed-length array) as operands and ask it to do the same operation on all of them. If a processor can do a 4-value SIMD integer addition, we would expect this calculation to happen in one instruction:
Or in other words, this function could be implemented in one instruction.
void add_four_double(double* a, double* b, double* c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}
SIMD operations are data parallelism: we are working on multiple pieces of data concurrently. They are not concurrent in the sense of multithreading: we're still talking about one instruction at a time.
The %xmmn registers we have been using for floating point operations are the SIMD registers in x86-64. We have only been using the lower 64 bits, which is what the instructions we have been using work with (or the lower 32 for single-precision operations). There is more space in those registers, and we could choose to use it…
Like with the general purpose registers, there are differently-sized aliases to the same data (e.g. %rax, %eax, %ax, %al all refer to some of the same bits in the processor).
For the SIMD registers, the %xmmn names refer to 128-bit registers. The %ymmn registers are 256-bit extensions of those, i.e. %xmm6 is the lower 128 bits of %ymm6.
A SIMD register can hold any of the types the processor can deal with, as many as will fit.
If we can do the same operation on 4, 8, or 16 values with one instruction instead of a loop that runs that many times, that should be a huge speedup.
SIMD instructions were added to x86 with the SSE (Streaming SIMD Extensions) instructions (Pentium III, 1999), with 128-bit wide operations (the %xmmn registers). More instructions were added in SSE2, SSE3, SSSE3, SSE4.
The AVX (Advanced Vector Extensions) extension added 256-bit operations (the %ymmn registers) and AVX2 added more instructions. That's why we have been demanding Haswell or later all semester: it's the start of AVX2 and we want the 256-bit SIMD functionality.
AVX-512 adds the %zmmn registers, which are 512 bits wide, and a bunch more instructions, but you probably don't have it.
Up to now, using the half-as-big data types (int32_t instead of int64_t; float instead of double) was a fairly marginal benefit: more would fit in cache, and maybe the instructions that work on them would be a little faster.
But now: twice as many will fit in the SIMD registers, so one instruction can operate on twice as many. Even more if we can use int16_t or int8_t (or half-precision float with AVX-512).
There are SIMD vector instructions for the operations we have been doing on single values (or scalars). These are generally named as vector instructions or packed operations (because the vector registers contain several values packed together). Often the 256-bit vector versions have a v prefix.
Most (all?) of the vector instructions take separate source and destination operands. Basically, instead of having to do this kind of operation,
addsd %xmm0, %xmm1
≈xmm1 += xmm0;
… we can do:
vaddpd %ymm0, %ymm1, %ymm2
≈ymm2 = ymm0 + ymm1;
In AT&T syntax, the destination is still the last operand.
For floating point values: vaddps/vaddpd, vsubps/vsubpd, vmulps/vmulpd, vdivps/vdivpd do packed single/double-precision addition, subtraction, multiplication, and division.
For integers:
vpaddb/vpaddw/vpaddd/vpaddq: Packed ADD Byte/Word/Doubleword/Quadword (8/16/32/64-bit) Integers
vpsubb/vpsubw/vpsubd/vpsubq: that, but subtraction
You're going to be surprised how many move instructions there are: vmovaps/vmovups (packed single precision), vmovapd/vmovupd (packed double precision), vmovdqa/vmovdqu (integer data), where a means the memory must be aligned and u allows unaligned access.
Why?
For the different types, there can be a small slowdown when crossing domains. Roughly (I think), the scheduling of integer vs. floating point SIMD operations is separate enough that mixing sides adds complexity.
There can also be a small speed penalty for reading memory that isn't aligned: aligned means the address of the memory is divisible by the size of the register.
So all of those move instructions have the same effect, perhaps at a slightly different speed. You have my permission to use them interchangeably. Use the unaligned ones unless you know something about the alignment of your data.
Another hint that these instructions aren't designed for humans.
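If you do want aligned data so the aligned moves are safe, you have to arrange it yourself. A minimal sketch (my own, assuming C++17) using alignas and std::aligned_alloc:

#include <cstdlib>
#include <cstdint>

// a local array aligned to 32 bytes (the width of a %ymmn register)
alignas(32) double buffer[4];

double* allocate_aligned(uint64_t n) {
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // so n should be a multiple of 4 here; release with std::free()
    return static_cast<double*>(std::aligned_alloc(32, n * sizeof(double)));
}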
Note that all of the move instructions to get data into the SIMD registers rely on the data being adjacent in memory.
If what we have is an array of int64_t or float, then they work with data like we have it. We can load several values into a register and do some calculation. No problem.
But if we have an array of struct or class values, we don't have the memory layout we need.
struct rgb {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
};
If we want to do some calculation on many of those values (e.g. convert to HSV), it's hard to get all of the red values into a SIMD register to start working with them.
So, an array-of-structs or array-of-objects is hard to work with using SIMD operations.
If we had an array of reds, another of greens, and another of blues, then it's easy.
I like classes/objects. If I need absolute maximum numerical performance, maybe I have a hard choice to make.
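To make the contrast concrete, here is a sketch (hypothetical names, my own) of the two layouts; the second is what the SIMD loads want:

#include <cstdint>
#include <vector>

// array-of-structs: one pixel's red, green, blue are adjacent,
// but the reds of different pixels are 3 bytes apart
struct rgb { uint8_t red, green, blue; };
std::vector<rgb> pixels;

// struct-of-arrays: all of the reds are adjacent in memory,
// so one vector move can load many of them at once
struct rgb_planes {
    std::vector<uint8_t> red, green, blue;
};
rgb_planes planes;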
There are gather instructions (e.g. vgatherdps/vgatherqps) that can do addressing more like single-value assembly instructions (like 8(%rbp, %rcx, 4)). They might be able to get array-of-object values into SIMD registers.
They seem… challenging.
But let's push through and try our old friend, summing an array of double, with the floating point SIMD instructions…
When loading data into the %ymmn registers, we'll read 32 adjacent bytes from memory. Then we can do vector operations to add up (what I think of as) columns.
e.g. if we have 12 elements in the array, a, b, c, …, l, vector operations will get us the four column sums a+e+i, b+f+j, c+g+k, and d+h+l.
Then we'll do the horizontal add to get the final sum.
Also, I'm going to assume an array length divisible by four (or however many values are held in the registers for other types). We could deal with any remaining values, but let's simplify.
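To be clear about what we're going to compute, here is a plain scalar sketch (my own reconstruction, assuming the length is divisible by four) of the same order of operations:

#include <cstdint>

double double_sum_2_sketch(double* array, uint64_t length) {
    double acc[4] = {0, 0, 0, 0};   // like the four lanes of %ymm0
    for (uint64_t i = 0; i < length; i += 4) {
        acc[0] += array[i];         // the "column" sums
        acc[1] += array[i + 1];
        acc[2] += array[i + 2];
        acc[3] += array[i + 3];
    }
    // the horizontal add: combine the four partial sums
    return (acc[0] + acc[2]) + (acc[1] + acc[3]);
}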
Our function will have this signature: both arguments go in the general purpose registers (%rdi, %rsi) and we return in %xmm0.
double double_sum_2(double* array, uint64_t length);
The basic loop is easy enough: %ymm0 as a 4-part accumulator, %rcx as a counter, moving four steps along with each iteration.
double_sum_2:
    vxorpd %ymm0, %ymm0, %ymm0            # zero the 4-part accumulator
    mov $0, %rcx                          # counter
ds2_loop:
    cmp %rsi, %rcx
    jae ds2_return                        # done when counter >= length
    vaddpd (%rdi, %rcx, 8), %ymm0, %ymm0  # add 4 doubles to the accumulator
    add $4, %rcx
    jmp ds2_loop
vxorpd %ymm0, %ymm0, %ymm0
vaddpd (%rdi, %rcx, 8), %ymm0, %ymm0
Note the v-prefixed instructions with the separate destination operand, and the %ymmn registers.
When we get to the end of the loop, it gets strangely tricky to finish the sum. The idiom seems to be: first, copy the upper 128 bits to the lower 128 bits of another register, so everything is in an %xmmn register.
The vextractf128 instruction does that. The first operand indicates which 128-bit chunk: we want the high 128 bits.
vextractf128 $0x1, %ymm0, %xmm1
After that instruction, %xmm0 holds elements 0 and 1 of the accumulator and %xmm1 holds elements 2 and 3.
The two xmm registers can be added easily:
vaddpd %xmm1, %xmm0, %xmm0
Now %xmm0 holds two partial sums: element 0 plus element 2, and element 1 plus element 3.
Then we can shuffle element 1 to position 0 in another register:
vshufpd $0b01, %xmm0, %xmm0, %xmm1
A final scalar add will get the sum we want:
vaddsd %xmm1, %xmm0, %xmm0
The full horizontal add after the loop:
ds2_return:
    vextractf128 $0x1, %ymm0, %xmm1       # xmm1 = high 128 bits of the accumulator
    vaddpd %xmm1, %xmm0, %xmm0            # two partial sums in xmm0
    vshufpd $0b01, %xmm0, %xmm0, %xmm1    # xmm1[0] = xmm0[1]
    vaddsd %xmm1, %xmm0, %xmm0            # final scalar sum in xmm0[0]
    ret
This was not meant for people to write, but as an idiom, it's fairly re-usable.
Don't worry about the details of these instructions, just know what this code fragment does.
Did we actually get any speed for all that? The non-SIMD version (using addsd for all the additions) took ∼2.4 cycles per element (on my processor).
This version (mostly vaddpd, and moving 4 steps per loop): ∼1.5 cycles per element.
A <2× speedup, not 4. Likely inference: the vector instructions are slightly slower than the scalar instructions, but we get lots more work done.
A C implementation took ∼2.5 cycles per element: we get to see the compiler's reluctance to rearrange our arithmetic (and the C version can deal with array lengths not divisible by 4).
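That C implementation is presumably just the straightforward loop (my reconstruction; the course's exact code may differ):

#include <cstdint>

double double_sum(double* array, uint64_t length) {
    double total = 0;
    // one scalar addition per element, strictly left to right
    for (uint64_t i = 0; i < length; i++) {
        total += array[i];
    }
    return total;
}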
Remember that we decided here to do the additions in a different order. That means different accumulation of error. Was that the right choice? Depends on the problem. See MACM 316.
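As a tiny demonstration (my example) that the order matters in floating point: with single precision, adding 1.0f to 1.0e8f one value at a time is lost to rounding, but grouping the small values first is not.

#include <cstdio>

int main() {
    float left_to_right = 1.0e8f;
    float small = 0.0f;
    for (int i = 0; i < 8; i++) {
        left_to_right += 1.0f;   // each 1.0f rounds away: 1e8 + 1 == 1e8 in float
        small += 1.0f;           // accumulate the small values separately
    }
    float grouped = 1.0e8f + small;   // 1e8 + 8 is exactly representable
    std::printf("%.1f vs %.1f\n", left_to_right, grouped);  // 100000000.0 vs 100000008.0
}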
What if we don't need the precision/range of double-precision? We can do the same thing with single precision and do eight additions per iteration.
The main loop isn't so different (vaddps for single-precision addition; add 8 to the counter):
float_sum_2:
    vxorps %ymm0, %ymm0, %ymm0            # zero the 8-part accumulator
    mov $0, %rcx
fs2_loop:
    cmp %rsi, %rcx
    jae fs2_return
    vaddps (%rdi, %rcx, 4), %ymm0, %ymm0  # add 8 floats to the accumulator
    add $8, %rcx
    jmp fs2_loop
The horizontal add is one step longer:
fs2_return:
    vextractf128 $0x1, %ymm0, %xmm1           # xmm1 = high 128 bits
    vaddps %xmm1, %xmm0, %xmm0                # four partial sums
    vshufps $0b00001110, %xmm0, %xmm0, %xmm1  # move lanes 2,3 down to 0,1
    vaddps %xmm1, %xmm0, %xmm0                # two partial sums
    vshufps $0b00000001, %xmm0, %xmm0, %xmm1  # move lane 1 down to 0
    vaddss %xmm1, %xmm0, %xmm0                # final scalar sum
    ret
The C equivalent on float still takes ∼2.1 cycles per element, and a non-vector assembly (addss) implementation also takes ∼2.1.
The single-precision SIMD version: ∼0.7 cycles per element.
Compared to double precision: do twice as many things per iteration; iterate half as many times; twice as fast.
How fast is 0.7 cycles per element?
This was a 5.5 GHz processor, so that's \(7.86\times 10^{9}\) single-precision values per second.
\[ (7.86\times 10^{9}\mbox{ values/s}) \times (4\mbox{ B/value}) = 31.4\mbox{ GB/s}\,. \]
That system's DDR5-6000 should have a peak transfer rate of 48 GB/s. The 31.4 GB/s processing is getting close to that.
The sum array is a very minimal calculation: anything more complex (e.g. a dot product) is going to be more processor-limited.
Aside: GDB can print the contents of a SIMD register, but it doesn't know the type of the contents, so you have to.
(gdb) print $ymm0 $2 = {v16_bfloat16 = {-1.075e+08, 0.09961, -1.075e+08, 0.1992, -1.592e-23, 0.2988, -1.075e+08, 0.3984, 0, 0.5, -1.592e-23, 0.5977, 4.168e-08, 0.6992, -1.075e+08, 0.7969}, v16_half = { -19.203, 1.4492, -19.203, 1.5742, -0.0027351, 1.6494, -19.203, 1.6992, 0, 1.75, -0.0027351, 1.7744, 0.22498, 1.7998, -19.203, 1.8242}, v8_float = {0.100000001, 0.200000003, 0.300000012, 0.400000006, 0.5, 0.600000024, 0.699999988, 0.800000012}, v4_double = {1.3411048210842936e-08, 3.4332283476601335e-06, 9.7656287607605918e-05, 0.00087890645809238774}, v32_int8 = {-51, -52, -52, 61, -51, -52, 76, 62, -102, -103, -103, 62, -51, -52, -52, 62, 0, 0, 0, 63, -102, -103, 25, 63, 51, 51, 51, 63, -51, -52, 76, 63}, v16_int16 = {-13107, 15820, -13107, 15948, -26214, 16025, -13107, 16076, 0, 16128, -26214, 16153, 13107, 16179, -13107, 16204}, v8_int32 = {1036831949, 1045220557, 1050253722, 1053609165, 1056964608, 1058642330, 1060320051, 1061997773}, v4_int64 = {4489188110458735821, 4525216907491121562, 4546834186568204288, 4561245704520151859}, v2_int128 = {83475518170512110947896037946428280013, 84140132168590259879230348149687058432}}
These were single-precision float, so v8_float (vector of 8 float) was the right interpretation.
Actually writing assembly with the SIMD instructions becomes complex quickly.
There are a few reasons you don't generally need to…
The compiler might help [later topic: “compiler optimization”]. You might also be able to use a library to help.
As an example, Agner Fog's vectorclass library. It lets you explicitly access SIMD functionality from C++, in a way that can be directly translated to SIMD instructions.
I find it easiest to clone the Git repository, or download+unpack.
git clone --branch v2.02.01 https://github.com/vectorclass/version2 vectorclass
Then in your code,
#include "vectorclass.h"
And when compiling, search that directory:
g++ -Ivectorclass …
The library provides C++ types to represent values packed into the SIMD registers. The 256-bit-wide types we're most interested in:
Type | Contents | Count |
---|---|---|
Vec4q | int64_t | 4 |
Vec8i | int32_t | 8 |
Vec16s | int16_t | 16 |
Vec4d | double | 4 |
Vec8f | float | 8 |
Any of those types can be initialized by a constructor that takes either a single constant to put the same value in all of the fields (i.e. broadcast), or n values to set each field.
Vec8f acc = Vec8f(0.0);
Vec8f numbers = Vec8f(0, 1, 2, 3, 4, 5, 6, 7);
For basic operations, the operators you expect are overloaded to do element-by-element operations on the values.
e.g. these represent eight additions and multiplications:
acc += numbers;
acc = acc * numbers;
With the values initialized as on the last slide, we expect acc to be these eight single-precision values:
0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0
Is it? We can index individual values out of the vector and look:
void print_float_vec(Vec8f v) {
    cout << "{ ";
    for (unsigned i = 0; i < v.size(); i++) {
        cout << v[i] << ' ';
    }
    cout << "}\n";
}
{ 0 1 4 9 16 25 36 49 }
Yes.
Operations with this library are still very low-level: we're usually getting one assembly instruction per C++ operation. (e.g. still have to write the loops, deal with the end of the array if length not divisible by the vector register size, etc.)
But they're a lot easier to spell, and we get type checks to catch many mistakes.
e.g. this will fail to compile.
Vec8f singles = …;
Vec4d doubles = …;
doubles += singles;
error: no match for ‘operator+=’ (operand types are ‘Vec4d’ and ‘Vec8f’)
Analogous assembly would assemble, run, and produce incoherent results.
The types have a .load method that can load values from a memory location into a SIMD register (the vmovupd instruction or similar). They take a pointer to the corresponding number/type of values:
double array[8] = {0, 10, 20, 30, 40, 50, 60, 70};
Vec4d a, b;
a.load(array);
b.load(array + 4);
After that, we expect a and b to contain:
{ 0 10 20 30 } { 40 50 60 70 }
There is also .store() to write a register to memory.
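For example, a small sketch (my own) that doubles every element of an array in place, four at a time, using .load() and .store():

#include <cstdint>
#include "vectorclass.h"

// assumes length is a multiple of 4
void double_in_place(double* array, uint64_t length) {
    Vec4d tmp;
    for (uint64_t i = 0; i < length; i += 4) {
        tmp.load(array + i);    // read 4 doubles into a SIMD register
        tmp *= 2.0;             // one packed multiply
        tmp.store(array + i);   // write the 4 results back to memory
    }
}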
There are convenience functions for useful operations. Most notably, horizontal_add, which adds up the values from the register. With it, we can re-implement the sum of array values in C++…
float array_sum_vc(float* array, uint64_t length) {
    assert(length % 8 == 0);
    Vec8f acc = Vec8f(0);
    Vec8f tmp;
    for (uint64_t i = 0; i < length; i += 8) {
        tmp.load(array + i);
        acc += tmp;
    }
    return horizontal_add(acc);
}
Notes: assumes length is divisible by 8; must explicitly load values from memory into (we assume) a %ymmn register.
With that implementation, we get the same performance as the hand-written SIMD assembly (and in fact, almost exactly the same assembly).
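If we didn't want the divisible-by-8 restriction, one option (a sketch of my own, not the course's version) is to finish any leftover elements with a scalar loop:

#include <cstdint>
#include "vectorclass.h"

float array_sum_vc_any(float* array, uint64_t length) {
    Vec8f acc = Vec8f(0);
    Vec8f tmp;
    uint64_t i = 0;
    // vector part: as many full groups of 8 as fit
    for (; i + 8 <= length; i += 8) {
        tmp.load(array + i);
        acc += tmp;
    }
    float total = horizontal_add(acc);
    // scalar part: at most 7 leftover elements
    for (; i < length; i++) {
        total += array[i];
    }
    return total;
}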
A similar story on ARM in Rust: Using SIMD for Parallel Processing in Rust. They first rely on the compiler [later topic: “Auto Vectorization”], then use intrinsics, then the not-yet-stable std::simd module.
Of course, not all problems lend themselves to using SIMD instructions, but it can be surprising how many do.
If those problems occur in your life, examining their implementation for vector-based solutions can be a huge speed win.