We have seen many ways processors let us get a lot done per clock cycle: pipelining, superscalar architectures, memory caching.

Those are *mostly* not the programmer's problem. We might want to arrange our code to take maximal advantage of them, but they happen automatically to all of our code.

There are other ways to get more done per clock cycle, but they *are* the programmer's problem. Let's talk about a an important one…

There are many times you want to do the same operation on many values.

Consider the dot product from lab exercises: there are many multiply these things and then add to the total

operations on each pair of elements across the array.

It turns out this is something processors can do: single instruction, multiple data or SIMD operations.

The idea: give the processor vectors of values (∼small fixed-length array) as operands and ask it to do the same operation on all of them. If a processor can do a 4-value SIMD integer addition, we would expect this calculation to happen in **one instruction**:

Or in other words, this function could be implemented in **one** instruction.

void add_four_double(double* a, double* b, double* c) { c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3]; }

SIMD operations are data parallelism: we are working on multiple pieces of data concurrently. They are not concurrent in the sense of multithreading: we're still talking about one instruction at a time.

The `%xmm`

registers we have been using for floating point operations are the SIMD registers in x86-64.`n`

We have only been using the lower 64 bits, which is what the instructions we have been using work with (or 32 for single-precision operations). There was more space in those registers, and we could choose to use it…

Like with the general purpose registers, there are differently-sized aliases to the same data. (e.g. `%rax`

, `%eax`

, `%ax`

, `%al`

all refer to some of the same bits in the processor.)

For the SIMD registers, the `%xmm`

refer to 128-bit registers.`n`

The `%ymm`

registers are 256-bit extensions of those. i.e. `n``%xmm6`

is the lower 128 bits of `%ymm6`

.

A SIMD register can hold any of the types the processor can deal with, as many as will fit.

If we can do the same operation on 4, 8, 16 values with one instruction instead of a loop that runs that many times, that should be a *huge* speedup.

SIMD instructions were added to x86 with the SSE (Streaming SIMD Extensions) instructions (Pentium III, 1999) with 128-bit wide operations (`%xmm`

registers). More instructions were added in SSE2, SSE3, SSSE3, SSE4.`n`

The AVX (Advanced Vector Extensions) extension added 256-bit operations (`%ymm`

registers) and AVX2 added more instructions. That's why we have been demanding `n`Haswell or later

all semester: it's the start of AVX2 and we want the 256-bit SIMD functionality.

AVX-512 adds the `%zmm`

registers which are 512-bits wide, and a bunch more instructions, but you probably don't have it.`n`

Up to now, using the half-as-big data types (`int32_t`

instead of `int64_t`

; `float`

instead of `double`

) was a fairly marginal benefit: more would fit in cache, *maybe* the instructions that work on them would be a little faster.

But now: twice as many will fit in the SIMD registers, so one instruction can operate on twice as many.

Even more if we can use `int16_t`

or `int8_t`

(or half-precision float with AVX-512).

There are SIMD vector instructions for the operations we have been doing on single values (or scalars).

These are generally named the vector

instruction or a packed

operation (because the vector registers contain several values packed together).

Often the 256-bit vector versions have a

prefix.`v`

Most (all?) of the vector instructions take separate source and destination operands. Basically, instead of having to do this kind of operation,

`addsd %xmm0, %xmm1`

≈`xmm1 += xmm0;`

… we can do:

`vaddpd %ymm0, %ymm1, %ymm2`

≈`ymm2 = ymm0 + ymm1;`

In AT&T syntax, the destination is still the last operand.

For floating point values:

For integers:

`vpaddb/vpaddw/vpaddd/vpaddq`

:Packed ADD Byte/Word/Doubleword/Quadword (8/16/32/64-bit) Integers

`vpsubb/vpsubw/vpsubd`

/`vpsubq`

: that, but subtraction- … and multiply signed and unsigned.
- … and shift.
- … and compare.

You're going to be surprised how many move

instructions there are:

Why?

For the different types, there can be a small slowdown when crossing domains

. Roughly (I think) the scheduling of integer vs floating point SIMD operations is separate enough that mixing sides adds complexity.

There can also be a small speed penalty for reading memory that isn't aligned

: the address of the memory is/isn't divisible by the size of the register.

So all of those move instructions have the same effect, perhaps at a slightly different speed. You have my permission to use them interchangeably. Use the unaligned

ones unless you know something about the alignment of your data.

Another hint that these instructions aren't designed for humans.

Note that all of the move

instructions to get data into the SIMD registers rely on the data being adjacent in memory.

If what we have is an array of `int64_t`

or `float`

, then they work with data like we have it. We can load several values into a register and do some calculation. No problem.

But if we have an array of `struct`

or `class`

values, we don't have the memory layout we need.

struct rgb { uint8_t red; uint8_t green; uint8_t blue; };

If we want to do some calculation on many of those values (e.g. convert to HSV), it's hard to get all of the `red`

values into a SIMD register to start working with them.

So, an array-of-structs or array-of-objects is hard to work with using SIMD operations.

If we had an array of reds, another of greens, and another of blues, then it's easy.

I like classes/objects. If I need absolute maximum numerical performance, maybe I have a hard choice to make.

There are gather

instructions (e.g. `vgatherdps/vgatherqps`

) that can do addressing more like single-value assembly instructions (like `8(%rbp, %rcx, 4)`

). It might be able to get array-of-object values into SIMD registers.

They seem… challenging.

But let's push through and try our old friend summing an array

of `double`

with the floating point SIMD instructions…

When loading data into the `ymm`

registers, we'll read 32 adjacent bytes from memory. Then we can do vector operations to add up (what I think of as) columns.`n`

e.g. if we have 12 elements in the array, `a`

, `b`

, `c`

, …, `l`

, vector operations will get us:

Then we'll do the horizontal

add to get the final sum.

Also, I'm going to assume an array length divisible by four (or however many values are held in the registers for other types). We *could* deal with any remaining values, but let's simplify.

Our function will have this signature: both arguments go in the general purpose registers (`%rdi`

, `%rsi`

) and we return in `%xmm0`

.

double double_sum_2(double* array, uint64_t length);

The basic loop is easy enough: `%ymm0`

as a 4-part accumulator, `%rcx`

as a counter, move **four** steps along with each iteration.

double_sum_2: vxorpd %ymm0, %ymm0, %ymm0 mov $0, %rcx ds2_loop: cmp %rsi, %rcx jae ds2_return vaddpd (%rdi, %rcx, 8), %ymm0, %ymm0 add $4, %rcx jmp ds2_loop

vxorpd %ymm0, %ymm0, %ymm0

vaddpd (%rdi, %rcx, 8), %ymm0, %ymm0

Note the

instructions with the separate destination operand, and `v`

`%ymm`

registers.`n`

When we get to the end of the loop, it gets strangely tricky to finish the sum. The idiom seems to be first, copy the upper 128 bits to the lower 128 bits of another register, so everything is in an `%xmm`

register.`n`

The `vextractf128`

instruction does that. The first operand indicates which 128-bit chunk: we want the high 128 bits.

vextractf128 $0x1, %ymm0, %xmm1

vextractf128 $0x1, %ymm0, %xmm1

After that instruction:

The two `xmm`

registers can be added easily:

vaddpd %xmm1, %xmm0, %xmm0

Now:

Then we can shuffle

element 1 to position 0 in another register:

vshufpd $0b01, %xmm0, %xmm0, %xmm1

A final scalar add will get the sum we want:

vaddsd %xmm1, %xmm0, %xmm0

The full horizontal add

after the loop:

vextractf128 $0x1, %ymm0, %xmm1 vaddpd %xmm1, %xmm0, %xmm0 vshufpd $0b01, %xmm0, %xmm0, %xmm1 vaddsd %xmm1, %xmm0, %xmm0 ret

This was not meant for people to write, but as an idiom, it's fairly re-usable.

Don't worry about the details of these instructions, just know what this code fragment does.

Did we actually get any speed for all that? The non-SIMD version (using `addsd`

for all the additions) took ∼2.4 cycles per element (on my processor).

This version (mostly `vaddpd`

and moving 4 steps per loop): ∼1.5 cycles per element.

A <2× speedup, not 4. Likely inference: the vector instructions are slightly slower than the scalar instructions, but we get lots more work done.

A C implementation: ∼2.5 cycles per element: we get to see the compiler's reluctance to rearrange our arithmetic (and the C version can deal with array lengths not divisible by 4).

Remember that *we* decided here to do the additions in a different order. That means different accumulation of error. Was that the right choice? Depends on the problem. See MACM 316.

What if we don't need the precision/range of double-precision? We can do the same thing with single precision and do **eight** additions per iteration.

The main loop isn't so different (`vaddp`

for single-precision addition; add 8 to counter):**s**

float_sum_2: vxorps %ymm0, %ymm0, %ymm0 mov $0, %rcx fs2_loop: cmp %rsi, %rcx jae fs2_return vaddps (%rdi, %rcx, 4), %ymm0, %ymm0 add $8, %rcx jmp fs2_loop

The horizontal add

is one step longer:

vextractf128 $0x1, %ymm0, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vshufps $0b00001110, %xmm0, %xmm0, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vshufps $0b00000001, %xmm0, %xmm0, %xmm1 vaddss %xmm1, %xmm0, %xmm0 ret

Or in a diagram:

The C equivalent on `float`

still takes ∼2.1 cycles per element, and a non-vector assembly (`addss`

) implementation also takes ∼2.1.

The single-precision SIMD version: ∼0.7 cycles per element.

Compared to double precision: do twice as many things per iteration; iterate half as many times; twice as fast.

How fast is 0.7 cycles per element?

This was a 5.5 GHz processor, so that's \(7.86\times 10^{9}\) single-precision values per second.

\[ (7.86\times 10^{9}\mbox{ values/s}) \times (4\mbox{ B/value})\\ = 32.4\mbox{ GB/s}\,. \]

That system's DDR5-6000 should have a peak transfer rate of 48 GB/s. The 32.4 GB/s processing is getting close to that.

The sum array

is a very minimal calculation: anything more complex (e.g. dot product) is going to be more processor-limited.

Aside: GDB can print the contents of a SIMD register, but it doesn't know the type of the contents so you have to.

(gdb) print $ymm0 $2 = {v16_bfloat16 = {-1.075e+08, 0.09961, -1.075e+08, 0.1992, -1.592e-23, 0.2988, -1.075e+08, 0.3984, 0, 0.5, -1.592e-23, 0.5977, 4.168e-08, 0.6992, -1.075e+08, 0.7969}, v16_half = { -19.203, 1.4492, -19.203, 1.5742, -0.0027351, 1.6494, -19.203, 1.6992, 0, 1.75, -0.0027351, 1.7744, 0.22498, 1.7998, -19.203, 1.8242}, v8_float = {0.100000001, 0.200000003, 0.300000012, 0.400000006, 0.5, 0.600000024, 0.699999988, 0.800000012}, v4_double = {1.3411048210842936e-08, 3.4332283476601335e-06, 9.7656287607605918e-05, 0.00087890645809238774}, v32_int8 = {-51, -52, -52, 61, -51, -52, 76, 62, -102, -103, -103, 62, -51, -52, -52, 62, 0, 0, 0, 63, -102, -103, 25, 63, 51, 51, 51, 63, -51, -52, 76, 63}, v16_int16 = {-13107, 15820, -13107, 15948, -26214, 16025, -13107, 16076, 0, 16128, -26214, 16153, 13107, 16179, -13107, 16204}, v8_int32 = {1036831949, 1045220557, 1050253722, 1053609165, 1056964608, 1058642330, 1060320051, 1061997773}, v4_int64 = {4489188110458735821, 4525216907491121562, 4546834186568204288, 4561245704520151859}, v2_int128 = {83475518170512110947896037946428280013, 84140132168590259879230348149687058432}}

These were single-precision float, so

(vector of 8 `v8_float`

`float`

) was the right interpretation.

Actually writing assembly with the SIMD instructions becomes complex quickly.

There are a few reasons you don't generally need to…

The compiler might help [later topic: “compiler optimization”]. You might also be able to use a library to help.

As an example, Agner Fog's vectorclass library. It lets you explicitly access SIMD functionality from C++, in a way that can be directly translated to SIMD instructions.

I find it easiest to clone the Git repository, or download+unpack.

git clone --branch v2.02.01 https://github.com/vectorclass/version2 vectorclass

Then in your code,

#include "vectorclass.h"

And when compiling, search that directory:

g++ -Ivectorclass …

The library provides C++ types to represent values packed into the SIMD registers. The 256-bit-wide types we're most interested in:

Type | Contents | Count |
---|---|---|

`Vec4q` | `int64_t` | 4 |

`Vec8i` | `int32_t` | 8 |

`Vec16s` | `int16_t` | 16 |

`Vec4d` | `double` | 4 |

`Vec8f` | `float` | 8 |

Any of those types can be initialized by a constructor that takes either a single constant to put the same value in all of the fields (i.e. broadcast), or `n` values to set each field.

Vec8f acc = Vec8f(0.0); Vec8f numbers = Vec8f(0, 1, 2, 3, 4, 5, 6, 7);

For basic operations, the operators you expect are overloaded to do element-by-element operations on the values.

e.g. these represent eight additions and multiplications:

acc += numbers; acc = acc * numbers;

With the values initialized as on the last slide, we expect `acc`

to be these eight single-precision values:

0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0

Is it? We can index individual values out of the vector and look:

void print_float_vec(Vec8f v) { cout << "{ "; for( unsigned i=0; i < v.size(); i++ ) { cout << v[i] << ' '; } cout << "}\n"; }

Yes.

{ 0 1 4 9 16 25 36 49 }

Operations with this library are still very low-level: we're usually getting one assembly instruction per C++ operation. (e.g. still have to write the loops, deal with the end of the array if length not divisible by the vector register size, etc.)

But they're a lot easier to spell, and we get type checks to catch many mistakes.

e.g. this will fail to compile.

Vec8f singles = …; Vec4d doubles = …; doubles += singles;

error: no match for ‘operator+=’ (operand types are ‘Vec4d’ and ‘Vec8f’)

Analogous assembly would assemble, run, and produce incoherent results.

The types have a `.load`

method that can load values from a memory location into a SIMD register (the `vmovupd`

instruction or similar). They take a pointer to the corresponding number/type of values:

double array[8] = {0, 10, 20, 30, 40, 50, 60, 70}; Vec4d a, b; a.load(array); b.load(array + 4);

After that, we expect `a`

and `b`

to contain:

{ 0 10 20 30 } { 40 50 60 70 }

Also `.store()`

to write a register to memory.

There are convenience functions for useful operations. Most notably, `horizontal_add`

which adds the values from the register. With it, we can re-implement the sum of array values in C++…

float array_sum_vc(float* array, uint64_t length) { assert(length % 8 == 0); Vec8f acc = Vec8f(0); Vec8f tmp; for (uint64_t i = 0; i < length; i += 8) { tmp.load(array + i); acc += tmp; } return horizontal_add(acc); }

Notes: assumes length is divisible by 8; must explicitly load values from memory into (we assume) a `%ymm`

register.`n`

With that implementation, we get the same performance as the hand-written SIMD assembly (and in fact, almost exactly the same assembly).

A similar story on ARM in Rust: Using SIMD for Parallel Processing in Rust. They first rely on the compiler [later topic: “Auto Vectorization”], then use intrinsics, then the not-yet-stable `std::simd`

module.

Of course, not all problems lend themselves to using SIMD instructions, but it can be surprising how many do.

If those problems occur in your life, examining their implementation for vector-based solutions can be a huge speed win.