Data Parallelism

Data Parallelism

We have seen many ways processors let us get a lot done per clock cycle: pipelining, superscalar architectures, memory caching.

Those are mostly not the programmer's problem. We might want to arrange our code to take maximal advantage of them, but they happen automatically to all of our code.

Data Parallelism

There are other ways to get more done per clock cycle, but they are the programmer's problem. Let's talk about an important one…

SIMD

There are many times you want to do the same operation on many values.

Consider the dot product from the lab exercises: for each pair of elements across the arrays, there is a “multiply these things, then add to the total” operation.
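In plain scalar code, that's something like this (a sketch; the lab version may differ in details):

double dot(const double* a, const double* b, uint64_t n) {
    double total = 0.0;
    for (uint64_t i = 0; i < n; i++) {
        total += a[i] * b[i];  // the same multiply-then-accumulate, once per element
    }
    return total;
}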

It turns out this is something processors can do: single instruction, multiple data or SIMD operations.

SIMD

The idea: give the processor vectors of values (∼small fixed-length array) as operands and ask it to do the same operation on all of them. If a processor can do a 4-value SIMD integer addition, we would expect this calculation to happen in one instruction:

SIMD operation

SIMD

Or in other words, this function could be implemented in one instruction.

void add_four_double(double* a, double* b, double* c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

SIMD

SIMD operations are data parallelism: we are working on multiple pieces of data concurrently. They are not concurrent in the sense of multithreading: we're still talking about one instruction at a time.

SIMD

The %xmmn registers we have been using for floating point operations are the SIMD registers in x86-64.

We have only been using the lower 64 bits, which is what the instructions we have been using work with (or 32 bits for single-precision operations). There is more space in those registers, and we could choose to use it…

SIMD

Like with the general purpose registers, there are differently-sized aliases to the same data. (e.g. %rax, %eax, %ax, %al all refer to some of the same bits in the processor.)

For the SIMD registers, the %xmmn names refer to 128-bit registers.

The %ymmn registers are 256-bit extensions of those. i.e. %xmm6 is the lower 128 bits of %ymm6.

SIMD

A SIMD register can hold any of the types the processor can deal with, as many as will fit: a 256-bit %ymmn register holds 4 doubles, 8 floats, 8 int32_t, or 32 int8_t values, for example.

SIMD register

SIMD

If we can do the same operation on 4, 8, or 16 values with one instruction instead of a loop that runs that many times, that should be a huge speedup.

SIMD

SIMD instructions were added to x86 with the SSE (Streaming SIMD Extensions) instructions (Pentium III, 1999) with 128-bit wide operations (%xmmn registers). More instructions were added in SSE2, SSE3, SSSE3, SSE4.

SIMD

The AVX (Advanced Vector Extensions) extension added 256-bit operations (%ymmn registers) and AVX2 added more instructions. That's why we have been demanding Haswell or later all semester: it's the start of AVX2 and we want the 256-bit SIMD functionality.

AVX-512 adds the %zmmn registers, which are 512 bits wide, and a bunch more instructions, but you probably don't have it.

SIMD

Up to now, using the half-as-big data types (int32_t instead of int64_t; float instead of double) was a fairly marginal benefit: more would fit in cache, maybe the instructions that work on them would be a little faster.

But now: twice as many will fit in the SIMD registers, so one instruction can operate on twice as many.

Even more if we can use int16_t or int8_t (or half-precision float with AVX-512).

SIMD Instructions

There are SIMD vector instructions for the operations we have been doing on single values (or scalars).

These are generally described as vector instructions or packed operations (because the vector registers contain several values packed together).

Often the 256-bit vector versions have a v prefix.

SIMD Instructions

Most (all?) of the vector instructions take separate source and destination operands. Basically, instead of having to do this kind of operation,

addsd %xmm0, %xmm1
xmm1 += xmm0;

… we can do:

vaddpd %ymm0, %ymm1, %ymm2
ymm2 = ymm0 + ymm1;

In AT&T syntax, the destination is still the last operand.

SIMD Instructions

For floating point values:
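A few representative packed arithmetic instructions (a sample of the standard AVX mnemonics, not a complete list):

  • vaddpd / vaddps: packed add (double / single precision)
  • vsubpd / vsubps: packed subtract
  • vmulpd / vmulps: packed multiply
  • vdivpd / vdivps: packed divide
  • vsqrtpd / vsqrtps: packed square root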

SIMD Instructions

For integers:
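Again, a sample rather than a complete list:

  • vpaddb / vpaddw / vpaddd / vpaddq: packed add (8-, 16-, 32-, 64-bit integers)
  • vpsubb / vpsubw / vpsubd / vpsubq: packed subtract
  • vpmulld: packed 32-bit multiply (keeping the low 32 bits of each product)
  • vpand / vpor / vpxor: packed bitwise AND / OR / XOR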

SIMD Instructions

You're going to be surprised how many move instructions there are:

  • vmovupd: MOVe Unaligned Packed Doubles
  • vmovapd: MOVe Aligned Packed Doubles
  • vmovups: MOVe Unaligned Packed Singles
  • vmovaps: MOVe Aligned Packed Singles
  • vmovdqa: MOVe Double Quadword integers, Aligned
  • vmovdqu: MOVe Double Quadword integers, Unaligned

SIMD Instructions

Why?

For the different types, there can be a small slowdown when crossing domains. Roughly (I think) the scheduling of integer vs floating point SIMD operations is separate enough that mixing sides adds complexity.

There can also be a small speed penalty for reading memory that isn't aligned, i.e. when the address of the data isn't divisible by the size of the register (16 bytes for %xmmn, 32 bytes for %ymmn).

SIMD Instructions

So all of those move instructions have the same effect, perhaps at a slightly different speed. You have my permission to use them interchangeably. Use the unaligned ones unless you know something about the alignment of your data.

Another hint that these instructions aren't designed for humans.

SIMD Instructions

Note that all of the move instructions to get data into the SIMD registers rely on the data being adjacent in memory.

If what we have is an array of int64_t or float, the data is already laid out the way these instructions want: we can load several values into a register and do some calculation. No problem.

SIMD Instructions

But if we have an array of struct or class values, we don't have the memory layout we need.

struct rgb {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
};

If we want to do some calculation on many of those values (e.g. convert to HSV), it's hard to get all of the red values into a SIMD register to start working with them.

SIMD Instructions

So, an array-of-structs or array-of-objects is hard to work with using SIMD operations.

If we had an array of reds, another of greens, and another of blues, then it's easy.

I like classes/objects. If I need absolute maximum numerical performance, maybe I have a hard choice to make.
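For illustration (hypothetical sizes, restating the struct from above), here are the two layouts side by side:

struct rgb {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
};
rgb pixels[1024];          // array-of-structs: each red is 3 bytes from the next

struct rgb_planes {
    uint8_t red[1024];     // struct-of-arrays: all the reds adjacent,
    uint8_t green[1024];   // easy to load 32 of them into a %ymmn register
    uint8_t blue[1024];
};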

SIMD Instructions

There are gather instructions (e.g. vgatherdps/vgatherqps) that can do addressing more like single-value assembly instructions (like 8(%rbp, %rcx, 4)). They might be able to get array-of-object values into SIMD registers.

They seem… challenging.

SIMD Instructions

But let's push through and try our old friend summing an array of double with the floating point SIMD instructions…

When loading data into the %ymmn registers, we'll read 32 adjacent bytes from memory. Then we can do vector operations to add up (what I think of as) columns.

SIMD Instructions

e.g. if we have 12 elements in the array, a, b, c, …, l, vector operations will get us:

vector add, step 1

Then we'll do the horizontal add to get the final sum.

SIMD Instructions

Also, I'm going to assume an array length divisible by four (or however many values are held in the registers for other types). We could deal with any remaining values, but let's simplify.

Our function will have this signature: both arguments go in the general purpose registers (%rdi, %rsi) and we return in %xmm0.

double double_sum_2(double* array, uint64_t length);

SIMD Instructions

The basic loop is easy enough: %ymm0 as a 4-part accumulator, %rcx as a counter, move four steps along with each iteration.

double_sum_2:
    vxorpd %ymm0, %ymm0, %ymm0
    mov $0, %rcx
ds2_loop:
    cmp %rsi, %rcx
    jae ds2_return
    vaddpd (%rdi, %rcx, 8), %ymm0, %ymm0
    add $4, %rcx
    jmp ds2_loop
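In rough C++ terms (a sketch, not part of the original assembly), the loop keeps four running sums at once:

double partial[4] = {0.0, 0.0, 0.0, 0.0};    // vxorpd: zero the accumulator
for (uint64_t i = 0; i < length; i += 4) {   // %rcx steps by 4
    partial[0] += array[i];                  // a single vaddpd does all four
    partial[1] += array[i + 1];
    partial[2] += array[i + 2];
    partial[3] += array[i + 3];
}
// ...the "horizontal add" below then combines partial[0..3] into one double.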

SIMD Instructions

vxorpd %ymm0, %ymm0, %ymm0
vaddpd (%rdi, %rcx, 8), %ymm0, %ymm0

Note the v instructions with the separate destination operand, and %ymmn registers.

SIMD Instructions

When we get to the end of the loop, it gets strangely tricky to finish the sum. The idiom seems to be: first, copy the upper 128 bits of %ymm0 into the lower 128 bits of another register, so everything we need is in %xmmn registers.

The vextractf128 instruction does that. The first operand indicates which 128-bit chunk: we want the high 128 bits.

vextractf128 $0x1, %ymm0, %xmm1

SIMD Instructions

vextractf128 $0x1, %ymm0, %xmm1

After that instruction:

vector add, step 2

SIMD Instructions

The two xmm registers can be added easily:

vaddpd %xmm1, %xmm0, %xmm0

Now:

vector add, step 3

SIMD Instructions

Then we can shuffle element 1 to position 0 in another register:

vshufpd $0b01, %xmm0, %xmm0, %xmm1
vector add, step 4

SIMD Instructions

vector add, step 4

A final scalar add will get the sum we want:

vaddsd %xmm1, %xmm0, %xmm0

SIMD Instructions

The full horizontal add after the loop:

ds2_return:
    vextractf128 $0x1, %ymm0, %xmm1
    vaddpd %xmm1, %xmm0, %xmm0
    vshufpd $0b01, %xmm0, %xmm0, %xmm1
    vaddsd %xmm1, %xmm0, %xmm0
    ret

This was not meant for people to write, but as an idiom, it's fairly re-usable.

Don't worry about the details of these instructions; just know what this code fragment does: it adds the four double-precision lanes of %ymm0 into a single scalar sum in %xmm0.

SIMD Performance

Did we actually get any speed for all that? The non-SIMD version (using addsd for all the additions) took ∼2.4 cycles per element (on my processor).

This version (mostly vaddpd and moving 4 steps per loop): ∼1.5 cycles per element.

A <2× speedup, not 4. Likely inference: the vector instructions are slightly slower than the scalar instructions, but we get lots more work done.

SIMD Performance

A plain C implementation takes ∼2.5 cycles per element: we get to see the compiler's reluctance to rearrange our arithmetic (and the C version can deal with array lengths not divisible by 4).
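The C version being measured is presumably a loop like this (a sketch; the measured code may differ in details):

double double_sum(double* array, uint64_t length) {
    double total = 0.0;
    for (uint64_t i = 0; i < length; i++) {
        total += array[i];   // one addition per element, strictly in order
    }
    return total;
}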

Remember that we decided here to do the additions in a different order. That means different accumulation of error. Was that the right choice? Depends on the problem. See MACM 316.

SIMD Performance

What if we don't need the precision/range of double-precision? We can do the same thing with single precision and do eight additions per iteration.

The main loop isn't so different (vaddps for single-precision addition; add 8 to counter):

float_sum_2:
    vxorps %ymm0, %ymm0, %ymm0
    mov $0, %rcx
fs2_loop:
    cmp %rsi, %rcx
    jae fs2_return
    vaddps (%rdi, %rcx, 4), %ymm0, %ymm0
    add $8, %rcx
    jmp fs2_loop

SIMD Performance

The horizontal add is one step longer:

fs2_return:
    vextractf128 $0x1, %ymm0, %xmm1
    vaddps %xmm1, %xmm0, %xmm0
    vshufps $0b00001110, %xmm0, %xmm0, %xmm1
    vaddps %xmm1, %xmm0, %xmm0
    vshufps $0b00000001, %xmm0, %xmm0, %xmm1
    vaddss %xmm1, %xmm0, %xmm0
    ret

SIMD Performance

Or in a diagram:

single-precision horizontal add

SIMD Performance

The C equivalent on float still takes ∼2.1 cycles per element, and a non-vector assembly (addss) implementation also takes ∼2.1.

The single-precision SIMD version: ∼0.7 cycles per element.

Compared to double precision: do twice as many things per iteration; iterate half as many times; twice as fast.

SIMD Performance

How fast is 0.7 cycles per element?

This was a 5.5 GHz processor, so that's \(7.86\times 10^{9}\) single-precision values per second.

\[ (7.86\times 10^{9}\mbox{ values/s}) \times (4\mbox{ B/value})\\ = 31.4\mbox{ GB/s}\,. \]

SIMD Performance

That system's DDR5-6000 should have a peak transfer rate of 48 GB/s (6000 MT/s × 8 bytes per transfer). The 31.4 GB/s of processing is getting close to that.

Summing an array is a very minimal calculation: anything more complex (e.g. a dot product) is going to be more processor-limited.

SIMD Performance

Aside: GDB can print the contents of a SIMD register, but it doesn't know the type of the contents, so it shows every possible interpretation and you have to pick the right one.

(gdb) print $ymm0
$2 = {v16_bfloat16 = {-1.075e+08, 0.09961, -1.075e+08, 0.1992, -1.592e-23, 0.2988, -1.075e+08, 0.3984, 0, 0.5, -1.592e-23, 0.5977, 4.168e-08, 0.6992, -1.075e+08, 0.7969}, v16_half = {
    -19.203, 1.4492, -19.203, 1.5742, -0.0027351, 1.6494, -19.203, 1.6992, 0, 1.75, -0.0027351, 1.7744, 0.22498, 1.7998, -19.203, 1.8242},
    v8_float = {0.100000001, 0.200000003, 0.300000012, 0.400000006, 0.5, 0.600000024, 0.699999988, 0.800000012},
    v4_double = {1.3411048210842936e-08, 3.4332283476601335e-06, 9.7656287607605918e-05, 0.00087890645809238774}, v32_int8 = {-51, 
    -52, -52, 61, -51, -52, 76, 62, -102, -103, -103, 62, -51, -52, -52, 62, 0, 0, 0, 63, -102, -103, 25, 63, 51, 51, 51, 63, -51, -52, 76, 63}, v16_int16 = {-13107, 15820, -13107, 15948, 
    -26214, 16025, -13107, 16076, 0, 16128, -26214, 16153, 13107, 16179, -13107, 16204}, v8_int32 = {1036831949, 1045220557, 1050253722, 1053609165, 1056964608, 1058642330, 1060320051, 
    1061997773}, v4_int64 = {4489188110458735821, 4525216907491121562, 4546834186568204288, 4561245704520151859}, v2_int128 = {83475518170512110947896037946428280013, 
    84140132168590259879230348149687058432}}

These were single-precision float, so v8_float (vector of 8 float) was the right interpretation. (You can also ask GDB for just that view: print $ymm0.v8_float.)

Vectorclass Library

Actually writing assembly with the SIMD instructions becomes complex quickly.

There are a few reasons you don't generally need to…

Vectorclass Library

The compiler might help [later topic: “compiler optimization”]. You might also be able to use a library to help.

As an example, consider Agner Fog's vectorclass library. It lets you explicitly access SIMD functionality from C++, in a way that can be directly translated to SIMD instructions.

Vectorclass Library

I find it easiest to clone the Git repository, or download+unpack.

git clone --branch v2.02.01 https://github.com/vectorclass/version2 vectorclass

Then in your code,

#include "vectorclass.h"

And when compiling, tell the compiler to search that directory for headers:

g++ -Ivectorclass …

Vectorclass Library

The library provides C++ types to represent values packed into the SIMD registers. The 256-bit-wide types we're most interested in:

Type    Contents  Count
Vec4q   int64_t   4
Vec8i   int32_t   8
Vec16s  int16_t   16
Vec4d   double    4
Vec8f   float     8

Vectorclass Library

Any of those types can be initialized by a constructor that takes either a single constant to put the same value in all of the fields (i.e. broadcast), or n values to set each field.

Vec8f acc = Vec8f(0.0);
Vec8f numbers = Vec8f(0, 1, 2, 3, 4, 5, 6, 7);

Vectorclass Library

For basic operations, the operators you expect are overloaded to do element-by-element operations on the values.

e.g. these represent eight additions and multiplications:

acc += numbers;
acc = acc * numbers;

With the values initialized as on the last slide, we expect acc to be these eight single-precision values:

0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0

Vectorclass Library

Is it? We can index individual values out of the vector and look:

void print_float_vec(Vec8f v) {
    cout << "{ ";
    for( unsigned i=0; i < v.size(); i++ ) {
        cout << v[i] << ' ';
    }
    cout << "}\n";
}

Yes.

{ 0 1 4 9 16 25 36 49 }

Vectorclass Library

Operations with this library are still very low-level: we're usually getting one assembly instruction per C++ operation. (e.g. still have to write the loops, deal with the end of the array if length not divisible by the vector register size, etc.)

But they're a lot easier to spell, and we get type checks to catch many mistakes.

Vectorclass Library

e.g. this will fail to compile.

Vec8f singles = …;
Vec4d doubles = …;
doubles += singles;
error: no match for ‘operator+=’ (operand types are
‘Vec4d’ and ‘Vec8f’)

Analogous assembly would assemble, run, and produce incoherent results.

Vectorclass Library

The types have a .load method that can load values from a memory location into a SIMD register (the vmovupd instruction or similar). They take a pointer to the corresponding number/type of values:

double array[8] = {0, 10, 20, 30, 40, 50, 60, 70};
Vec4d a, b;
a.load(array);
b.load(array + 4);

After that, we expect a and b to contain:

{ 0 10 20 30 }
{ 40 50 60 70 }

Vectorclass Library

Also .store() to write a register to memory.

There are convenience functions for useful operations. Most notably, horizontal_add, which adds up the values in the register. With it, we can re-implement the sum of array values in C++…

Vectorclass Library

float array_sum_vc(float* array, uint64_t length) {
    assert(length % 8 == 0);
    Vec8f acc = Vec8f(0);
    Vec8f tmp;
    for (uint64_t i = 0; i < length; i += 8) {
        tmp.load(array + i);
        acc += tmp;
    }
    return horizontal_add(acc);
}

Notes: assumes length is divisible by 8; must explicitly load values from memory into (we assume) a %ymmn register.
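One way to handle lengths that aren't a multiple of 8 (a sketch using only the pieces above): do the full vectors as before, then finish the leftover elements with a scalar loop.

float array_sum_vc_any(float* array, uint64_t length) {
    Vec8f acc = Vec8f(0);
    Vec8f tmp;
    uint64_t i = 0;
    for (; i + 8 <= length; i += 8) {    // as many full 8-wide vectors as fit
        tmp.load(array + i);
        acc += tmp;
    }
    float total = horizontal_add(acc);
    for (; i < length; i++) {            // at most 7 leftover elements
        total += array[i];
    }
    return total;
}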

Vectorclass Library

With that implementation, we get the same performance as the hand-written SIMD assembly (and in fact, almost exactly the same assembly).

A similar story on ARM in Rust: Using SIMD for Parallel Processing in Rust. They first rely on the compiler [later topic: “Auto Vectorization”], then use intrinsics, then the not-yet-stable std::simd module.

Summary

Of course, not all problems lend themselves to using SIMD instructions, but it can be surprising how many do.

If those problems occur in your life, examining their implementation for vector-based solutions can be a huge speed win.