We need to talk about how computers represent floating point values. Just like for integers, there are details that matter when it comes to working with them.

But before that, let's actually use them.

A quick primer before we do… there are two commonly-used floating point types: single- and double-precision.

Single is C's `float`

type is 32 bits and can represent values with ≈8 (decimal) digits of precision.

Double is C's `double`

type (and Python's `float`

and JavaScript's numeric type and …) is 64 bits and gives ≈15 significant digits.

There are two sets of floating point registers and sets of instructions in x86-64 processors (because of course there are).

Originally, the x86 processors didn't have floating point instructions: if a programmer (or compiler) wanted to do floating point stuff, it had to be done with whatever combination of other instructions (i.e. functions/libraries) would do the job.

The x87 instructions were implemented on optional coprocessors before the i486 (1989), and were then part of the processor itself.

They use registers `ST(0)`

to `ST(7)`

(optionally) as a stack. i.e. there's a zero-operand instruction that can pop `ST(0)`

and `ST(1)`

, divide them, and push the result into `ST(0)`

.

We're going to ignore the x87 instructions in favour of the newer instructions. Basically: if you see `ST(`

in the description of an instruction, that's `n`)**not** the one you want.

Compilers seem to do the same: with `-march=haswell`

(or other appropriate architecture specified), they seem to never emit x87 instructions.

We're going to work with the instructions that were introduced in the MMX instruction set and extended as SSE (Streaming SIMD Extensions), SSE2, SSE3, SSSE3, SSE4, AVX (Advanced Vector Extensions), and AVX2 (and AVX-512 that's not yet well supported, certainly not in our Haswell baseline).

I'm going to call them the vector instructions

, because I don't think they collectively have a name, even though we aren't (yet) using them as vector operations.

The vector instructions work on registers `%xmm0`

to `%xmm15`

(and their wider siblings `%ymm0`

–`%ymm15`

: more later).

There are a lot of registers, which is good, but **none** of them is call-preserved.

The streaming instructions can work on multiple values at once (vector processing or SIMD, single-instruction multiple-data) but we're going to come back to that later. [later topic: “SIMD”]

For now: look for `%xmm`

in the docs, and the word `n`scalar

in the description of the instruction, which indicates the non-vector version.

There are all of these ways to divide (and a few more). Of these, `fdiv`

is **not** the one you want: it's x87.

`div`

:**unsigned integer**divide on the general purpose registers (`%rax`

and friends).`idiv`

:**signed integer**divide on the general purpose registers.`fdiv`

:**x87**double-precision floating point divide.`divss`

:Divide

.**Scalar****Single**-Precision Floating-Point Value`divsd`

:Divide

.**Scalar****Double**-Precision Floating-Point Value

Remember the calling convention we have been using for assembly: return value in `%rax`

, first argument in `%rdi`

, second argument in `%rsi`

, ….

Those apply to integer (and pointer) arguments.

For floating point values, the arguments are in `%xmm0`

, `%xmm1`

, …, `%xmm8`

. A floating point return value must be put in `%xmm0`

.

If you have both integer (or pointer) and floating point arguments to a function, `%rdi`

is the first integer argument and `%xmm0`

is the first floating point argument, no matter where they are in the list.

So in this function, `a`

and `b`

will be in `%rdi`

and `%rsi`

; `x`

and `y`

will be in `%xmm0`

and `%xmm1`

; return in `%rax`

.

uint64_t f(uint64_t a, uint64_t b, double x, double y);

Here, `length`

and `width`

in `%xmm0`

and `%xmm1`

; `count`

in `%rdi`

; return in `%xmm0`

.

double f(double length, double width, uint64_t count);

With those details out of the way, working with floating point is mostly straightforward.

Let's reproduce the first-ever assembly from the lab, except a floating-point version. The equivalent of:

double multiply_add_c(double a, double b, double c) { return a + b * c; }

There are instructions for double-precision floating point addition and multiplicaton: `addsd`

(ADD Scalar Double precision

) and `mulsd`

(MULtiply Scalar Double precision

).

Then to calculate and return `a + b * c`

:

# %xmm0: a and return # %xmm1: b # %xmm2: c multiply_add: mulsd %xmm2, %xmm1 # b *= c addsd %xmm1, %xmm0 # a += b ret

The floating point instruction set is much richer than integers in many ways (that we mostly won't discuss). It turns out there's a single instruction for that:

multiply_add_2: vfmadd231sd %xmm1, %xmm2, %xmm0 ret

The catchy mneumonic

and name `vfmadd231sd`

Fused Multiply-Add of Scalar Double Precision Floating-Point Values

are perhaps a hint that some of this functionality is designed more for compilers than humans.

But for the more normal

floating point instructions, they work like the integer instructions, implementing in-place (`+=`

and similar) operations. Just with different registers.

One notable thing we lose for floating point: immediate values or constants. There is no floating point/`%xmm`

register equivalent of:`n`

mov $1234, %rax

There is an idiom to set a register to zero. Three relevant facts:

- The floating point representation of 0.0 is all 0 bits.
- XORing anything with itself is all 0 bits.
`pxor`

is the`%xmm`

register XOR operation.`n`

This will set `%xmm4`

to 0.0:

pxor %xmm4, %xmm4

Or an instruction that sets **all** of the `%xmm`

registers to zero, in case you `n`*really* want to clean up:

vzeroall

Otherwise, the easiest thing to do it find the in-memory representation of the floating point value (which for now we just have to accept exists: details later).

Those bits can be loaded into a register.

Let's say we want to calculate the circumference of a circle: \(\pi\) times the diameter.

We need to find out how to represent \(\pi\) in binary in our code (in a `.data`

segment). I copied-and-pasted a decimal representation of \(\pi\) into float.exposed to find if I treat the bits like a double-precision float,

`0x400921fb54442d18`

≈ \(\pi\)

Then we can do the calculation by loading the constant through a general-purpose register:

circumference: movabs $0x400921fb54442d18, %rdi movq %rdi, %xmm1 mulsd %xmm1, %xmm0 ret

The `movq`

instruction is Move Quadword

. It's **not** `mov`

with the `q`

suffix.

`mulsd`

: Multiply Scalar Double Precision Floating-Point Value

.

Or we could load the constant from memory if we prefer:

.data pi: .quad 0x400921fb54442d18

circumference2: movq pi(%rip), %xmm1 mulsd %xmm1, %xmm0 ret

Let's try another problem similar to what we have done on integers: a comparison function that returns -1 for `a<b`

; 1 for `a>b`

; and 0 for `a==b`

.

The `comisd`

instruction (Compare Scalar Double

) will compare similarly to `cmp`

and set flags as `cmp`

would for **unsigned** integers. (i.e. `ja`

and `jb`

not `jg`

and `jl`

, and corresponding conditional moves.)

I'm going to use conditional moves:

double_cmp: mov $-1, %r11 # constant for cmov mov $1, %r10 # constant for cmov mov $0, %rax comisd %xmm0, %xmm1 cmova %r11, %rax cmovb %r10, %rax ret

Swapping in `cmp %rdi, %rsi`

would make this the unsigned integer version of the function.

Note that the cmov`x` instructions do **not** modify the status bits: we can do one status-bit-setting operation and then use those bits several times.

There is a `ptest`

instruction that's analogous to `test`

, but it doesn't set the sign flag, so probably `comisd`

and zero is a better tool for us to remember.

One more complication for floating point values: the `push`

and `pop`

instruction cannot be used with the `%xmm`

registers.`n`

But, remember that these code snippets do exactly the same thing:

push %rbx ⋮ pop %r14

sub $8, %rsp mov %rbx, (%rsp) ⋮ mov (%rsp), %r14 add $8, %rsp

The `%xmm`

registers can be read from or written to memory, so we can use the stack `n`manually

if we need to:

sub $8, %rsp movq %xmm5, (%rsp) ⋮ movq (%rsp), %xmm6 add $8, %rsp

Yet another hint that this end of the instruction set is designed more for compilers than people.

That's probably enough floating point instructions for now. Let's have a better look at what floating point

actually is…

Let's start with how fractional point values turn into binary.

Recall how we dealt with unsigned integers: in decimal, each digit represents the next power of ten. In binary, each bit represents the next power of two:

\[\begin{align*} 1001_{10} &= (1\times 10^3) + (0\times 10^2) + (0\times 10^1) + (1\times 10^0) \,, \\ 1001_2 &= (1\times 2^3) + (0\times 2^2) + (0\times 2^1) + (1\times 2^0) \,. \end{align*}\]

When we have fractional parts in decimal, we just add a decimal point, and count down negative powers of ten as we go right:

\[\begin{align*} 2.34_{10} = (2\times 10^0) + (3\times 10^{-1}) + (4\times 10^{-2}) \,. \end{align*}\]

We can do the same thing in binary:

\[\begin{align*} 10.01_{2} &= (1\times 2^1) + (0\times 2^0) + (0\times 2^{-1}) + (1\times 2^{-2}) \\ &= 2.25_{10}\,. \end{align*}\]

The positions to the right of the decimal

place in binary are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, … .

We can deal with values the same was as decimal. We can convert the fraction 557/128 to binary:

\[\begin{align*} \frac{557}{128} &= 4 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \frac{1}{32} + \frac{0}{64} + \frac{1}{128} \\ &= 100.0101101_2\,. \end{align*}\]

Not every value that has a non-repeating decimal expansion terminates in binary. e.g. 1/10:

\[ \tfrac{1}{10} = 0.1_{10} \,. \]

But in binary:

\[\begin{align*}\ \tfrac{1}{10} &= \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{256} + \tfrac{1}{512} + \tfrac{1}{4096} + \tfrac{1}{8192} + \cdots \\ &= 0.0001100110011\cdots \\ &= 0.0\overline{0011}_{2}\,. \end{align*}\]

Binary is straightforward enough, but what about *bits*? How can we store binary fractional things in a reasonable way in a computer?

Just like with integers, we will start by deciding how many bits we want to store for each value, typically 32 or 64.

We will use the IEEE 754 standard to represent values as \((-1)^s\cdot M\cdot 2^E\) where:

- \(s\) is one bit for the sign: 0 for positive, 1 for negative;
- \(M\) is the significand (or mantissa), fractional binary 0–2;
- \(E\) is the exponent, an integer.

The sign is easy enough. The other two need some explanation. They are more complicated than they sound…

We will use one bit for the sign, and split the others between bits for the significand (`frac`) and the exponent (`exp`).

A double precision floating point value (a C `double`

) uses 64 bits: 1 for `s`, 11 for `exp`, and 52 for `frac`:

A single precision floating point value (a C `float`

) uses 32 bits: 1 for `s`, 8 for `exp`, and 23 for `frac`.

How those bits get interpretted is a little complicated.

Note on notation: we're using `exp` and `frac` for the **bits** in the binary representation, and \(E\) and \(M\) for the values that go into the \((-1)^s\cdot M\cdot 2^E\) formula. There's translation between the two that we'll discuss.

Most IEEE 754 floating point values are normalized, so let's look at that case first. Values are normalized if `exp` is **anything except** all 0s or all 1s. Then to interpret the bits…

We decide on a bias for our exponent. For single precision, bias is 127; for double precision, it's 1023.

Then look at `exp` as an **unsigned** integer and, \( E = exp - bias \).

e.g. in single-precision we might have `exp` bits 01111001. Then, \[ E = 01111001_2 - \mathit{bias} = 121 - 127 = -6\,. \]

For normalized values, the `frac` bits are taken to be the right-of-decimal-point bits, **and add one**. e.g. if the single-precision `frac` is 01001011000000000000000, it represents \[\begin{align*} M &= 1 + 0.01001011000000000000000_2 \\ &= 1 + \tfrac{1}{4} + \tfrac{1}{32} + \tfrac{1}{128} + \tfrac{1}{256} \\ &= 1 + \tfrac{75}{256} = 1.29296875\,. \end{align*}\]

So, if the previous examples have sign bit 0, the 32-bit single-precision float would be `sign`+`exp`+`frac`:

00111100101001011000000000000000

and it represents

\[\begin{align*} (-1)^s\cdot M\cdot 2^E &= (-1)^0 \cdot 1.29296875 \cdot 2^{-6} \\ &= +0.02020263671875_{10} \,. \end{align*}\]

A floating point number is a denormalized value when the `exp` is all 0s.

Then the exponent is always \(E=1-\mathit{bias}\).

The `frac` values are interpreted as before: as the bits to the right of the decimal point, but **without adding one**.

For example, this is a denormalized single-precision value with the same `frac` bits as the previous example:

00000000001001011000000000000000

Since the bias is 127 for single-precision \(E=1-127=-126\) and this value is:

\[\begin{align*} (-1)^s\cdot M\cdot 2^E &= (-1)^0 \cdot 0.29296875 \cdot 2^{-126} \\ &\approx 3.4438311\times 10^{-39} \,. \end{align*}\]

For denormalized values, the significand is always going to be \(0\le M < 1\) and it's going to be multipled by a very-negative power of two (\(2^{-126}\) for single-precision; \(2^{-1022}\) for double-precision). So, all of the denormalized values are very close to zero.

For normalized values, the significand is \(1\le M < 2\) and multiplied by a power of two with a much larger range.

The normalized numbers cover most of the range for floating point values; the denormalized handle the very-close-to-zero values (and zero itself).

The normalized and denormalized values have a clean transition at \(2^{1-\mathit{bias}}\). The largest denormalized single:

00000000011111111111111111111111

\(= (0.111\ldots_2) \cdot 2^{1-\mathit{bias}} \approx 1.175494211\times 10^{-38} \).

The smallest positive normalized single:

00000000100000000000000000000000

\(= 1\cdot 2^{1-\mathit{bias}} \approx 1.175494351\times{}10^{-38}\).

There's a case we haven't looked at: `exp` is all 1s.

These represent special values. When `frac` is all 0, the value represents \(+\infty\) or \(-\infty\), depending on the sign bit.

e.g. in C, `1.0/0.0`

gives `inf`

.

With other `frac` values, the represented value is not a number or NaN.

e.g. in C, `sqrt(-1.0)`

gives `nan`

. It can also be used to represent something like missing data

.

Or all of that, all in one table:

exp | \(E\) | frac | \(M\) | Meaning |
---|---|---|---|---|

all 0 | \(1-\mathit{bias}\) | anything | 0.frac | \((-1)^s\cdot M\cdot 2^E\) |

all 1 | all 0 | \((-1)^s\cdot\infty\) | ||

all 1 | else | \((-1)^s\cdot\mathrm{NaN}\) | ||

else | \(\mathit{exp}-\mathit{bias}\) | anything | 1.frac | \((-1)^s\cdot M\cdot 2^E\) |

IEEE 754 specifies these floating point representations, and also how to do operations on them (`+`

, `/`

, `sqrt`

, …), rounding rules, exception handling (e.g. `1.0/0.0`

is `inf`

; `sqrt(-1.0)`

is `nan`

), etc.

All (most? all normal

?) modern CPUs and languages implement IEEE 754, so you have some assurance that if you do a floating point calculation in C on x86, or in Rust on ARM, you'll get **exactly** the same result.

To represent a real number \(x\) (as closely as possible) as a floating point value, we need to figure out the \(s\), \(M\), and \(E\) necessary for make \((-1)^s\cdot M\cdot 2^E\) equal to that value.

The \(s\) should be easy: 0 for positive, 1 for negative.

As long as we're not too close to 0, it will be a normalized value. Let's assume \(s=0\) (i.e. \(x\) is positive) for simplicity and then \(x = M\cdot 2^E\) and, \[ 1\le M < 2 \\ 2^E\le M\cdot 2^E < 2^{E+1} \]

So, \(2^E\) is the largest power of 2 less than or equal to \(x\) and \( E = exp - bias \).

Now we can figure out \(s\) and \(E\), so just have to choose an \(M\) so that, \[\begin{align*} x &= (-1)^s\cdot M\cdot 2^E \\ M &= \frac{x}{(-1)^s \cdot 2^E} \,. \end{align*}\]

We still know \(1\le M < 2\), so it's just a question of picking the right-of-decimal-point bits to make the closest possible approximation of our value.

Let's try to represent \(+\frac{1}{10}\) in single-precision…

We can start with \(s=0\) and \[2^{-4} \le \tfrac{1}{10} < 2^{-3}\,,\] so \(E=-4\). The bias for single-precision is 127 for single-precision, so `exp` needs to be \[\begin{align*} -4+127 &= 123_{10} \\ &= 01111011_{2} \end{align*}\]

Then, \[ M = \frac{x}{(-1)^s \cdot 2^E} = \frac{\frac{1}{10}}{1\cdot 2^{-4}} = \tfrac{16}{10} = \tfrac{8}{5}\,. \]

Now we have 23 bits for `frac` and just need to get as close as possible to \(M=\tfrac{8}{5}\). \[\begin{align*} \tfrac{8}{5} &\approx 1 + \tfrac{1}{2^{1}} + \tfrac{1}{2^{4}} + \tfrac{1}{2^{5}} + \tfrac{1}{2^{8}} + \tfrac{1}{2^{9}} + \tfrac{1}{2^{12}} + \tfrac{1}{2^{13}}+\cdots \\ &\approx 1.10011001100110011001101_2\,. \end{align*}\]

… and take the right-of-decimal-point bits as `frac`.

After all of that, we got this for \(\frac{1}{10}\):

00111101110011001100110011001101

Or we could have used a converter. We can double-check in C:

uint32_t bits = 0b00111101110011001100110011001101; float f = *(float*)&bits; // treat that as single-precison printf("%.15f\n", f);

0.100000001490116

Another example that's worth remembering…

The all-0-bits value in any IEEE floating point will have `exp` all-0s, so it's a denormalized value. Also, \(s=0\) so it's positive.

The significand is going to be 0.000…, so it represents

\[ (-1)^s\cdot M\cdot 2^E = (-1)^0\cdot 0.0\cdot 2^E = +0.0\,. \]

If the first bit (\(s\)) was 1, that represents \(-0.0\).

Small integers can be represented exactly. For example, \(1=+2^0\cdot{}1\) so we can represent it with a normalized value, \(E=0\) and \(S=1.000\ldots\).

In single-precision, that would be:

00111111100000000000000000000000

Also, rational numbers with (small-ish) power-of-two denominators: \(3/2=1.5\) is:

00111111110000000000000000000000

\(-3/8=-0.375\):

10111110110000000000000000000000

There are several sizes

of IEEE floating point values that are commonly used. We have seen single- and double-precision, but there are others (later).

The largest and smallest values possible for a floating point type is controlled mostly by the number of bits in the exponent: we want the largest possible exponent without being one of the special

values (111…1110) and the largest possible significand (111…111).

0111…110111111…111111

For single-precision, that's \(\pm 3.40\times 10^{38}\). For double, \(\pm 1.80\times 10^{308}\).

The number of bits in the significand determines how precisely we can represent a value: more bits means more precision in \(M\).

For example, we couldn't represent \(\frac{1}{10}\) exactly, but the closest single-precision values are:

0.0999999940395355224609

0.1000000014901161193850

In double-precision, the closest choices are:

0.0999999999999999916733

0.1000000000000000055510

So we have a tradeoff: more bits means more range and precision, but also more memory used for each value.

Storing 32- vs 64-bits doesn't sound like a big deal, but if you have an array of millions of values, it might be. There are also times when calculating on single-precision is about twice as fast as double-precision. [later topic: “SIMD”]

There are some other choices of floating-point size you might see…

Name | Total Bits | exp bits | frac bits | Bias | Common Use |
---|---|---|---|---|---|

Double precision | 64 | 11 | 52 | 1023 | precision or range matters, general use |

Single precision | 32 | 8 | 23 | 127 | if tradeoff of precision/range for size/speed makes sense |

Half precision | 16 | 5 | 10 | 15 | graphics, deep learning where speed matters much more than precision and/or range |

bfloat16 | 16 | 8 | 7 | 127 | |

Minifloat | 8 | 4 | 3 | 1? | Lecture examples? Deep Learning? |

There are x86 instructions for single- and double-precision, and those are by far the most commonly used.

The (not yet widely supported) AVX-512 instructions add some half precision and bfloat16 vector instructions.

Bfloat16 is supported in Apple M2 chips and some machine learning accelerators (like Google's TPUs).

We're going to have a limited number of bits we can store (probably 32 or 64), so we can't represent *every* real (or rational) number: there are too many of them.

In decimal, we can't represent \(\frac{1}{3}\) exactly (with a finite number of digits). We can get close with \(0.3333333_{10}\), but never an exactly equal value.

In floating point, we saw that we can't represent 1/10 exactly, but we can get pretty close in single-precision. That's what you get when you type

in C.`float x = 0.1`

Doing

gets closer, but still not exact.`double x = 0.1`

This isn't a C-specific problem. Many programmers are alarmed by things like this in Python (which uses double-precision IEEE 754 for its `float`

type):

>>> 0.1 + 0.2 0.30000000000000004 >>> 0.3 0.3 >>> 0.1 + 0.2 == 0.3 False

The rounding-off that's done with `0.1`

and `0.2`

cause their sum to be slightly different than the rounding on the literal value `0.3`

.

Languages make different choices about what they *display*, but if they're using IEEE 754, the results are the same. This…

double x1 = 0.1; double x2 = 0.2; double x3 = 0.3; printf("%g %g\n", x1 + x2, x3);

prints:

0.3 0.3

But this…

printf("%.17g\n", x1 + x2); printf("%.17g\n", x3); printf("%s\n", x1+x2 == x3 ? "true" : "false");

prints:

0.30000000000000004 0.29999999999999999 false

Note that even the smallest difference, many decimal places away from the part of the number you care about makes floating point values not-equal. Differences in rounding make it hard to predict where they will occur.

>>> 0.29999999999999999 == 0.3 True >>> 0.30000000000000004 == 0.3 False >>> 0.29999999999999999 == 0.30000000000000004 False >>> 0.6/2 == 0.1+0.2 False

As a result, comparing floating point values for equality is almost never the right thing to do.

double x = ... double y = ... if ( x == y ) { /* nope */ } if ( abs(x - y) < 0.0000001 ) { /* better */ }

The two 16-bit floating point options, half and bfloat16, can give us a better idea of the tradeoffs in the representation.

Half-precision: 5 bits for `exp`, 10 for `frac`, and a bias of 15.

Bfloat16: 8 bits for `exp`, 7 for `frac`, and a bias of 127.

Half-precision is balanced more like single and double: many more bits for `frac` then `exp`. That leaves it with a smaller range: the limits are \(\pm 65504\).

Bfloat16 has the same number of bits for `exp` as single precision and takes the 16 bits from `frac`. Its limits are \(\pm 3.39\times 10^{38}\): basically the same as singles. It chooses range over precision.

For half-precision values, the next value after 1.0 that can be represented is

0011110000000001 = 1.0009765625.

For bfloat16, it's

0011111110000001 = 1.0078125.

Less precision, more range. It was designed for applications like accelerated machine learning where speed is much more important than precision, so that makes sense.

With 64 bits, we get floating point range \(\pm 1.80\times 10^{308}\) and with 32 bits, \(\pm 3.40\times 10^{38}\).

For 64-bit signed integers, \(\pm 9.22\times 10^{18}\) and for 32-bit signed, \(\pm 2.15\times 10^{9}\).

So if we're working with big numbers, floating point loses precision, but we get much larger range

The design of the floating point representation ensures that representable values are closer together near zero, and farther apart for larger values (or more negative values).

It's hard to see the details, but easier if we drop to an extremely small example…

Let's look at an 8-bit floating point values with 1 bit for sign, 4 for exponent, 3 for significand, and a bias of 1. With one byte, there are only 256 unique values, and they are listed on the Wikipedia Minifloat page.

The rows that start 0 0000

and 1 0000

are the denormalized values. The 0 1111

and 1 1111

rows are the special values. The rest are normalized values.

Some observations:

- Within each exponent, there is a linear sequence of values that can be represented (e.g. 1, 1.125, 1.25, 1.375, 1.5, 1.625, 1.75, 1.875 for
`exp`=0001). - The error you're going to get on a single value (worst-possible rounding from
truth

to a representable value) is relative to the magnitude of the value.

- Within its range, minifloat values can represent values within about 5% of their
true

value. Bad, but consistently bad across the range. - Positive and negative zero are
**different**values (e.g. in double precision,`1/1e400 == +0.0`

and`-1/1e400 == -0.0`

).

Single- and double-precision are basically that, but more.

More bits for `exp` mean a larger range, and more precision around zero (because more powers-of-2 in both the positive and negative direction).

More bits for `frac` mean more precision in general: more bits to the right of the decimal place, so can represent values more exactly.

Precision errors are something you might not like in floating point calculations, but can't avoid. For more: MACM 316.

There may be cases where floating point simply is not the right data type for the calculations you're doing, even though it might look like it initially.

e.g. for arithmetic on money values, you would be annoyed to find out that $0.20 + $0.10 ≠ $0.30. The usual advice is to not think of $12.34

as the float 12.34, but as an integer number of cents, 1234¢

(or integer number of whatever the smallest unit your currency has).

Then do integer arithmetic and format as dollars only for display.

e.g. There are libraries for rational numbers (`fractions.Fraction`

in Python, Boost's Rational class for C++) and arbitrary-precision floating point (`decimal.Decimal`

in Python, Boost.Multiprecision for C++).

Calculations on them will be slower than raw floating point calculations, but correct code is better than fast code: if you need them, use them.

An application of FP that's maybe surprising: recording audio.

To store audio, we need to represent a sample

of the sound wave many times per second (44.1 kHz or so). How do we represent those samples? Traditionally with 16-bit integers (maybe 24- or 32-bit).

e.g. CD quality

audio = 16-bit integers, 44.1 kHz.

The issue for the moment: if the audio is too loud, it can clip

. Basically, the tops/bottom of the waveforms get cut off.

When *recording* audio, levels can be very different, depending on the situation (microphone sensitivity, loudness of the sound, how close the microphone is).

Traditionally, it was very important to set gain

so the levels were large enough to be captured (there aren't many integers near zero, so the signal needs to be relatively loud so the detail is there), but not so high signals clip (which distorts the sound).

For 16-bit signed integers, we could represent the level

of the waveform in the range -32768–32767.

These standards come from a time when storage and compute were very limited: even 16-bit integers were good enough and we could store them and manipulate them (44.1k times per second) on computers we had at the time.

But if we switched to 32-bit floating point values, the range would become \(\pm 3.40\times 10^{38}\) and clipping is almost impossible.

Also, really quiet sounds still have good fidelity because of the precision of floats closer to zero.

It turns out 32-bit float audio is a thing. It's amusing to hear people talk about single-precision float like it's a *new* thing.

Basically, sound level doesn't matter when recording 32-bit float (until things are so loud your microphone was probably going to explode anyway). Fix it in post.

When it comes to audio *playback*, integers are probably fine: levels have to be in a kind-of-reasonable range so different audio has similar levels. (i.e. you expect volume at 70% should be similarly-loud for all songs, so wildly-different levels aren't worth worrying about.)

One thing that should be clear: IEEE floating point numbers are a decent approximation of real numbers (i.e. \(\mathbb{R}\) from the math class), but they are definitely different. For **real numbers**, we know the basic operations are associative:

\[ (a+b)+c = a+(b+c)\,. \]

Addition and multiplication on floating point numbers is **not** associative:

>>> (0.7 + 0.2) + 0.1 0.9999999999999999 >>> 0.7 + (0.2 + 0.1) 1.0 >>> (0.12345678 * 0.12345678) * 0.56789 0.008655538894467976 >>> 0.12345678 * (0.12345678 * 0.56789) 0.008655538894467974

In other words: algebra (as you understand it) doesn't work on floats.

\[ (a+b)+c \approx a+(b+c)\,,\\ (a\times b)\times c \approx a\times (b\times c)\,. \]

It turns out we have seen this already. You just didn't notice.

When adding arrays in the pipeline demo, I used `float`

(single precision) as the array elements.

#define DATA_T float

Then did a different calculation that had different performance…

DATA_T array_sum_1(DATA_T* arr, uint64_t n) { DATA_T total = 0; for (uint64_t i = 0; i < n; i++) { total += arr[i]; } return total; }

DATA_T array_sum_3(DATA_T* arr, uint64_t n) { DATA_T total1 = 0, total2 = 0; for (uint64_t i = 0; i < n - 1; i += 2) { total1 += arr[i]; total2 += arr[i + 1]; } return total1 + total2; }

The output from summing the same array of small numbers with those functions:

array_sum_1 ... Result: 173.1003113 array_sum_2 ... Result: 173.1003113 array_sum_3 ... Result: 173.1308441

The additions were done in a different order, so the rounding errors accumulated differently.

Summing many small values (exactly like this problem) is one of the places rounding errors can get really bad.

I also said at the time that the speed difference in that demo disappeared when changing from `float`

to an integer type.

At least part of the reason: integer addition **is** associative*. The compiler is free to change the order of operations all it wants, because the result of the calculation will be the same. It *might have* done that optimization for us.

The compiler won't rearrange floating point operations because of the implicit contract you have with it: the compiled code is going to produce exactly the result you asked for.

The compiler can rearrange/change/skip operations as much as it likes, as long as the result is **exactly** the same.

The result: we'll generally see much more modest compiler optimizations around FP code. The compiler is almost always bound to do exactly the calculations you requested, in exactly the order you wrote them.

We'll talk more later about how to ask the compiler nicely to be less precise with floating point operations.