Floating Point

Floating Point

We need to talk about how computers represent floating point values. Just like for integers, there are details that matter when it comes to working with them.

But before that, let's actually use them.

Floating Point

A quick primer before we do… there are two commonly-used floating point types: single- and double-precision.

Single is C's float type is 32 bits and can represent values with ≈8 (decimal) digits of precision.

Double is C's double type (and Python's float and JavaScript's numeric type and …) is 64 bits and gives ≈15 significant digits.

x86 Floating Point

There are two sets of floating point registers and sets of instructions in x86-64 processors (because of course there are).

Originally, the x86 processors didn't have floating point instructions: if a programmer (or compiler) wanted to do floating point stuff, it had to be done with whatever combination of other instructions (i.e. functions/​libraries) would do the job.

x86 Floating Point

The x87 instructions were implemented on optional coprocessors before the i486 (1989), and were then part of the processor itself.

They use registers ST(0) to ST(7) (optionally) as a stack. i.e. there's a zero-operand instruction that can pop ST(0) and ST(1), divide them, and push the result into ST(0).

x86 Floating Point

We're going to ignore the x87 instructions in favour of the newer instructions. Basically: if you see ST(n) in the description of an instruction, that's not the one you want.

Compilers seem to do the same: with -march=haswell (or other appropriate architecture specified), they seem to never emit x87 instructions.

x86 Floating Point

We're going to work with the instructions that were introduced in the MMX instruction set and extended as SSE (Streaming SIMD Extensions), SSE2, SSE3, SSSE3, SSE4, AVX (Advanced Vector Extensions), and AVX2 (and AVX-512 that's not yet well supported, certainly not in our Haswell baseline).

I'm going to call them the vector instructions, because I don't think they collectively have a name, even though we aren't (yet) using them as vector operations.

x86 Floating Point

The vector instructions work on registers %xmm0 to %xmm15 (and their wider siblings %ymm0%ymm15: more later).

There are a lot of registers, which is good, but none of them is call-preserved.

x86 Floating Point

The streaming instructions can work on multiple values at once (vector processing or SIMD, single-instruction multiple-data) but we're going to come back to that later. [later topic: “SIMD”]

For now: look for %xmmn in the docs, and the word scalar in the description of the instruction, which indicates the non-vector version.

x86 Floating Point

There are all of these ways to divide (and a few more). Of these, fdiv is not the one you want: it's x87.

  • div: unsigned integer divide on the general purpose registers (%rax and friends).
  • idiv: signed integer divide on the general purpose registers.
  • fdiv: x87 double-precision floating point divide.
  • divss: Divide Scalar Single-Precision Floating-Point Value.
  • divsd: Divide Scalar Double-Precision Floating-Point Value.

Calling Convention

Remember the calling convention we have been using for assembly: return value in %rax, first argument in %rdi, second argument in %rsi, ….

Those apply to integer (and pointer) arguments.

Calling Convention

For floating point values, the arguments are in %xmm0, %xmm1, …, %xmm8. A floating point return value must be put in %xmm0.

If you have both integer (or pointer) and floating point arguments to a function, %rdi is the first integer argument and %xmm0 is the first floating point argument, no matter where they are in the list.

Calling Convention

So in this function, a and b will be in %rdi and %rsi; x and y will be in %xmm0 and %xmm1; return in %rax.

uint64_t f(uint64_t a, uint64_t b, double x, double y);

Here, length and width in %xmm0 and %xmm1; count in %rdi; return in %xmm0.

double f(double length, double width, uint64_t count);

FP Instructions

With those details out of the way, working with floating point is mostly straightforward.

Let's reproduce the first-ever assembly from the lab, except a floating-point version. The equivalent of:

double multiply_add_c(double a, double b, double c) {
    return a + b * c;

FP Instructions

There are instructions for double-precision floating point addition and multiplicaton: addsd (ADD Scalar Double precision) and mulsd (MULtiply Scalar Double precision).

Then to calculate and return a + b * c:

# %xmm0: a and return
# %xmm1: b
# %xmm2: c
    mulsd %xmm2, %xmm1  # b *= c
    addsd %xmm1, %xmm0  # a += b

FP Instructions

The floating point instruction set is much richer than integers in many ways (that we mostly won't discuss). It turns out there's a single instruction for that:

    vfmadd231sd %xmm1, %xmm2, %xmm0

The catchy mneumonic vfmadd231sd and name Fused Multiply-Add of Scalar Double Precision Floating-Point Values are perhaps a hint that some of this functionality is designed more for compilers than humans.

FP Instructions

But for the more normal floating point instructions, they work like the integer instructions, implementing in-place (+= and similar) operations. Just with different registers.

FP Instructions

One notable thing we lose for floating point: immediate values or constants. There is no floating point/​%xmmn register equivalent of:

mov $1234, %rax

FP Instructions

There is an idiom to set a register to zero. Three relevant facts:

  • The floating point representation of 0.0 is all 0 bits.
  • XORing anything with itself is all 0 bits.
  • pxor is the %xmmn register XOR operation.

This will set %xmm4 to 0.0:

pxor %xmm4, %xmm4

FP Instructions

Or an instruction that sets all of the %xmmn registers to zero, in case you really want to clean up:


FP Instructions

Otherwise, the easiest thing to do it find the in-memory representation of the floating point value (with for now we just have to accept exists: details later).

Those bits can be loaded into a register.

FP Instructions

Let's say we want to calculate the radius of a circle: \(\pi\) times the diameter.

We need to find out how to represent \(\pi\) in binary in our code (in a .data segment). I copied-and-pasted a decimal representation of \(\pi\) into float.exposed to find if I treat the bits like a double-precision float,

0x400921fb54442d18 ≈ \(\pi\)

FP Instructions

Then we can do the calculation by loading the constant through a general-purpose register:

    movabs $0x400921fb54442d18, %rdi
    movq %rdi, %xmm1
    mulsd %xmm1, %xmm0

The movq instruction is Move Quadword. It's not mov with the q suffix.

mulsd: Multiply Scalar Double Precision Floating-Point Value.

FP Instructions

Or we could load the constant from memory if we prefer:

    .quad 0x400921fb54442d18
    movq pi(%rip), %xmm1
    mulsd %xmm1, %xmm0

FP Instructions

Let's try another problem similar to what we have done on integers: a comparison function that returns -1 for a<b; 1 for a>b; and 0 for a==b.

The comisd instruction (Compare Scalar Double) will compare similarly to cmp and set flags as cmp would for unsigned integers. (i.e. ja and jb not jg and jl, and corresponding conditional moves.)

FP Instructions

I'm going to use conditional moves:

    mov $-1, %r11   # constant for cmov
    mov $1, %r10    # constant for cmov
    mov $0, %rax
    comisd %xmm0, %xmm1
    cmova %r11, %rax
    cmovb %r10, %rax

Swapping in cmp %rdi, %rsi would make this the unsigned integer version of the function.

Note that the cmovx instructions do not modify the status bits: we can do one status-bit-setting operation and then use those bits several times.

FP Instructions

There is a ptest instruction that's analogous to test, but it doesn't set the sign flag, so probably comisd and zero is a better tool for us to remember.

FP Instructions

One more complication for floating point values: the push and pop instruction cannot be used with the %xmmn registers.

But, remember that these code snippets do exactly the same thing:

push %rbx
pop %r14
sub $8, %rsp
mov %rbx, (%rsp)
mov (%rsp), %r14
add $8, %rsp

FP Instructions

The %xmmn registers can be read from or written to memory, so we can use the stack manually if we need to:

sub $8, %rsp
movq %xmm5, (%rsp)
movq (%rsp), %xmm6
add $8, %rsp

Yet another hint that this end of the instruction set is designed more for compilers than people.

FP Instructions

That's probably enough floating point instructions for now. Let's have a better look at what floating point actually is…

Fractions in Binary

Let's start with how fractional point values turn into binary.

Recall how we dealt with unsigned integers: in decimal, each digit represents the next power of ten. In binary, each bit represents the next power of two:

\[\begin{align*} 1001_{10} &= (1\times 10^3) + (0\times 10^2) + (0\times 10^1) + (1\times 10^0) \,, \\ 1001_2 &= (1\times 2^3) + (0\times 2^2) + (0\times 2^1) + (1\times 2^0) \,. \end{align*}\]

Fractions in Binary

When we have fractional parts in decimal, we just add a decimal point, and count down negative powers of ten as we go right:

\[\begin{align*} 2.34_{10} = (2\times 10^0) + (3\times 10^{-1}) + (4\times 10^{-2}) \,. \end{align*}\]

We can do the same thing in binary:

\[\begin{align*} 10.01_{2} &= (1\times 2^1) + (0\times 2^0) + (0\times 2^{-1}) + (1\times 2^{-2}) \\ &= 2.25_{10}\,. \end{align*}\]

Fractions in Binary

The positions to the right of the decimal place in binary are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, … .

We can deal with values the same was as decimal. We can convert the fraction 557/128 to binary:

\[\begin{align*} \frac{557}{128} &= 4 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \frac{1}{32} + \frac{0}{64} + \frac{1}{128} \\ &= 100.0101101_2\,. \end{align*}\]

Fractions in Binary

Not every value that has a non-repeating decimal expansion terminates in binary. e.g. 1/10:

\[ \tfrac{1}{10} = 0.1_{10} \,. \]

But in binary:

\[\begin{align*}\ \tfrac{1}{10} &= \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{256} + \tfrac{1}{512} + \tfrac{1}{4096} + \tfrac{1}{8192} + \cdots \\ &= 0.0001100110011\cdots \\ &= 0.0\overline{0011}_{2}\,. \end{align*}\]

Representing Fractions

Binary is straightforward enough, but what about bits? How can we store binary fractional things in a reasonable way in a computer?

Just like with integers, we will start by deciding how many bits we want to store for each value, typically 32 or 64.

Representing Fractions

We will use the IEEE 754 standard to represent values as \((-1)^s\cdot M\cdot 2^E\) where:

  • \(s\) is one bit for the sign: 0 for positive, 1 for negative;
  • \(M\) is the significand, fractional binary 0–2;
  • \(E\) is the exponent, an integer.

The sign is easy enough. The other two need some explanation. They are more complicated than they sound…

IEEE Floating Point

We will use one bit for the sign, and split the others between bits for the significand (frac) and the exponent (exp).

A double precision floating point value (a C double) uses 64 bits: 1 for s + 11 for exp + 52 for frac:

A single precision floating point value (a C float) uses 32 bits: 1 for s + 8 for exp + 23 for frac.

IEEE 754 bits

IEEE Floating Point

How those bits get interpretted is a little complicated.

Note on notation: we're using exp and frac for the bits in the binary representation, and \(E\) and \(M\) for the values that go into the \((-1)^s\cdot M\cdot 2^E\) formula. There's translation between the two that we'll discuss.

IEEE Floating Point

Most IEEE 754 floating point values are normalized, so let's look at that case first. Values are normalized if exp is anything except all 0s or all 1s. Then to interpret the bits…

IEEE Floating Point

We decide on a bias for our exponent. For single precision, bias is 127; for double precision, it's 1023.

Then look at exp as an unsigned integer and, \( E = exp - bias \).

e.g. in single-precision we might have exp bits 01111001. Then, \[ E = 01111001_2 - \mathit{bias} = 121 - 127 = -6\,. \]

IEEE Floating Point

For normalized values, the frac bits are taken to be the right-of-decimal-point bits, and add one. e.g. if the single-precision frac is 01001011000000000000000, it represents \[\begin{align*} M &= 1 + 0.01001011000000000000000_2 \\ &= 1 + \tfrac{1}{4} + \tfrac{1}{32} + \tfrac{1}{128} + \tfrac{1}{256} \\ &= 1 + \tfrac{75}{256} = 1.29296875\,. \end{align*}\]

IEEE Floating Point

So, if the previous examples have sign bit 0, the 32-bit single-precision float would be sign+exp+frac:


and it represents

\[\begin{align*} (-1)^s\cdot M\cdot 2^E &= (-1)^0 \cdot 1.29296875 \cdot 2^{-6} \\ &= +0.02020263671875_{10} \,. \end{align*}\]

IEEE Floating Point

A floating point number is a denormalized value when the exp is all 0s.

Then the exponent is always \(E=1-\mathit{bias}\).

The frac values are interpreted as before: as the bits to the right of the decimal point, but without adding one.

IEEE Floating Point

For example, this is a denormalized single-precision value with the same frac bits as the previous example:


Since the bias is 127 for single-precision \(E=1-127=-126\) and this value is:

\[\begin{align*} (-1)^s\cdot M\cdot 2^E &= (-1)^0 \cdot 0.29296875 \cdot 2^{-126} \\ &\approx 3.4438311\times 10^{-39} \,. \end{align*}\]

IEEE Floating Point

For denormalized values, the significand is always going to be \(0\le M < 1\) and it's going to be multipled by a very-negative power of two (\(2^{-126}\) for single-precision; \(2^{-1022}\) for double-precision). So, all of the denormalized values are very close to zero.

IEEE Floating Point

For normalized values, the significand is \(1\le M < 2\) and multiplied by a power of two with a much larger range.

The normalized numbers cover most of the range for floating point values; the denormalized handle the close-to-zero range.

IEEE Floating Point

The normalized and denormalized values have a clean transition at \(2^{1-\mathit{bias}}\). The largest denormalized single:


\(= (0.111\ldots_2) \cdot 2^{1-\mathit{bias}} \approx 1.175494211\times 10^{-38} \).

The smallest (positive) normalized single:


\(= 1\cdot 2^{1-\mathit{bias}} \approx 1.175494351\times{}10^{-38}\).

IEEE Floating Point

There's a case we haven't looked at: exp is all 1s.

These represent special values. When frac is all 0, the value represents \(+\infty\) or \(-\infty\), depending on the sign bit.

e.g. in C, 1.0/0.0 gives inf.

IEEE Floating Point

With other frac values, the represented value is not a number or NaN.

e.g. in C, sqrt(-1.0) gives nan. It can also be used to represent something like missing data.

IEEE Floating Point

Or all of that, all in one table:

all 0\(1-\mathit{bias}\)anything0.frac\((-1)^s\cdot M\cdot 2^E\)
all 1all 0\((-1)^s\cdot\infty\)
all 1else\((-1)^s\cdot\mathrm{NaN}\)
else\(\mathit{exp}-\mathit{bias}\)anything1.frac\((-1)^s\cdot M\cdot 2^E\)

IEEE Floating Point

IEEE 754 specifies these floating point representations, and also how to do operations on them (+, /, sqrt, …), rounding rules, exception handling (e.g. 1.0/0.0 is inf; sqrt(-1.0) is nan), etc.

All (most? all normal?) modern CPUs and languages implement IEEE 754, so you have some assurance that if you do a floating point calculation in C on x86, or in Rust on ARM, you'll get exactly the same result.

Representing Values

To represent a real number \(x\) (as closely as possible) as a floating point value, we need to figure out the \(s\), \(M\), and \(E\) necessary for make \((-1)^s\cdot M\cdot 2^E\) equal to that value.

The \(s\) should be easy: 0 for positive, 1 for negative.

Representing Values

As long as we're not too close to 0, it will be a normalized value. Let's assume \(s=0\) (i.e. \(x\) is positive) for simplicity and then \(x = M\cdot 2^E\) and, \[ 1\le M < 2 \\ 2^E\le M\cdot 2^E < 2^{E+1} \]

So, \(2^E\) is the largest power of 2 less than or equal to \(x\) and \( E = exp - bias \).

Representing Values

Now we can figure out \(s\) and \(E\), so just have to choose an \(M\) so that, \[\begin{align*} (-1)^s\cdot M\cdot 2^E &= x \\ M &= \frac{x}{(-1)^s \cdot 2^E} \,. \end{align*}\]

We still know \(1\le M < 2\), so it's just a question of picking the right-of-decimal-point bits to make the closest possible approximation of our value.

Representing Values

Let's try to represent \(+\frac{1}{10}\) in single-precision…

We can start with \(s=0\) and \[2^{-4} \le \tfrac{1}{10} < 2^{-3}\,,\] so \(E=-4\). The bias for single-precision is 127 for single-precision, so exp needs to be \[\begin{align*} -4+127 &= 123_{10} \\ &= 01111011_{2} \end{align*}\]

Representing Values

Then, \[ M = \frac{x}{(-1)^s \cdot 2^E} = \frac{\frac{1}{10}}{1\cdot 2^{-4}} = \tfrac{16}{10} = \tfrac{8}{5}\,. \]

Now we have 23 bits for frac and just need to get as close as possible to \(M=\tfrac{8}{5}\). \[\begin{align*} \tfrac{8}{5} &\approx 1 + \tfrac{1}{2^{1}} + \tfrac{1}{2^{4}} + \tfrac{1}{2^{5}} + \tfrac{1}{2^{8}} + \tfrac{1}{2^{9}} + \tfrac{1}{2^{12}} + \tfrac{1}{2^{13}}+\cdots \\ &\approx 1.10011001100110011001101_2\,. \end{align*}\]

… and take the right-of-decimal-point bits as frac.

Representing Values

After all of that, we got this for \(\frac{1}{10}\):


Or we could have used a converter. We can double-check in C:

uint32_t bits = 0b00111101110011001100110011001101;
float f = *(float*)&bits;
printf("%.15f\n", f);

Representing Values

Another example that's work remembering…

The all-0-bits value in any IEEE floating point will have exp all-0s, so it's a denormalized value. Also, \(s=0\) so it's positive.

The significand is going to be 0.000…, so it represents

\[ (-1)^s\cdot M\cdot 2^E = (-1)^0\cdot 0.0\cdot 2^E = +0.0\,. \]

If the first bit (\(s\)) was 1, that represents \(-0.0\).

Representing Values

Small integers can be represented exactly. For example, \(1=+2^0\cdot{}1\) so we can represent it with a normalized value, \(E=0\) and \(S=1.000\ldots\).

In single-precision, that would be:


Representing Values

Also, rational numbers with (reasonably-sized) power-of-two denominators: \(3/2=1.5\) is:




Sizes of FP

There are several sizes of IEEE floating point values that are commonly used. We have seen single- and double-precision, but there are others (later).

Sizes of FP

The largest and smallest values possible for a floating point type is controlled mostly by the number of bits in the exponent: we want the largest possible exponent without being one of the special values (111…1110) and the largest possible significand (111…111).


For single-precision, that's \(\pm 3.40\times 10^{38}\). For double, \(\pm 1.80\times 10^{308}\).

Sizes of FP

The number of bits in the significand determines how precisely we can represent a value: more bits means more precision in \(M\).

Sizes of FP

For example, we couldn't represent \(\frac{1}{10}\) exactly, but the closest single-precision values are: \[ 0.0999999940395355224609 \\ 0.1000000014901161193850 \]

In double-precision, the closest choices are: \[ 0.0999999999999999916733 \\ 0.1000000000000000055510 \]

Sizes of FP

So we have a tradeoff: more bits means more range and precision, but also more memory used for each value.

Storing 32- vs 64-bits doesn't sound like a big deal, but if you have an array of millions of values, it might be. There are also times when calculating on single-precision is about twice as fast as double-precision. [later topic: “SIMD”]

There are some other choices of floating-point size you might see…

Sizes of FP

NameTotal Bitsexp bitsfrac bitsBiasCommon Use
Double precision6411521023precision or range matters, general use
Single precision32823127if tradeoff of precision/​range for size/​speed makes sense
Half precision1651015graphics, deep learning where speed matters much more than precision/​range
Minifloat8431?Lecture examples? Deep Learning?

Sizes of FP

There are x86 instructions for single- and double-precision, and those are by far the most commonly used.

The (not yet well supported) AVX-512 instructions add some half precision vector instructions.

Bfloat16 is supported in Apple M2 chips and some machine learning accelerators (like Google's TPUs).

Representable Values

We're going to have a limited number of bits we can store (probably 32 or 64), so we can't represent every real (or rational) number: there are too many of them.

Representable Values

In decimal, we can't represent \(\frac{1}{3}\) exactly (with a finite number of digits). We can get close with \(0.3333333_{10}\), but never an exactly equal value.

In floating point, we saw that we can't represent 1/10 exactly, but we can get pretty close in single-precision. That's what you get when you type float x = 0.1 in C.

Doing double x = 0.1 gets closer, but still not exact.

Representable Values

This isn't a C-specific problem. Many programmers are alarmed by things like this in Python (which uses double-precision IEEE 754 for its float type):

>>> 0.1 + 0.2
>>> 0.3
>>> 0.1 + 0.2 == 0.3

The rounding-off that's done with 0.1 and 0.2 cause their sum to be slightly different than the rounding on the literal value 0.3.

Representable Values

Languages make different choices about what they display, but if they're using IEEE 754, the results are the same. This…

double x1 = 0.1;
double x2 = 0.2;
double x3 = 0.3;
printf("%g %g\n", x1 + x2, x3);


0.3 0.3

Representable Values

But this…

printf("%.17g\n", x1 + x2);
printf("%.17g\n", x3);
printf("%s\n", x1+x2 == x3 ? "true" : "false");



Representable Values

Note that even the smallest difference, many decimal places away from the part of the number you care about makes floating point values not-equal. Differences in rounding make it hard to predict where they will occur.

>>> 0.29999999999999999 == 0.3
>>> 0.30000000000000004 == 0.3
>>> 0.29999999999999999 == 0.30000000000000004
>>> 0.6/2 == 0.1+0.2

Representable Values

As a result, comparing floating point values for equality is almost never the right thing to do.

double x = ...
double y = ...
if ( x == y ) { /* nope */ }
if ( abs(x - y) < 0.0000001 ) { /* better */ }

Representable Values

The two 16-bit floating point options, half and bfloat16, can give us a better idea of the tradeoffs in the representation.

Half-precision: 5 bits for exp, 10 for frac, and a bias of 15.

Bfloat16: 8 bits for exp, 7 for frac, and a bias of 127.

Representable Values

Half-precision is balanced more like single and double: many more bits for frac then exp. That leaves it with a smaller range: the limits are \(\pm 65504\).

Bfloat16 has the same number of bits for exp as single precision and takes the 16 bits from frac. Its limits are \(\pm 3.39\times 10^{38}\): basically the same as singles. It chooses range over precision.

Representable Values

For half-precision values, the next value after 1.0 that can be represented is

0011110000000001 = 1.0009765625.

For bfloat16, it's

0011111110000001 = 1.0078125.

Less precision, more range. It was designed for applications like accelerated machine learning where speed is much more important than precision, so that makes sense.

Representable Values

With 64 bits, we get floating point range \(\pm 1.80\times 10^{308}\) and with 32 bits, \(\pm 3.40\times 10^{38}\).

For 64-bit signed integers, \(\pm 9.22\times 10^{18}\) and for 32-bit signed, \(\pm 2.15\times 10^{9}\).

So if we're working with big numbers, floating point loses precision, but we get much larger range

Representable Values

The design of the floating point representation ensures that representable values are closer together near zero, and farther apart for larger values (or more negative values).

It's hard to see the details, but easier if we drop to an extremely small example…

Representable Values

Let's look at an 8-bit floating point values with 1 bit for sign, 4 for exponent, 3 for significand, and a bias of 1. With one byte, there are only 256 unique values, and they are listed on the Wikipedia Minifloat page.

The rows that start 0 0000 and 1 0000 are the denormalized values. The 0 1111 and 1 1111 rows are the special values. The rest are normalized values.

Representable Values

Some observations:

  • Within each exponent, there is a linear sequence of values that can be represented (e.g. 1, 1.125, 1.25, 1.375, 1.5, 1.625, 1.75, 1.875 for exp=0001).
  • The error you're going to get on a single value (worst-possible rounding from truth to a representable value) is relative to the magnitude of the value.

Representable Values

  • Within its range, minifloat values can represent values within about 5% of their true value. Bad, but consistently bad across the range.
  • Positive and negative zero are different values (e.g. in double precision, 1/1e400 == +0.0 and -1/1e400 == -0.0).

Representable Values

Single- and double-precision are basically that, but more.

More bits for exp mean a larger range, and more precision around zero (because more powers-of-2 in both the positive and negative direction).

More bits for frac mean more precision in general: more bits to the right of the decimal place, so can represent values more exactly.

Limits of FP

Precision errors are something you might not like in floating point calculations, but can't avoid. For more: MACM 316.

There may be cases where floating point simply is not the right data type for the calculations you're doing, even though it might look like it initially.

Limits of FP

e.g. for arithmetic on money values, you would be annoyed to find out that $0.20 + $0.10 ≠ $0.30. The usual advice is to not think of $12.34 as the float 12.34, but as an integer number of cents, 1234¢ (or integer number of whatever the smallest unit your currency has).

Then do integer arithmetic and format as dollars only for display.

Limits of FP

e.g. There are libraries for rational numbers (fractions.Fraction in Python, Boost's Rational class for C++) and arbitrary-precision floating point (decimal.Decimal in Python, Boost.Multiprecision for C++).

Calculations on them will be slower than raw floating point calculations, but correct code is better than fast code: if you need them, use them.

Limits of FP

An application of FP that's maybe surprising: recording audio.

To store audio, we need to represent a sample of the sound wave many times per second (44.1 kHz or so). How do we represent those samples? Traditionally with 16-bit integers (maybe 24- or 32-bit).

e.g. CD quality audio = 16-bit integers, 44.1 kHz.

Limits of FP

The issue for the moment: if the audio is too loud, it can clip. Basically, the tops/​bottom of the waveforms get cut off.

clipping waveform
Wikimedia commons File:Clipping waveform.svg

Limits of FP

When recording audio, levels can be very different, depending on the situation (microphone sensitivity, loudness of the sound, how close the microphone is).

Traditionally, it was very important to set gain so the levels were large enough to be captured (there aren't many integers near zero, so the signal needs to be relatively loud so the detail is there), but not so high signals clip (which distorts the sound).

Limits of FP

For 16-bit signed integers, we could represent the level of the waveform in the range -32768–32767.

These standards come from a time when storage and compute were very limited: even 16-bit integers were good enough and we could store them and manipulate them (44.1k times per second) on computers we had at the time.

Limits of FP

But if we switched to 32-bit floating point values, the range would become \(\pm 3.40\times 10^{38}\) and clipping is almost impossible.

Also, really quiet sounds still have good fidelity because of the precision of floats closer to zero.

Limits of FP

It turns out 32-bit float audio is a thing. It's amusing to hear people talk about single-precision float like it's a new thing.

Basically, sound level doesn't matter when recording 32-bit float (until things are so loud your microphone was probably going to explode anyway). Fix it in post.

Limits of FP

When it comes to audio playback, integers are probably fine: levels have to be in a kind-of-reasonable range so different audio has similar levels. (i.e. you expect volume at 70% should be similarly-loud for all songs, so weirdly different levels aren't worth worrying about.)

Compilers and FP

One thing that should be clear: IEEE floating point numbers are a decent approximation of real numbers (i.e. \(\mathbb{R}\) from the math class), but they are definitely different. For real numbers, we know the basic operations are associative:

\[ (a+b)+c = a+(b+c)\,. \]

Compilers and FP

Addition and multiplication on floating point numbers is not associative:

>>> (0.7 + 0.2) + 0.1
>>> 0.7 + (0.2 + 0.1)
>>> (0.12345678 * 0.12345678) * 0.56789
>>> 0.12345678 * (0.12345678 * 0.56789)

In other words: algebra (as you understand it) doesn't work on floats.

\[ (a+b)+c \approx a+(b+c)\,. \]

Compilers and FP

It turns out we have seen this already. You just didn't notice.

When adding arrays in the pipeline demo, I used float (single precision) as the array elements.

#define DATA_T float

Then did a different calculation that had different performance…

Compilers and FP

DATA_T array_sum_1(DATA_T* arr, uint64_t n) {
    DATA_T total = 0;
    for (uint64_t i = 0; i < n; i++) {
        total += arr[i];
    return total;
DATA_T array_sum_3(DATA_T* arr, uint64_t n) {
    DATA_T total1 = 0, total2 = 0;
    for (uint64_t i = 0; i < n - 1; i += 2) {
        total1 += arr[i];
        total2 += arr[i + 1];
    return total1 + total2;

Compilers and FP

The output from summing the same array of small numbers with those functions:

array_sum_1  ...  Result: 173.1003113
array_sum_2  ...  Result: 173.1003113
array_sum_3  ...  Result: 173.1308441

The additions were done in a different order, so the rounding errors accumulated differently.

Summing many small values (exactly like this problem) is one of the places rounding errors can get really bad.

Compilers and FP

I also said at the time that the speed difference in that demo disappeared when changing from float to an integer type.

At least part of the reason: integer addition is associative*. The compiler is free to change the order of operations all it wants, because the result of the calculation will be the same. It might have done that optimization for us.

Compilers and FP

The compiler won't rearrange floating point operations because of the implicit contract you have with it: the compiled code is going to produce exactly the result you asked for.

The compiler can rearrange/​change/​skip operations as much as it likes, as long as the result is exactly the same.

Compilers and FP

The result: we'll generally see much more modest compiler optimizations around FP code. The compiler is almost always bound to do exactly the calculations you requested, in exactly the order you wrote them.

We'll talk more later about how to ask the compiler nicely to be less precise with floating point operations.