We need to talk about how computers represent floating point values. Just like for integers, there are details that matter when it comes to working with them.
But before that, let's actually use them.
A quick primer before we do… there are two commonly-used floating point types: single- and double-precision. Single precision is C's float type: it is 32 bits and can represent values with ≈8 (decimal) digits of precision. Double precision is C's double type (and Python's float, and JavaScript's numeric type, and …): it is 64 bits and gives ≈15 significant digits.
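To see the difference in practice, here's a quick sketch in C (the exact trailing digits depend on your platform's printing, but the ≈8-digit vs ≈15-digit behaviour is the point):

#include <stdio.h>

int main(void) {
    float  f = 1.0f / 3.0f;   // single precision
    double d = 1.0  / 3.0;    // double precision
    printf("%.20f\n", f);     // agrees with 1/3 for roughly the first 7-8 digits, then noise
    printf("%.20f\n", d);     // agrees for roughly the first 15-16 digits, then noise
    return 0;
}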
There are two sets of floating point registers and instructions in x86-64 processors (because of course there are).
Originally, the x86 processors didn't have floating point instructions: if a programmer (or compiler) wanted to do floating point stuff, it had to be done with whatever combination of other instructions (i.e. functions/libraries) would do the job.
The x87 instructions were implemented on optional coprocessors before the i486 (1989), and were then part of the processor itself.
They use registers ST(0) to ST(7) (optionally) as a stack, i.e. there's a zero-operand instruction that can pop ST(0) and ST(1), divide them, and push the result into ST(0).
We're going to ignore the x87 instructions in favour of the newer instructions. Basically: if you see ST(n) in the description of an instruction, that's not the one you want.
Compilers seem to do the same: with -march=haswell (or other appropriate architecture specified), they seem to never emit x87 instructions.
We're going to work with the instructions that were introduced with the MMX and SSE (Streaming SIMD Extensions) instruction sets and extended as SSE2, SSE3, SSSE3, SSE4, AVX (Advanced Vector Extensions), and AVX2 (and AVX-512, which isn't yet well supported, certainly not in our Haswell baseline).
I'm going to call them the vector instructions, because I don't think they collectively have a name, even though we aren't (yet) using them as vector operations.
The vector instructions work on registers %xmm0 to %xmm15 (and their wider siblings %ymm0–%ymm15: more later).
There are a lot of registers, which is good, but none of them is call-preserved.
The streaming instructions can work on multiple values at once (vector processing or SIMD, single-instruction multiple-data) but we're going to come back to that later. [later topic: “SIMD”]
For now: look for %xmmn in the docs, and the word scalar in the description of the instruction, which indicates the non-vector version.
There are all of these ways to divide (and a few more). Of these, fdiv is not the one you want: it's x87.
div: unsigned integer divide on the general purpose registers (%rax and friends).
idiv: signed integer divide on the general purpose registers.
fdiv: x87 double-precision floating point divide.
divss: Divide Scalar Single-Precision Floating-Point Value.
divsd: Divide Scalar Double-Precision Floating-Point Value.
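For scalar double-precision division, divsd is the one you'll usually see in compiled code. A minimal sketch in C (the exact instruction selection is up to the compiler, but gcc/clang with -march=haswell typically produce something like the comment; the function name is just for illustration):

double divide(double a, double b) {
    return a / b;   // typically compiles to: divsd %xmm1, %xmm0  followed by  ret
}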
Remember the calling convention we have been using for assembly: return value in %rax, first argument in %rdi, second argument in %rsi, ….
Those apply to integer (and pointer) arguments.
For floating point values, the arguments are in %xmm0, %xmm1, …, %xmm7. A floating point return value must be put in %xmm0.
If you have both integer (or pointer) and floating point arguments to a function, %rdi is the first integer argument and %xmm0 is the first floating point argument, no matter where they are in the list.
So in this function, a and b will be in %rdi and %rsi; x and y will be in %xmm0 and %xmm1; return in %rax.
uint64_t f(uint64_t a, uint64_t b, double x, double y);
Here, length and width are in %xmm0 and %xmm1; count in %rdi; return in %xmm0.
double f(double length, double width, uint64_t count);
With those details out of the way, working with floating point is mostly straightforward.
Let's reproduce the first-ever assembly from the lab, except a floating-point version. The equivalent of:
double multiply_add_c(double a, double b, double c) { return a + b * c; }
There are instructions for double-precision floating point addition and multiplication: addsd (ADD Scalar Double precision) and mulsd (MULtiply Scalar Double precision).
Then to calculate and return a + b * c:
# %xmm0: a and return
# %xmm1: b
# %xmm2: c
multiply_add:
    mulsd %xmm2, %xmm1   # b *= c
    addsd %xmm1, %xmm0   # a += b
    ret
The floating point instruction set is much richer than the integer one in many ways (that we mostly won't discuss). It turns out there's a single instruction for that:
multiply_add_2:
    vfmadd231sd %xmm1, %xmm2, %xmm0
    ret
The catchy mnemonic vfmadd231sd and name Fused Multiply-Add of Scalar Double Precision Floating-Point Values are perhaps a hint that some of this functionality is designed more for compilers than humans.
But the more "normal" floating point instructions work like the integer instructions, implementing in-place (+= and similar) operations, just with different registers.
One notable thing we lose for floating point: immediate values or constants. There is no floating point/%xmmn register equivalent of:
mov $1234, %rax
There is an idiom to set a register to zero. Three relevant facts: pxor is the %xmmn register XOR operation; XORing any value with itself gives all-zero bits; and the all-zero bit pattern represents 0.0. So this will set %xmm4 to 0.0:
pxor %xmm4, %xmm4
Or an instruction that sets all of the %xmmn registers to zero, in case you really want to clean up:
vzeroall
Otherwise, the easiest thing to do is to find the in-memory representation of the floating point value (which for now we just have to accept exists: details later). Those bits can be loaded into a register.
Let's say we want to calculate the circumference of a circle: \(\pi\) times the diameter.
We need to find out how to represent \(\pi\) in binary in our code (in a .data segment). I copied-and-pasted a decimal representation of \(\pi\) into float.exposed to find that if I treat the bits like a double-precision float,
0x400921fb54442d18 ≈ \(\pi\)
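If you'd rather not use a web tool, a short C sketch can print the bit pattern of any double (memcpy copies the bits unchanged; M_PI is a common but not strictly ISO-C constant from math.h):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>

int main(void) {
    double d = M_PI;                     // any double we want the bits of
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);      // reinterpret the 64 bits as an integer
    printf("0x%016" PRIx64 "\n", bits);  // prints 0x400921fb54442d18
    return 0;
}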
Then we can do the calculation by loading the constant through a general-purpose register:
circumference:
    movabs $0x400921fb54442d18, %rdi
    movq %rdi, %xmm1
    mulsd %xmm1, %xmm0
    ret
The movq instruction is Move Quadword. It's not mov with the q suffix.
mulsd: Multiply Scalar Double Precision Floating-Point Value.
Or we could load the constant from memory if we prefer:
.data
pi: .quad 0x400921fb54442d18
circumference2:
    movq pi(%rip), %xmm1
    mulsd %xmm1, %xmm0
    ret
Let's try another problem similar to what we have done on integers: a comparison function that returns -1 for a<b; 1 for a>b; and 0 for a==b.
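In C, the behaviour we're aiming for would look something like this (a sketch, not code from the course):

#include <stdint.h>

int64_t double_cmp(double a, double b) {
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;   // NaN arguments also land here: every comparison with NaN is false
}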
The comisd instruction (Compare Scalar Double) will compare similarly to cmp and set flags as cmp would for unsigned integers (i.e. ja and jb, not jg and jl, and the corresponding conditional moves).
I'm going to use conditional moves:
double_cmp:
    mov $-1, %r11          # constant for cmov
    mov $1, %r10           # constant for cmov
    mov $0, %rax
    comisd %xmm0, %xmm1
    cmova %r11, %rax
    cmovb %r10, %rax
    ret
Swapping in cmp %rdi, %rsi would make this the unsigned integer version of the function.
Note that the cmovx instructions do not modify the status bits: we can do one status-bit-setting operation and then use those bits several times.
There is a ptest instruction that's analogous to test, but it doesn't set the sign flag, so comisd against zero is probably a better tool for us to remember.
One more complication for floating point values: the push and pop instructions cannot be used with the %xmmn registers.
But, remember that these code snippets do exactly the same thing:
push %rbx
⋮
pop %r14
sub $8, %rsp
mov %rbx, (%rsp)
⋮
mov (%rsp), %r14
add $8, %rsp
The %xmmn registers can be read from or written to memory, so we can use the stack "manually" if we need to:
sub $8, %rsp
movq %xmm5, (%rsp)
⋮
movq (%rsp), %xmm6
add $8, %rsp
Yet another hint that this end of the instruction set is designed more for compilers than people.
That's probably enough floating point instructions for now. Let's have a better look at what floating point actually is…
Let's start with how fractional values turn into binary.
Recall how we dealt with unsigned integers: in decimal, each digit represents the next power of ten. In binary, each bit represents the next power of two:
\[\begin{align*} 1001_{10} &= (1\times 10^3) + (0\times 10^2) + (0\times 10^1) + (1\times 10^0) \,, \\ 1001_2 &= (1\times 2^3) + (0\times 2^2) + (0\times 2^1) + (1\times 2^0) \,. \end{align*}\]
When we have fractional parts in decimal, we just add a decimal point, and count down negative powers of ten as we go right:
\[\begin{align*} 2.34_{10} = (2\times 10^0) + (3\times 10^{-1}) + (4\times 10^{-2}) \,. \end{align*}\]
We can do the same thing in binary:
\[\begin{align*} 10.01_{2} &= (1\times 2^1) + (0\times 2^0) + (0\times 2^{-1}) + (1\times 2^{-2}) \\ &= 2.25_{10}\,. \end{align*}\]
The positions to the right of the "decimal" place in binary are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256, … .
We can deal with values the same way as decimal. We can convert the fraction 557/128 to binary:
\[\begin{align*} \frac{557}{128} &= 4 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \frac{1}{32} + \frac{0}{64} + \frac{1}{128} \\ &= 100.0101101_2\,. \end{align*}\]
Not every value that has a non-repeating decimal expansion terminates in binary. e.g. 1/10:
\[ \tfrac{1}{10} = 0.1_{10} \,. \]
But in binary:
\[\begin{align*}\ \tfrac{1}{10} &= \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{256} + \tfrac{1}{512} + \tfrac{1}{4096} + \tfrac{1}{8192} + \cdots \\ &= 0.0001100110011\cdots \\ &= 0.0\overline{0011}_{2}\,. \end{align*}\]
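Both of those expansions can be checked mechanically. A short C sketch of the by-hand procedure, repeatedly multiplying by two and peeling off the integer bit (the helper name is just for illustration):

#include <stdio.h>

// Print the first `bits` binary digits after the point for 0 <= x < 1.
void print_fraction_bits(double x, int bits) {
    for (int i = 0; i < bits; i++) {
        x *= 2;                              // shift the next bit to the left of the point
        if (x >= 1) { putchar('1'); x -= 1; }
        else        { putchar('0'); }
    }
    putchar('\n');
}

int main(void) {
    print_fraction_bits(557.0/128 - 4, 10);  // 0101101000: the bits after the point in 100.0101101
    print_fraction_bits(0.1, 20);            // 00011001100110011001...: never terminates
    return 0;
}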
Binary is straightforward enough, but what about bits? How can we store binary fractional things in a reasonable way in a computer?
Just like with integers, we will start by deciding how many bits we want to store for each value, typically 32 or 64.
We will use the IEEE 754 standard to represent values as \((-1)^s\cdot M\cdot 2^E\), where \(s\) is a sign bit, \(M\) is the significand, and \(E\) is the exponent.
The sign is easy enough. The other two need some explanation. They are more complicated than they sound…
We will use one bit for the sign, and split the others between bits for the significand (frac) and the exponent (exp).
A double precision floating point value (a C double) uses 64 bits: 1 for s, 11 for exp, and 52 for frac.
A single precision floating point value (a C float) uses 32 bits: 1 for s, 8 for exp, and 23 for frac.
How those bits get interpreted is a little complicated.
Note on notation: we're using exp and frac for the bits in the binary representation, and \(E\) and \(M\) for the values that go into the \((-1)^s\cdot M\cdot 2^E\) formula. There's translation between the two that we'll discuss.
Most IEEE 754 floating point values are normalized, so let's look at that case first. Values are normalized if exp is anything except all 0s or all 1s. Then to interpret the bits…
We decide on a bias for our exponent. For single precision, bias is 127; for double precision, it's 1023.
Then interpret exp as an unsigned integer, and \( E = \mathit{exp} - \mathit{bias} \).
e.g. in single-precision we might have exp bits 01111001. Then, \[ E = 01111001_2 - \mathit{bias} = 121 - 127 = -6\,. \]
For normalized values, the frac bits are taken to be the right-of-decimal-point bits, and add one. e.g. if the single-precision frac is 01001011000000000000000, it represents \[\begin{align*} M &= 1 + 0.01001011000000000000000_2 \\ &= 1 + \tfrac{1}{4} + \tfrac{1}{32} + \tfrac{1}{128} + \tfrac{1}{256} \\ &= 1 + \tfrac{75}{256} = 1.29296875\,. \end{align*}\]
So, if the previous examples have sign bit 0, the 32-bit single-precision float would be sign+exp+frac:
00111100101001011000000000000000
and it represents
\[\begin{align*} (-1)^s\cdot M\cdot 2^E &= (-1)^0 \cdot 1.29296875 \cdot 2^{-6} \\ &= +0.02020263671875_{10} \,. \end{align*}\]
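We can double-check that interpretation mechanically in C, the same way we'll check 1/10 later (a sketch; ldexp(M, E) computes \(M\cdot 2^E\)):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    uint32_t bits = 0b00111100101001011000000000000000;   // the pattern above
    uint32_t s    = bits >> 31;                 // 1 sign bit
    uint32_t exp  = (bits >> 23) & 0xff;        // 8 exponent bits
    uint32_t frac = bits & 0x7fffff;            // 23 significand bits
    int      E    = (int)exp - 127;             // remove the bias
    double   M    = 1 + frac / (double)(1 << 23);          // normalized: implicit leading 1
    printf("%.17g\n", (s ? -1.0 : 1.0) * ldexp(M, E));     // prints 0.02020263671875
    return 0;
}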
A floating point number is a denormalized value when the exp is all 0s.
Then the exponent is always \(E=1-\mathit{bias}\).
The frac values are interpreted as before: as the bits to the right of the decimal point, but without adding one.
For example, this is a denormalized single-precision value with the same frac bits as the previous example:
00000000001001011000000000000000
Since the bias is 127 for single-precision, \(E=1-127=-126\), and this value is:
\[\begin{align*} (-1)^s\cdot M\cdot 2^E &= (-1)^0 \cdot 0.29296875 \cdot 2^{-126} \\ &\approx 3.4438311\times 10^{-39} \,. \end{align*}\]
For denormalized values, the significand is always going to be \(0\le M < 1\) and it's going to be multiplied by a very-negative power of two (\(2^{-126}\) for single-precision; \(2^{-1022}\) for double-precision). So, all of the denormalized values are very close to zero.
For normalized values, the significand is \(1\le M < 2\) and multiplied by a power of two with a much larger range.
The normalized numbers cover most of the range for floating point values; the denormalized handle the very-close-to-zero values (and zero itself).
The normalized and denormalized values have a clean transition at \(2^{1-\mathit{bias}}\). The largest denormalized single:
00000000011111111111111111111111
\(= (0.111\ldots_2) \cdot 2^{1-\mathit{bias}} \approx 1.175494211\times 10^{-38} \).
The smallest positive normalized single:
00000000100000000000000000000000
\(= 1\cdot 2^{1-\mathit{bias}} \approx 1.175494351\times{}10^{-38}\).
There's a case we haven't looked at: exp is all 1s.
These represent special values. When frac is all 0, the value represents \(+\infty\) or \(-\infty\), depending on the sign bit.
e.g. in C, 1.0/0.0 gives inf.
With other frac values, the represented value is "not a number", or NaN. e.g. in C, sqrt(-1.0) gives nan. It can also be used to represent something like "missing data".
Or all of that, all in one table:
exp | \(E\) | frac | \(M\) | Meaning |
---|---|---|---|---|
all 0 | \(1-\mathit{bias}\) | anything | 0.frac | \((-1)^s\cdot M\cdot 2^E\) |
all 1 | | all 0 | | \((-1)^s\cdot\infty\) |
all 1 | | else | | \((-1)^s\cdot\mathrm{NaN}\) |
else | \(\mathit{exp}-\mathit{bias}\) | anything | 1.frac | \((-1)^s\cdot M\cdot 2^E\) |
IEEE 754 specifies these floating point representations, and also how to do operations on them (+, /, sqrt, …), rounding rules, exception handling (e.g. 1.0/0.0 is inf; sqrt(-1.0) is nan), etc.
All (most? all "normal"?) modern CPUs and languages implement IEEE 754, so you have some assurance that if you do a floating point calculation in C on x86, or in Rust on ARM, you'll get exactly the same result.
To represent a real number \(x\) (as closely as possible) as a floating point value, we need to figure out the \(s\), \(M\), and \(E\) necessary to make \((-1)^s\cdot M\cdot 2^E\) equal to that value.
The \(s\) should be easy: 0 for positive, 1 for negative.
As long as we're not too close to 0, it will be a normalized value. Let's assume \(s=0\) (i.e. \(x\) is positive) for simplicity and then \(x = M\cdot 2^E\) and, \[ 1\le M < 2 \\ 2^E\le M\cdot 2^E < 2^{E+1} \]
So, \(2^E\) is the largest power of 2 less than or equal to \(x\), and the stored exponent bits will be \(\mathit{exp} = E + \mathit{bias}\).
Now we can figure out \(s\) and \(E\), so we just have to choose an \(M\) so that, \[\begin{align*} x &= (-1)^s\cdot M\cdot 2^E \\ M &= \frac{x}{(-1)^s \cdot 2^E} \,. \end{align*}\]
We still know \(1\le M < 2\), so it's just a question of picking the right-of-decimal-point bits to make the closest possible approximation of our value.
Let's try to represent \(+\frac{1}{10}\) in single-precision…
We can start with \(s=0\) and \[2^{-4} \le \tfrac{1}{10} < 2^{-3}\,,\] so \(E=-4\). The bias for single-precision is 127, so exp needs to be \[\begin{align*} -4+127 &= 123_{10} \\ &= 01111011_{2} \end{align*}\]
Then, \[ M = \frac{x}{(-1)^s \cdot 2^E} = \frac{\frac{1}{10}}{1\cdot 2^{-4}} = \tfrac{16}{10} = \tfrac{8}{5}\,. \]
Now we have 23 bits for frac and just need to get as close as possible to \(M=\tfrac{8}{5}\). \[\begin{align*} \tfrac{8}{5} &\approx 1 + \tfrac{1}{2^{1}} + \tfrac{1}{2^{4}} + \tfrac{1}{2^{5}} + \tfrac{1}{2^{8}} + \tfrac{1}{2^{9}} + \tfrac{1}{2^{12}} + \tfrac{1}{2^{13}}+\cdots \\ &\approx 1.10011001100110011001101_2\,. \end{align*}\]
… and take the right-of-decimal-point bits as frac.
After all of that, we got this for \(\frac{1}{10}\):
00111101110011001100110011001101
Or we could have used a converter. We can double-check in C:
uint32_t bits = 0b00111101110011001100110011001101;
float f = *(float*)&bits;   // treat that as single-precision
printf("%.15f\n", f);
0.100000001490116
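The pointer cast works in practice but technically breaks C's aliasing rules; memcpy is the safer way to reinterpret bits, if you prefer (a sketch; same output):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    uint32_t bits = 0b00111101110011001100110011001101;
    float f;
    memcpy(&f, &bits, sizeof f);   // copy the 32 bits into the float unchanged
    printf("%.15f\n", f);          // 0.100000001490116
    return 0;
}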
Another example that's worth remembering…
The all-0-bits value in any IEEE floating point will have exp all-0s, so it's a denormalized value. Also, \(s=0\) so it's positive.
The significand is going to be 0.000…, so it represents
\[ (-1)^s\cdot M\cdot 2^E = (-1)^0\cdot 0.0\cdot 2^E = +0.0\,. \]
If the first bit (\(s\)) was 1, that represents \(-0.0\).
Small integers can be represented exactly. For example, \(1=+2^0\cdot{}1\) so we can represent it with a normalized value, \(E=0\) and \(M=1.000\ldots\).
In single-precision, that would be:
00111111100000000000000000000000
Also, rational numbers with (small-ish) power-of-two denominators: \(3/2=1.5\) is:
00111111110000000000000000000000
\(-3/8=-0.375\):
10111110110000000000000000000000
There are several "sizes" of IEEE floating point values that are commonly used. We have seen single- and double-precision, but there are others (later).
The largest and smallest values possible for a floating point type are controlled mostly by the number of bits in the exponent: we want the largest possible exponent without being one of the "special" values (111…110) and the largest possible significand (111…111).
0 111…110 111…111
For single-precision, that's \(\pm 3.40\times 10^{38}\). For double, \(\pm 1.80\times 10^{308}\).
The number of bits in the significand determines how precisely we can represent a value: more bits means more precision in \(M\).
For example, we couldn't represent \(\frac{1}{10}\) exactly, but the closest single-precision values are:
0.0999999940395355224609
0.1000000014901161193850
In double-precision, the closest choices are:
0.0999999999999999916733
0.1000000000000000055510
So we have a tradeoff: more bits means more range and precision, but also more memory used for each value.
Storing 32- vs 64-bits doesn't sound like a big deal, but if you have an array of millions of values, it might be. There are also times when calculating on single-precision is about twice as fast as double-precision. [later topic: “SIMD”]
There are some other choices of floating-point size you might see…
Name | Total Bits | exp bits | frac bits | Bias | Common Use |
---|---|---|---|---|---|
Double precision | 64 | 11 | 52 | 1023 | precision or range matters, general use |
Single precision | 32 | 8 | 23 | 127 | if tradeoff of precision/range for size/speed makes sense |
Half precision | 16 | 5 | 10 | 15 | graphics, deep learning where speed matters much more than precision and/or range |
bfloat16 | 16 | 8 | 7 | 127 | |
Minifloat | 8 | 4 | 3 | 1? | Lecture examples? Deep Learning? |
There are x86 instructions for single- and double-precision, and those are by far the most commonly used.
The (not yet widely supported) AVX-512 instructions add some half precision and bfloat16 vector instructions.
Bfloat16 is supported in Apple M2 chips and some machine learning accelerators (like Google's TPUs).
We're going to have a limited number of bits we can store (probably 32 or 64), so we can't represent every real (or rational) number: there are too many of them.
In decimal, we can't represent \(\frac{1}{3}\) exactly (with a finite number of digits). We can get close with \(0.3333333_{10}\), but never an exactly equal value.
In floating point, we saw that we can't represent 1/10 exactly, but we can get pretty close in single-precision. That's what you get when you type float x = 0.1 in C.
Doing double x = 0.1 gets closer, but still not exact.
This isn't a C-specific problem. Many programmers are alarmed by things like this in Python (which uses double-precision IEEE 754 for its float type):
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.3
0.3
>>> 0.1 + 0.2 == 0.3
False
The rounding-off that's done with 0.1 and 0.2 causes their sum to be slightly different than the rounding on the literal value 0.3.
Languages make different choices about what they display, but if they're using IEEE 754, the results are the same. This…
double x1 = 0.1;
double x2 = 0.2;
double x3 = 0.3;
printf("%g %g\n", x1 + x2, x3);
prints:
0.3 0.3
But this…
printf("%.17g\n", x1 + x2); printf("%.17g\n", x3); printf("%s\n", x1+x2 == x3 ? "true" : "false");
prints:
0.30000000000000004
0.29999999999999999
false
Note that even the smallest difference, many decimal places away from the part of the number you care about, makes floating point values not-equal. Differences in rounding make it hard to predict where they will occur.
>>> 0.29999999999999999 == 0.3
True
>>> 0.30000000000000004 == 0.3
False
>>> 0.29999999999999999 == 0.30000000000000004
False
>>> 0.6/2 == 0.1+0.2
False
As a result, comparing floating point values for equality is almost never the right thing to do.
double x = ...
double y = ...
if ( x == y ) { /* nope */ }
if ( fabs(x - y) < 0.0000001 ) { /* better */ }
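The fixed threshold in that comparison is itself a judgment call: it's huge next to values near 1e-10 and tiny next to values near 1e10. One common refinement (a sketch only; the helper name and tolerance are illustrative) is to scale the tolerance by the magnitude of the values:

#include <math.h>
#include <stdbool.h>

// true if x and y agree to within roughly 9 significant digits
bool nearly_equal(double x, double y) {
    return fabs(x - y) <= 1e-9 * fmax(fabs(x), fabs(y));
}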
The two 16-bit floating point options, half and bfloat16, can give us a better idea of the tradeoffs in the representation.
Half-precision: 5 bits for exp, 10 for frac, and a bias of 15.
Bfloat16: 8 bits for exp, 7 for frac, and a bias of 127.
Half-precision is balanced more like single and double: many more bits for frac than exp. That leaves it with a smaller range: the limits are \(\pm 65504\).
Bfloat16 has the same number of bits for exp as single precision and takes the missing 16 bits out of frac. Its limits are \(\pm 3.39\times 10^{38}\): basically the same as singles. It chooses range over precision.
For half-precision values, the next value after 1.0 that can be represented is 0011110000000001 = 1.0009765625.
For bfloat16, it's 0011111110000001 = 1.0078125.
Less precision, more range. It was designed for applications like accelerated machine learning where speed is much more important than precision, so that makes sense.
With 64 bits, we get floating point range \(\pm 1.80\times 10^{308}\) and with 32 bits, \(\pm 3.40\times 10^{38}\).
For 64-bit signed integers, \(\pm 9.22\times 10^{18}\) and for 32-bit signed, \(\pm 2.15\times 10^{9}\).
So if we're working with big numbers, floating point loses precision, but we get a much larger range.
The design of the floating point representation ensures that representable values are closer together near zero, and farther apart for larger values (or more negative values).
It's hard to see the details, but easier if we drop to an extremely small example…
Let's look at an 8-bit floating point value with 1 bit for sign, 4 for exponent, 3 for significand, and a bias of 1. With one byte, there are only 256 unique values, and they are listed on the Wikipedia Minifloat page.
The rows that start 0 0000 and 1 0000 are the denormalized values. The 0 1111 and 1 1111 rows are the special values. The rest are normalized values.
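We don't have to decode the table by hand: a short C sketch can produce all 256 values, assuming the 1-bit sign / 4-bit exp / 3-bit frac split and the bias of 1 described above (the function name is just for illustration):

#include <stdio.h>
#include <math.h>

// Decode an 8-bit minifloat: 1 sign bit, 4 exponent bits, 3 fraction bits, bias 1.
double minifloat_value(unsigned b) {
    enum { BIAS = 1 };
    int s    = (b >> 7) & 1;
    int exp  = (b >> 3) & 0xf;
    int frac =  b       & 0x7;
    double sign = s ? -1.0 : 1.0;
    if (exp == 0xf)                                     // all-1s exp: infinity or NaN
        return frac == 0 ? sign * INFINITY : NAN;
    if (exp == 0)                                       // denormalized: no implicit 1, E = 1 - bias
        return sign * ldexp(frac / 8.0, 1 - BIAS);
    return sign * ldexp(1 + frac / 8.0, exp - BIAS);    // normalized: implicit 1, E = exp - bias
}

int main(void) {
    for (unsigned b = 0; b < 256; b++)
        printf("%3u: %g\n", b, minifloat_value(b));     // all 256 representable values
    return 0;
}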
Some observations:
The precision (the distance from the "truth" to the nearest representable value) is relative to the magnitude of the value.
So the relative error is roughly constant: we're off by about the same fraction of the "true" value. Bad, but consistently bad across the range.
Zero is signed, and values too small to represent become +0.0 or -0.0 (e.g. 1/1e400 == +0.0 and -1/1e400 == -0.0).
Single- and double-precision are basically that, but more.
More bits for exp mean a larger range, and more precision around zero (because more powers-of-2 in both the positive and negative direction).
More bits for frac mean more precision in general: more bits to the right of the decimal place, so can represent values more exactly.
Precision errors are something you might not like in floating point calculations, but can't avoid. For more: MACM 316.
There may be cases where floating point simply is not the right data type for the calculations you're doing, even though it might look like it initially.
e.g. for arithmetic on money values, you would be annoyed to find out that $0.20 + $0.10 ≠ $0.30. The usual advice is to not think of $12.34 as the float 12.34, but as an integer number of cents, 1234¢ (or an integer number of whatever the smallest unit of your currency is).
Then do integer arithmetic and format as dollars only for display.
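A minimal sketch of the integer-cents approach (the variable names and formatting are just for illustration):

#include <stdio.h>

int main(void) {
    long long a_cents = 20, b_cents = 10;        // $0.20 and $0.10 as whole cents
    long long total = a_cents + b_cents;         // exact integer arithmetic: no rounding
    printf("$%lld.%02lld\n", total / 100, total % 100);   // prints $0.30
    return 0;
}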
e.g. there are libraries for rational numbers (fractions.Fraction in Python, Boost's Rational class for C++) and arbitrary-precision floating point (decimal.Decimal in Python, Boost.Multiprecision for C++).
Calculations on them will be slower than raw floating point calculations, but correct code is better than fast code: if you need them, use them.
An application of FP that's maybe surprising: recording audio.
To store audio, we need to represent a sample of the sound wave many times per second (44.1 kHz or so). How do we represent those samples? Traditionally with 16-bit integers (maybe 24- or 32-bit).
e.g. "CD quality" audio = 16-bit integers, 44.1 kHz.
The issue for the moment: if the audio is too loud, it can clip. Basically, the tops/bottoms of the waveforms get cut off.
When recording audio, levels can be very different, depending on the situation (microphone sensitivity, loudness of the sound, how close the microphone is).
Traditionally, it was very important to set gain so the levels were large enough to be captured (there aren't many integers near zero, so the signal needs to be relatively loud so the detail is there), but not so high that signals clip (which distorts the sound).
For 16-bit signed integers, we could represent the level of the waveform in the range -32768 to 32767.
These standards come from a time when storage and compute were very limited: even 16-bit integers were good enough and we could store them and manipulate them (44.1k times per second) on computers we had at the time.
But if we switched to 32-bit floating point values, the range would become \(\pm 3.40\times 10^{38}\) and clipping is almost impossible.
Also, really quiet sounds still have good fidelity because of the precision of floats closer to zero.
It turns out 32-bit float audio is a thing. It's amusing to hear people talk about single-precision float like it's a new thing.
Basically, sound level doesn't matter when recording 32-bit float (until things are so loud your microphone was probably going to explode anyway). Fix it in post.
When it comes to audio playback, integers are probably fine: levels have to be in a kind-of-reasonable range so different audio has similar levels. (i.e. you expect volume at 70% should be similarly-loud for all songs, so wildly-different levels aren't worth worrying about.)
One thing that should be clear: IEEE floating point numbers are a decent approximation of real numbers (i.e. \(\mathbb{R}\) from the math class), but they are definitely different. For real numbers, we know the basic operations are associative:
\[ (a+b)+c = a+(b+c)\,. \]
Addition and multiplication on floating point numbers are not associative:
>>> (0.7 + 0.2) + 0.1
0.9999999999999999
>>> 0.7 + (0.2 + 0.1)
1.0
>>> (0.12345678 * 0.12345678) * 0.56789
0.008655538894467976
>>> 0.12345678 * (0.12345678 * 0.56789)
0.008655538894467974
In other words: algebra (as you understand it) doesn't work on floats.
\[ (a+b)+c \approx a+(b+c)\,,\\ (a\times b)\times c \approx a\times (b\times c)\,. \]
It turns out we have seen this already. You just didn't notice.
When adding arrays in the pipeline demo, I used float (single precision) as the array elements.
#define DATA_T float
Then did a different calculation that had different performance…
DATA_T array_sum_1(DATA_T* arr, uint64_t n) {
    DATA_T total = 0;
    for (uint64_t i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}
DATA_T array_sum_3(DATA_T* arr, uint64_t n) {
    DATA_T total1 = 0, total2 = 0;
    for (uint64_t i = 0; i < n - 1; i += 2) {
        total1 += arr[i];
        total2 += arr[i + 1];
    }
    return total1 + total2;
}
The output from summing the same array of small numbers with those functions:
array_sum_1 ... Result: 173.1003113
array_sum_2 ... Result: 173.1003113
array_sum_3 ... Result: 173.1308441
The additions were done in a different order, so the rounding errors accumulated differently.
Summing many small values (exactly like this problem) is one of the places rounding errors can get really bad.
I also said at the time that the speed difference in that demo disappeared when changing from float to an integer type.
At least part of the reason: integer addition is associative (at least at the machine level, where overflow wraps around). The compiler is free to change the order of operations all it wants, because the result of the calculation will be the same. It might have done that optimization for us.
The compiler won't rearrange floating point operations because of the implicit contract you have with it: the compiled code is going to produce exactly the result you asked for.
The compiler can rearrange/change/skip operations as much as it likes, as long as the result is exactly the same.
The result: we'll generally see much more modest compiler optimizations around FP code. The compiler is almost always bound to do exactly the calculations you requested, in exactly the order you wrote them.
We'll talk more later about how to ask the compiler nicely to be less precise with floating point operations.