Understanding Floating-Point Numbers

Misunderstandings regarding floating-point numbers and their strengths and weaknesses are common. This entry was a resource I threw together for some peers that I thought might be generally useful for others. I apologize for the suboptimal usability on mobile devices as some of the content is necessarily wide.


Let's step back to CS101 for a moment and discuss basic integers, and then we'll go through how floating-point numbers are stored, which will help us understand their weaknesses and strengths. Just as importantly we will dispel some beliefs about weaknesses that just aren't true.

Binary integer representations are a sequence of bits where the least significant bit is worth, in a decimal sense, either 0 or 1. The next is worth either 0 or 2. The next 0 or 4. And so on.

Each bit is either off and worth 0, or on and worth 2^position (two to the power of position, or 2**position), where the position count starts at 0.

A 16-bit unsigned integer, for instance, might look like this:

(Click on the bits to change values! This is an interactive tutorial)

We typically display such values with the least significant bit on the right, just as the least significant decimal value is on the right in base10.
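For anyone reading without the interactive widget, here is a minimal Python sketch of the same idea. The function name `bits_to_int` and the sample bit pattern are my own picks for illustration, not anything from the widget.

```python
def bits_to_int(bits: str) -> int:
    """Sum 2**position for every bit that is on; position 0 is the rightmost bit."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


example = "0000000100101100"   # hypothetical 16-bit pattern, chosen for illustration
print(bits_to_int(example))    # 256 + 32 + 8 + 4 = 300
print(int(example, 2))         # Python's built-in parser agrees
```

The loop is deliberately long-winded so that each bit's contribution is visible; in real code `int(bits, 2)` does the same job.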

If the integer were signed (that is, if it needed the ability to hold negative values), the top bit would indicate the sign. That would necessitate a separate discussion about two’s-complement, though that isn't used for floating-point types, so that's a discussion for another day.

Why stop at 2^0 (1) for the least significant bit? What if we used some bits for negative exponents? For those who remember basic math, a negative exponent n^-x is equal to 1/n^x, e.g. 2^-3 = 1/(2^3) = 1/8 (i.e. 0.125).

Imagine a 16-bit number where we dedicate one half of the bits to negative exponents.

Behold, a fixed-point binary number! Click on some of the negative exponents to yield a decimal fraction. The triangle denotes the separation between positive and negative exponents — which is another way to say between the whole number components and fractional/decimal components — and will be used in all following examples. In this case the number has a precision of 1/256, and a max magnitude of 1/256th under 256.

Because the negative exponents represent a finite set of fractions, however, a given number is stored as its closest representation. In this case we can store the integers 0 to 255 precisely, but can't store larger numbers. We can store the fractional portion of a number to the closest 1/256th.
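The 8.8 fixed-point layout described above can be sketched in a few lines of Python. This is an illustration under my own assumptions (unsigned, round-to-nearest, and the helper names `to_fixed` and `from_fixed` are made up), not a reference to any particular fixed-point library.

```python
FRACTION_BITS = 8              # eight bits for negative exponents, as in the example
SCALE = 1 << FRACTION_BITS     # 256 steps per whole number
MAX_RAW = (1 << 16) - 1        # largest 16-bit pattern


def to_fixed(value: float) -> int:
    """Return the 16-bit pattern closest to value (8 whole bits + 8 fraction bits)."""
    raw = round(value * SCALE)
    if not 0 <= raw <= MAX_RAW:
        raise OverflowError(f"{value} does not fit in unsigned 8.8 fixed point")
    return raw


def from_fixed(raw: int) -> float:
    """Convert a 16-bit pattern back to the value it represents."""
    return raw / SCALE


print(from_fixed(to_fixed(3.14159)))   # 3.140625, the closest multiple of 1/256
print(from_fixed(MAX_RAW))             # 255.99609375, i.e. 1/256 under 256
```

Storing the value as a scaled integer is the whole trick: the binary point never moves, so the step size is 1/256 everywhere in the range.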

Fixed-point numbers are uncommon in contemporary use, requiring add-on libraries in most languages, and are hindered by the lack of hardware support.

Most hardware and software is optimized and designed for floating-point numbers, which allow the exponent window to shift as needed: either larger in magnitude but coarser and less precise, or smaller in magnitude and more precise. The compromise of floating-point numbers is that they need to dedicate some of the bits to store that shift. A single-precision floating-point number uses 8 of its 32 bits to store the shift of the exponents, for instance.

Let Your Float Bit Slide 🎶

A floating-point number has, as the name states, a floating point. This entails storing a separate value — reserving some of the storage bits — that denotes the shift of the exponents, allowing the core value to vary in magnitude from larger numbers than the value space could traditionally hold, to much smaller fractional numbers. It can adapt to the needs of what it is storing rather than being fixed.

The larger in magnitude the number stored, the coarser and less precise it will be as the window shifts to the left. Similarly, the smaller the number stored, the more precise fractions can be captured as the window slides to the right.
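You can watch the window slide from a Python prompt. This sketch assumes Python 3.9 or newer (for `math.ulp`) and uses Python's own `float`, which is a double-precision value rather than the single-precision type discussed elsewhere in this entry; the sample values are arbitrary.

```python
import math  # math.ulp requires Python 3.9 or newer

# math.ulp(x) is the gap between x and the next representable value above it.
for value in (0.001, 1.0, 1000.0, 1e15, 2.0 ** 53):
    print(f"{value:>22} -> step to next value: {math.ulp(value)}")

# Once the step exceeds 1, adding 1 can no longer change the stored value.
big = 2.0 ** 53
print(big + 1 == big)   # True
```

The step doubles every time the magnitude crosses a power of two, which is the window shifting one position to the left.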

The Common Floating Point Variants

There are a variety of floating-point number types in common use. These feature a sign (1 bit) which defines whether it's a positive or negative number (if the sign bit is 1 the number is negative, and if 0 it is positive), an exponent offset which dictates the amount that the binary point "floats" (i.e. it defines where the exponents start to be negative), and fraction/mantissa bits which define the actual value itself. Note that the colors used in this visualization mirror those seen on Wikipedia for consistency purposes.

In effect the exponent offset dictates the quantity (range) of a number, while the fraction bits dictate the quality (precision) of a number. There are cases where more magnitude via exponent bits is more important than the precise quality of the fractional bits, and vice versa.
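To put numbers on that trade-off, here is a small sketch that derives range and precision from the bit budget alone. The `FloatFormat` class is hypothetical, the three formats listed are the usual IEEE 754 half, single, and double layouts, and the formulas assume IEEE-style conventions (bias of 2^(exponent bits - 1) - 1, top exponent code reserved for Inf/NaN).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Hypothetical helper describing a sign/exponent/fraction bit budget."""
    name: str
    exponent_bits: int
    fraction_bits: int

    @property
    def total_bits(self) -> int:
        return 1 + self.exponent_bits + self.fraction_bits   # 1 sign bit

    @property
    def bias(self) -> int:
        return 2 ** (self.exponent_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        # All fraction bits set, largest exponent below the Inf/NaN code.
        return (2 - 2.0 ** -self.fraction_bits) * 2.0 ** self.bias

    @property
    def step_at_one(self) -> float:
        # Gap between 1.0 and the next representable value.
        return 2.0 ** -self.fraction_bits


FORMATS = [
    FloatFormat("half (binary16)", 5, 10),
    FloatFormat("single (binary32)", 8, 23),
    FloatFormat("double (binary64)", 11, 52),
]

for fmt in FORMATS:
    print(f"{fmt.name:<18} bits={fmt.total_bits:<3} bias={fmt.bias:<5} "
          f"max={fmt.max_finite:.6g} step at 1.0={fmt.step_at_one:.3g}")
```

More exponent bits push `max_finite` up; more fraction bits push `step_at_one` down. No format gets both for free.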

The Less Common, Generally Only ML / AI / NN Variants

Machine learning has unique needs. Many models are composed of billions of parameters, so value density matters. During training, the smallest possible type that can capture the nuance and survive gradient descent is chosen, and after training a quantization pass might swap in smaller types for larger ones if model accuracy is only moderately compromised.

The Implicit Leading One

Floating-point numbers have an implicit leading binary 1 for the fractional component, outside of a couple of special cases which will be detailed later.

In the following representation of a half-precision value, where the 16 bits are distributed as 1 sign bit, 5 exponent bits, and 10 fraction bits, the implicit leading 1 is shown in the red-bordered box.

Play with the exponent offset, using the arrows to increase or decrease the magnitude of the fractional bits. The exponent offset is shown as the offset from the bias. Click on the sign and the fractional bits themselves. Watch how all of it impacts the resulting bit storage for this half-precision floating point number. Note that if you set all the controllable fractional values to 0, the number likely still has a value courtesy of the implicit bit unless you set the exponent offset to -15.

So how would you store zero?

Set all explicit fractional bits to zero -- you can't set the implicit bit which is there purely for visualization purposes, so disregard it. Now set the exponent offset to the lowest possible value, which for a half-precision FP is -15. This is a special magic condition for the value 0. Either positive zero if the sign bit is not set, or -0 if it is. And yes, you can click on the sign to the left of the number.

There is one other magic condition: if all exponent bits are set to 1, the value is either Inf (when all fraction bits are 0) or NaN (when any fraction bit is set). This example always displays Inf in that case, for simplicity.
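The rules for half precision, including the two magic conditions, fit in a short decoder. This is a sketch: `decode_half` and `reference_half` are names I made up, and the second one simply asks Python's `struct` module (whose `e` format is IEEE half precision) for a second opinion. Note one assumption beyond the text above: when the exponent bits are all 0 but some fraction bits are set, the implicit leading 1 is dropped rather than the value being forced to zero.

```python
import struct


def decode_half(bits: int) -> float:
    """Decode a 16-bit pattern: 1 sign bit, 5 exponent bits, 10 fraction bits."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent_field = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF

    if exponent_field == 0x1F:                 # all exponent bits set: Inf or NaN
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent_field == 0:                    # offset -15: no implicit leading 1
        return sign * (fraction / 1024) * 2.0 ** -14   # fraction 0 gives +0.0 or -0.0
    # Normal case: implicit leading 1, exponent offset is field minus the bias of 15.
    return sign * (1 + fraction / 1024) * 2.0 ** (exponent_field - 15)


def reference_half(bits: int) -> float:
    """Let the struct module decode the same 16 bits, as a cross-check."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for pattern in (0x0000, 0x8000, 0x3C00, 0xC000, 0x7BFF, 0x7C00):
    print(f"{pattern:016b} -> {decode_half(pattern)}")
```

The six sample patterns are positive zero, negative zero, 1.0, -2.0, the largest finite half (65504.0), and Inf.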

The Single Precision (FP32) Example

A binary32 (aka single-precision or FP32) floating-point number has 23 mantissa bits (the actual value, sometimes called the fraction) and an implicit additional top bit of 1 as mentioned, ergo 24 bits defining the value. The stored mantissa occupies the bottom 23 bits of the value: bits 0-22.

The exponent shift of a single-precision value occupies 8 bits. Rather than a two’s-complement signed integer, the standard stores it with a biased encoding where 127 (i.e. 01111111) = 0, such that the exponent shift = value - 127 (so below 127 is incrementally negative, above is incrementally positive). A value of 127 indicates that the binary point (the separation between the 2^0 position and the negative exponents) lies directly after the implicit leading 1, while values below 127 move it successively to the left, and values above 127 move it to the right. The exponent offset bits sit above the mantissa, occupying bits 23-30.

At the very top, in the most significant bit, lies a flag indicating whether the value is negative or not. Unlike with two’s-complement values seen in pure integers, with floating-point numbers a single bit negates the value. This is bit 31. The flag is set when the value is negative, clear if positive.
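Here is a quick way to pull those three fields out of a real single-precision value in Python. The helper name `float32_fields` is my own; the only library calls are `struct.pack` and `struct.unpack`, which reinterpret the 32 bits of a float as an unsigned integer.

```python
import struct


def float32_fields(value: float):
    """Return (sign, exponent_field, fraction) for value stored as binary32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    sign = bits >> 31                      # bit 31
    exponent_field = (bits >> 23) & 0xFF   # bits 23-30, biased by 127
    fraction = bits & 0x7FFFFF             # bits 0-22, implicit leading 1 not stored
    return sign, exponent_field, fraction


for sample in (65535.0, 1.0, -0.15625):
    sign, exponent_field, fraction = float32_fields(sample)
    print(f"{sample:>10}: sign={sign} exponent={exponent_field:08b} "
          f"(shift {exponent_field - 127:+d}) fraction={fraction:023b}")
```

For 65535.0 the exponent field is 142, i.e. a shift of +15, and the top 15 fraction bits are set; together with the implicit leading 1 that makes the sixteen 1 bits of 65535.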

Let’s look at a floating-point number, starting with one holding the integer value 65535, with no fractional part.

With this sample you have the ability to change the exponent shift — the 8-bit shift integer of the single-precision floating point — to see the impact. Note that if you were going to use this shift in an actual single-precision value, you would need to add 127 to the value (e.g. 10 would become 137, and -10 would be 117).

The red-bordered box indicates the implicit bit that isn’t actually stored in the value. In the default state, note that with a magnitude of 65535 the integer portion occupies 15 stored bits plus the 1 implicit bit, which leaves 8 of the 23 stored bits for the fraction, so the max precision is 1/256th.

If instead we stored 255, the precision jumps to 1/65536. The precision is dictated by the magnitude of the value.

To present an extreme example, what if we represented the population of the Earth?

Precision has dropped to 2^9, or 512. Only increments of 512 can be stored. The larger the magnitude, the coarser the precision.
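The three precision figures above (1/256 at 65535, 1/65536 at 255, and 512 at the population of the Earth) can be reproduced with a few lines of Python. The helpers `float32_step` and `to_float32` are hypothetical names, the step formula only holds for nonzero values in the normal range, and I'm assuming a round figure of 8 billion for the population.

```python
import math
import struct


def float32_step(value: float) -> float:
    """Gap between adjacent binary32 values near a nonzero, normal-range value."""
    _, exponent = math.frexp(abs(value))   # abs(value) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 1 - 23)      # leading bit position minus 23 fraction bits


def to_float32(value: float) -> float:
    """Round value to the nearest binary32 and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


POPULATION = 8_000_000_000                 # assumed round figure, for illustration

print(float32_step(65535))                 # 0.00390625 (1/256)
print(float32_step(255))                   # 1.52587890625e-05 (1/65536)
print(float32_step(POPULATION))            # 512.0
print(to_float32(POPULATION + 100))        # 8000000000.0, the extra 100 is lost
```

The last line is the practical consequence: a hundred extra people simply vanish, because the nearest storable values are 512 apart.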

Impossibly Small Types

A few esoteric types have appeared in the machine learning space, particularly for memory density of massive models when a network can be quantized to smaller types. If a network is highly discerning and the neurons have been trained towards extremes, the wide degree of variability of larger types may not be necessary.

E5M2 (8 bit type)

E2M1 (4 bit type)

E3M0 (4 bit type)
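To get a feel for how few values these tiny types hold, here is a generic decoder sketch. Heavy caveats apply: `decode_minifloat` is my own helper, the bias is a parameter because it varies by format, an all-zero exponent is treated as having no implicit leading 1, and Inf/NaN codes are deliberately ignored. Real ML formats differ in how they treat those special codes, so check the relevant specification before trusting this for any specific type.

```python
def decode_minifloat(bits: int, exponent_bits: int, mantissa_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa pattern. Assumes no Inf/NaN codes (simplification)."""
    sign = -1.0 if (bits >> (exponent_bits + mantissa_bits)) & 1 else 1.0
    exponent_field = (bits >> mantissa_bits) & ((1 << exponent_bits) - 1)
    mantissa = bits & ((1 << mantissa_bits) - 1)
    scale = 1 << mantissa_bits

    if exponent_field == 0:
        # Lowest exponent code: no implicit leading 1, so all-zero bits mean zero.
        return sign * (mantissa / scale) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / scale) * 2.0 ** (exponent_field - bias)


# Every non-negative E2M1 value, assuming a bias of 1.
e2m1 = [decode_minifloat(b, exponent_bits=2, mantissa_bits=1, bias=1) for b in range(8)]
print(e2m1)   # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Eight magnitudes plus a sign bit is the entire number line for a 4-bit type with one mantissa bit, which is why these only make sense when a network has been trained or quantized to tolerate it.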

Hopefully this has helped.

Basic Things You Should Know About Floating-Point Numbers