# Understanding Floating-Point Numbers

Misunderstandings regarding floating-point numbers and their strengths and weaknesses are common. This entry was a resource I threw together for some peers that I thought might be generally useful for others. I apologize for the suboptimal usability on mobile devices as some of the content is necessarily wide.

## CS101

Let's step back to CS101 for a moment and discuss basic integers, and then we'll go through how floating-point numbers are stored, which will help us understand their weaknesses and strengths. Just as importantly we will dispel some beliefs about weaknesses that just aren't true.

Binary integer representations are a sequence of bits where the least significant bit is worth, in a decimal sense, either 0 or 1. The next is worth either 0 or 2. The next 0 or 4. And so on.

Each bit is either off and worth 0, or on and worth 2^position (two to the power of position, or 2**position), where the position count starts at 0.

A 16-bit unsigned integer, for instance, might look like this:

(Click on the bits to change values! This is an interactive tutorial)

We typically display such values with the least significant bit on the right, just as the least significant decimal digit is on the right in base 10.
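The positional weighting described above can be sketched in a few lines of Python. The function name `bits_to_int` is made up for this illustration, and the bit string is written with the least significant bit on the right, as in the display convention above.

```python
def bits_to_int(bits: str) -> int:
    """Sum 2**position for every set bit; position 0 is the rightmost bit."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


# Bits 8 and 0 are set, so the value is 2**8 + 2**0 = 256 + 1.
print(bits_to_int("0000000100000001"))  # 257
# All 16 bits set gives the largest 16-bit unsigned value.
print(bits_to_int("1111111111111111"))  # 65535
```

Python's built-in `int(bits, 2)` does the same job; the loop is spelled out only to make the per-bit weights visible.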

If the integer were signed — if it needs the ability to hold negative values — the top bit would be reserved for the negative flag and would necessitate a separate discussion about two’s-complement, though that isn't used for floating-point types so that's a discussion for another day.

Why stop at 2^0 (1) for the least significant bit? What if we used some bits for negative exponents? For those who remember basic math, a negative exponent n^-x is equal to 1/n^x, e.g. 2^-3 = 1/(2^3) = 1/8 (i.e. 0.125).

Imagine a 16-bit number where we dedicate half of the bits to negative exponents.

Behold, a fixed-point binary number! Click on some of the negative exponents to yield a decimal fraction. The triangle denotes the separation between positive and negative exponents — which is another way to say between the whole number components and fractional/decimal components — and will be used in all following examples. In this case the number has a precision of 1/256, and a max magnitude of 1/256th under 256.

Because the negative exponents represent a finite set of fractions, however, a given number is stored as its closest representation. In this case we can store the integers 0 to 255 precisely, but can't store larger numbers. We can store the fractional portion of a value to the closest 1/256th.
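As a rough sketch of the 8.8 fixed-point layout described above (8 whole-number bits, 8 fractional bits), the value can be stored as a plain 16-bit integer that counts 1/256ths. The helper names `to_fixed_8_8` and `from_fixed_8_8` are hypothetical, chosen for this example only.

```python
FRACTION_BITS = 8
SCALE = 1 << FRACTION_BITS  # 256 steps per whole number


def to_fixed_8_8(x: float) -> int:
    """Encode a non-negative value as an unsigned 8.8 fixed-point integer."""
    raw = round(x * SCALE)  # snap to the closest 1/256th
    if not 0 <= raw <= 0xFFFF:
        raise OverflowError(f"{x} does not fit in unsigned 8.8 fixed point")
    return raw


def from_fixed_8_8(raw: int) -> float:
    """Decode the raw 16-bit integer back into a regular number."""
    return raw / SCALE


print(from_fixed_8_8(to_fixed_8_8(3.14159)))  # 3.140625, the closest 1/256th
print(from_fixed_8_8(0xFFFF))                 # 255.99609375, 1/256th under 256
```

Anything at or above 256 raises `OverflowError` here, which mirrors the hard magnitude ceiling of a fixed-point layout.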

Fixed-point numbers are uncommon in contemporary use, usually requiring expansion libraries in most languages, and are hindered by the lack of hardware support.

Most hardware and software is optimized and designed for floating-point numbers, which allow the exponent window to shift as needed: either larger in magnitude but coarser and less precise, or smaller in magnitude and more precise. The compromise of floating-point numbers is that they need to dedicate some of the bits to store that shift. A single-precision floating-point number uses 8 of its 32 bits to store the shift of the exponents, for instance.

## Let Your Float Bit Slide 🎶

A floating-point number has, as the name states, a floating point. This entails storing a separate value — reserving some of the storage bits — that denotes the shift of the exponents, allowing the core value to vary in magnitude from larger numbers than the value space could traditionally hold, to much smaller fractional numbers. It can adapt to the needs of what it is storing rather than being fixed.

The larger in magnitude the number stored, the coarser and less precise it will be as the window shifts to the left. Similarly, the smaller the number stored, the more precise fractions can be captured as the window slides to the right.
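To see the sliding window in numbers, here is a small sketch using `math.ulp`, which reports the gap between a float and the next representable one. It assumes Python 3.9 or newer, and because Python floats are double precision the gaps shown are for FP64. The sample values are arbitrary.

```python
import math

# Python floats are IEEE 754 doubles, so these gaps are for FP64.
samples = (0.001, 1.0, 1024.0, 2.0 ** 53)
gaps = {value: math.ulp(value) for value in samples}

for value, gap in gaps.items():
    print(f"{value!r:>22}  gap to the next value: {gap!r}")

# Near 1.0 the gap is 2**-52; at 2**53 it has grown to 2.0.
```

The gap doubles every time the magnitude crosses a power of two, which is the window shifting one position to the left.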

### The Common Floating Point Variants

There are a variety of floating-point number types in common use. These feature a sign (1 bit) which defines whether it's a positive or negative number (if the sign bit is 1 the number is negative, and if 0 it is positive), an exponent offset which dictates the amount that the decimal point "floats" (i.e. it defines where the exponents start to be negative), and fraction/mantissa bits which define the actual value itself. Note that the colors used in this visualization mirror those seen on Wikipedia for consistency purposes.

In effect the exponent offset dictates the *quantity* (range) of a number, while the fraction bits dictate the quality of a number. There are cases where more magnitude via exponent bits is more important than the precise quality of the fractional bits, and vice versa.

• Double precision (binary64, FP64) holding the sign (1 bit), exponent offset (11 bits) and the fraction (52 bits) in 64 bits. Every number in standard JavaScript is a double-precision floating-point number, so when you're in a for loop counting from 1 to 100, that value is actually an FP64. Which is okay because an FP64 can perfectly store every whole number between -9,007,199,254,740,991 and 9,007,199,254,740,991 (2^53 - 1).
• Single precision (binary32, FP32) holding the sign (1 bit), exponent offset (8 bits) and the fraction (23 bits) in 32 bits. When performance is critical, single-precision floating-point numbers come to the fore, where something like AVX2 can multiply and add eight separate single-precision numbers with eight other single-precision numbers in a single operation per AVX unit, with some CPUs having two AVX units per core. Many GPUs can perform trillions of single-precision operations per second. An FP32 can perfectly store every whole number between -16,777,215 and 16,777,215 (2^24 - 1). Exceed these bounds and you'll start counting by 2s, then 4s, then 8s, and so on, courtesy of the exponent window shift.
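The whole-number limits above can be checked with a short sketch. Python has no native single-precision type, so the hypothetical helper `to_float32` round-trips a value through the standard `struct` module to round it to the nearest FP32.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# FP64: 2**53 + 1 cannot be stored, so it collapses onto 2**53.
print(float(2 ** 53) == float(2 ** 53 + 1))  # True

# FP32: past 2**24 only every second whole number exists.
print(to_float32(16_777_215.0))  # 16777215.0
print(to_float32(16_777_217.0))  # 16777216.0 (rounded to a neighbour)
print(to_float32(16_777_218.0))  # 16777218.0
```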

### The Less Common, Generally Only ML / AI / NN Variants

Machine learning has unique needs. Many models are composed of billions of parameters. Value density matters both during training, where the smallest type that can capture the nuance and survive gradient descent is chosen, and after training, when quantization might substitute smaller types for larger ones if model accuracy is only moderately compromised.

• Half precision (binary16, FP16) holding the sign (1 bit), exponent offset (5 bits) and the fraction (10 bits) in 16 bits. This is most commonly seen in machine learning where the density of values allows for a doubling of parameters in the same memory footprint, while the precision is often satisfactory for gradient descent.
• BFloat16 (BF16) holding the sign (1 bit), exponent offset (8 bits) and the fraction (7 bits) in 16 bits. While FP16 uses 5 bits to store the exponent offset, BF16 uses 8 bits to do the same. This gives it a larger range at the cost of precision. It is seen in some machine learning scenarios, such as with the NVIDIA H100 where you can yield almost 2,000 TFLOPS using BF16 on tensor cores.
• TensorFloat-32 (TF32) holding the sign (1 bit), exponent offset (8 bits) and the fraction (10 bits) in 19 bits. TF32 is a misnomer as it isn't 32 bits, but it does have the range of an FP32 while having the precision of an FP16. TF32s exist on NVIDIA hardware in their Tensor cores, and when leveraged can yield high levels of performance.
• FP8 comes in two variants, each with a sign bit but a different exponent:fraction/mantissa ratio: E4M3 and E5M2. More exponent bits means more range but less precision, and vice versa.
• FP4 also comes in two variants, each with a sign bit but a different exponent:fraction/mantissa ratio: E2M1 and E3M0. This is really pushing the limits, but it can be seen in some scenarios where layers of models are quantized to such tiny types. And while 0 mantissa bits seems impossible, there is an implicit leading 1 outside of special situations (described later in this piece), so the exponent bits are effectively standing in for a single mantissa bit.
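The range-versus-precision trade-off between these layouts can be tabulated from the bit counts alone. The sketch below assumes plain IEEE 754 conventions (a bias of 2^(exponent bits - 1) - 1, with the all-ones exponent reserved for Inf/NaN). FP8 and FP4 are left out on purpose, since those tiny formats commonly repurpose the special encodings and these formulas would not necessarily hold. The function name `describe` and the `FORMATS` table are invented for this example.

```python
# (exponent bits, fraction bits) taken from the list above
FORMATS = {
    "FP64": (11, 52),
    "FP32": (8, 23),
    "TF32": (8, 10),
    "BF16": (8, 7),
    "FP16": (5, 10),
}


def describe(exponent_bits: int, fraction_bits: int):
    """Return (bias, largest finite value, gap just above 1.0)."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest finite value: every fraction bit set, at the top usable exponent.
    max_finite = (2 - 2.0 ** -fraction_bits) * 2.0 ** bias
    epsilon = 2.0 ** -fraction_bits
    return bias, max_finite, epsilon


for name, (e_bits, f_bits) in FORMATS.items():
    bias, max_finite, epsilon = describe(e_bits, f_bits)
    print(f"{name}: bias={bias:<5} max={max_finite:.4e} epsilon={epsilon:.4e}")
```

Comparing the FP16 and BF16 rows shows the trade directly: FP16 tops out at 65504 with a gap of 2^-10 above 1.0, while BF16 reaches roughly the FP32 range with a coarser gap of 2^-7.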

Floating-point numbers have an implicit leading binary 1 for the fractional component, outside of a couple of special cases which will be detailed later.

In the following representation of a half-precision value, where the 16 bits are distributed as 1 sign bit, 5 exponent bits, and 10 fraction bits, the implicit leading 1 is shown in the red-bordered box.

Play with the exponent offset, using the arrows to increase or decrease the magnitude of the fractional bits. The exponent offset is shown as the offset from the bias. Click on the sign and the fractional bits themselves. Watch how all of it impacts the resulting bit storage for this half-precision floating point number. Note that if you set all the controllable fractional values to 0, the number likely still has a value courtesy of the implicit bit unless you set the exponent offset to -15.

So how would you store zero?

Set all explicit fractional bits to zero -- you can't set the implicit bit which is there purely for visualization purposes, so disregard it. Now set the exponent offset to the lowest possible value, which for a half-precision FP is -15. This is a special magic condition for the value 0. Either positive zero if the sign bit is not set, or -0 if it is. And yes, you can click on the sign to the left of the number.

There is one other magic condition if all exponent bits are set to 1 where the value is either Inf or NaN. This example case always displays Inf for simplicity in that case.
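The rules above (implicit leading 1, the all-zeros exponent, the all-ones exponent) can be put together in a small decoder for half-precision bit patterns. This is a sketch with a made-up function name, `decode_half`. One detail goes beyond the text above: under IEEE 754, when the exponent bits are all zero and the fraction bits are not, the value is a tiny "subnormal" number rather than zero, and the decoder handles that case too.

```python
import struct


def decode_half(bits: int) -> float:
    """Decode a 16-bit half-precision pattern (1 sign, 5 exponent, 10 fraction)."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent_field = (bits >> 10) & 0x1F  # stored value, bias of 15
    fraction = bits & 0x3FF

    if exponent_field == 0x1F:  # all exponent bits set: Inf or NaN
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent_field == 0:     # offset -15: zero, or subnormal with no implicit 1
        return sign * (fraction / 1024) * 2.0 ** -14
    # Normal case: the implicit leading 1 plus the stored fraction.
    return sign * (1 + fraction / 1024) * 2.0 ** (exponent_field - 15)


for pattern in (0x0000, 0x8000, 0x3C00, 0xC000, 0x7BFF, 0x7C00):
    # Cross-check against Python's own half-precision support ("e" format).
    reference = struct.unpack("<e", struct.pack("<H", pattern))[0]
    print(f"{pattern:016b} -> {decode_half(pattern)!r} (struct says {reference!r})")
```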

### The Single Precision (FP32) Example

A binary32 — aka single-precision or FP32 — floating-point number has 23 mantissa bits (the actual value, sometimes called the fraction) and an implicit additional top bit of 1 as mentioned, ergo 24 bits defining the value. These are the bottom 23 bits in the value: bits 0-22.

The exponent shift of a single-precision value occupies 8 bits. Rather than a two's-complement signed integer, IEEE 754 stores it with a biased encoding where 127 (i.e. 01111111) = 0, such that the exponent shift = value - 127 (so below 127 is incrementally negative, above is incrementally positive). A value of 127 indicates that the binary point (the separation between the 0 and negative exponents) lies directly after the implicit leading 1, while values below 127 move it successively to the left and values above 127 move it to the right. The exponent offset bits sit above the mantissa, occupying bits 23-30.

At the very top (the most significant bit) lies a flag indicating whether the value is negative or not. Unlike the two’s-complement encoding seen in pure integers, in floating-point numbers this single bit simply negates the value, leaving every other bit unchanged. This is bit 31. The flag is set when the value is negative, clear if positive.

Let’s look at a floating-point number, starting with one holding the integer value 65535, with no fractional component.

With this sample you have the ability to change the exponent shift — the 8-bit shift integer of the single-precision floating point — to see the impact. Note that if you were going to use this shift in an actual single-precision value, you would need to add 127 to the value (e.g. 10 would become 137, and -10 would be 117).
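For readers without the interactive widget, the same three fields can be pulled out of a real single-precision value in Python. The helper name `float32_fields` is an invention for this sketch; it packs the number as FP32 with `struct` and then masks out the sign (bit 31), the stored exponent (bits 23-30) and the fraction (bits 0-22).

```python
import struct


def float32_fields(x: float):
    """Return (sign, stored exponent, fraction bits) of x as a single-precision value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    exponent_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent_field, fraction


sign, exponent_field, fraction = float32_fields(65535.0)
print(sign)                   # 0
print(exponent_field)         # 142, i.e. a shift of 15 plus the bias of 127
print(f"{fraction:023b}")     # 11111111111111100000000
```

The fraction shows fifteen stored 1 bits; together with the implicit leading 1 they make up the sixteen bits of 65535, leaving eight bits for fractions.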

The red bordered box indicates the implicit bit that isn’t actually stored in the value. In the default state again it’s notable that with a magnitude of 65535 — the integer portion occupying 15 real bits and the 1 implicit bit — the max precision is 1/256th.

If instead we stored 255, the precision jumps to 1/65536. The precision is dictated by the magnitude of the value.

To present an extreme example, what if we represented the population of the Earth-

Precision has dropped to 2^9, or 512. Only increments of 512 can be stored. The larger the magnitude, the coarser the precision.
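The three precision figures above (1/256 at 65535, 1/65536 at 255, and 512 at roughly eight billion) follow from one small formula: with 23 stored fraction bits, the step between neighbouring single-precision values is 2^(exponent - 23). Below is a sketch of that arithmetic; the function name `float32_spacing` is hypothetical, it only covers normal (non-zero, non-subnormal) values, and 8,000,000,000 is used as a stand-in for the population figure.

```python
import math


def float32_spacing(x: float) -> float:
    """Gap between adjacent single-precision values near x (normal range only)."""
    _, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    exponent = e - 1            # exponent of the leading 1 bit
    return 2.0 ** (exponent - 23)


print(float32_spacing(65535.0))          # 0.00390625, which is 1/256
print(float32_spacing(255.0))            # 1.52587890625e-05, which is 1/65536
print(float32_spacing(8_000_000_000.0))  # 512.0
```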

#### Impossibly Small Types

A few esoteric types have appeared in the machine learning space, particularly for memory density of massive models when a network can be quantized to smaller types. If a network is highly discerning and the neurons have been trained towards extremes, the wide degree of variability of larger types may not be necessary.

#### E3M0 (4 bit type)

Hopefully this has helped.

## Basic Things You Should Know About Floating-Point Numbers

• Single-precision floating point numbers can precisely hold whole numbers from -16,777,215 to 16,777,215, with zero ambiguity. Many users derive the wrong assumption from the fractional variances that all numbers are approximations, but many values, including fractional representations that fit within the magnitude (e.g. 0.25, 0.75, 0.0042572021484375, etc.), can be precisely stored. The key is that the number is the decimal representation of a fraction where the denominator is a power of 2 and lies within the precision band of a given magnitude.
• Double-precision floating point numbers can precisely hold whole numbers from -9,007,199,254,740,991 to 9,007,199,254,740,991. You can very easily calculate the precision allowed for a given magnitude (e.g. if the magnitude is between 1 and 2, the precision is within 1/4503599627370496, or 2^-52), which for the vast majority of uses is well within any reasonable bounds.
• Every number in JavaScript is a double-precision floating point (BigInt being the exception). Your counter, “const”, and other seeming integers are DPs. If you use bitwise operations the value is temporarily converted to a 32-bit integer (signed for most operators, unsigned for >>>) as a hack. Some decimal libraries layer atop this to present very inefficient, but sometimes necessary, decimal representations.
• Decimal types can be mandated in some domains, but represent a dramatic speed compromise (courtesy of the reality that our hardware is extremely optimized for floating-point math). With some analysis of the precision for a given task, and intelligent rounding rules, double-precision is more than adequate for most purposes. There are scenarios where you can pursue a hybrid approach: In an extremely high Internal Rate of Return calculation I use SP to get to an approximate solution, and then decimal math to get an absolutely precise solution (the final, smallest leg). Okay I'm overstating a bit — even an infinitely large decimal type can't store 1/3 precisely — however some decimal libraries let you calculate and store as many digits as you need, albeit very inefficiently.
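To make the "not everything is an approximation" point concrete, here is a minimal sketch that tests whether a decimal string survives storage in a float unchanged. It compares exact rational values using the standard `fractions` module. The helper name `is_exact` is made up for this example, and since Python floats are doubles the check is against FP64.

```python
from decimal import Decimal
from fractions import Fraction


def is_exact(text: str) -> bool:
    """True if the decimal string is stored in a double without any rounding."""
    return Fraction(float(text)) == Fraction(text)


for text in ("0.25", "0.75", "0.0042572021484375", "0.1", "9007199254740991"):
    print(f"{text:>20}  exact: {is_exact(text)}")

# Decimal(float) reveals the value that is actually stored for 0.1.
print(Decimal(0.1))
```

The first three fractions have power-of-two denominators (0.0042572021484375 is 279/65536), so they are stored exactly. 0.1 is not, because 1/10 has no finite binary expansion.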