Understanding Floating-Point Numbers

Misunderstandings regarding floating-point numbers and their strengths and weaknesses are common, even among seasoned vets in the industry. One of the most common concerns the accuracy of whole numbers when stored in floating-point formats.

Let's step back to CS101 for a moment, and then we'll go through how floating-point numbers are stored, an understanding of which will help us understand their weaknesses and strengths, and just as importantly dispel some beliefs about weaknesses that just aren't true.


Binary integer representations, as we all know, are a sequence of bits where the least significant bit is worth, in a decimal sense, either 0 or 1. The next more significant bit is worth either 0 or 2. The next 0 or 4. Etc.

Each bit is either off and worth 0, or on and worth 2^position, where the position count starts at 0.

An 8-bit integer, for instance, might look like:

[Interactive figure: an 8-bit integer whose bits can be toggled on and off]

We typically display such values with the least significant bit on the right, just as the least significant decimal digit is on the right in base 10.
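To make the bit weights concrete, here is a minimal Python sketch that adds up 2^position for every bit that is on. The function name `bits_to_int` is made up for this illustration.

```python
def bits_to_int(bits: str) -> int:
    """Sum the weights of the set bits in an unsigned binary string.

    The string is written most significant bit first, so the rightmost
    character sits at position 0 and is worth 2**0.
    """
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total


print(bits_to_int("00000101"))  # 4 + 1 = 5
print(bits_to_int("11111111"))  # 255, the largest unsigned 8-bit value
```

Python's built-in `int("00000101", 2)` does the same job; the loop is spelled out only to show where each bit's value comes from.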

If the integer were signed — if it needs to have the ability to hold negative values — the top bit would be reserved for the negative flag, and would necessitate a separate discussion about two’s-complement, though that isn't used or necessary for floating-point types so we'll leave it at that.

Why stop at 2^0 (1) for the least significant bit? What if we used some bits for negative exponents? For those who remember basic math, 2 raised to a negative exponent -x is equal to 1/(2^x), e.g. 2^-3 = 1/(2^3) = 1/8 (i.e. 0.125).

Behold, a fixed-point binary number!

[Interactive figure: an 8-bit fixed-point number whose bits can be toggled; a triangle marks the separation between the whole and fractional components]

In this case the number has a precision of 1/16 and a maximum magnitude of 15.9375, which is 1/16 under 16.

Because the negative exponents represent a finite set of fractions, however, a given number is stored as its closest representation. In this case, with 4 whole bits and 4 fractional bits, we can store the integers 0 to 15 precisely, but can't store larger numbers. We can store the fractional portion of decimal fractions to the closest 1/16th. If a value happens to be exactly a multiple of 1/16 it is stored precisely: 1.3125, for example, is stored perfectly in this scenario as 0001 0101. Reading that value back, though, we wouldn't know whether the original number was 1.3125 or 1.3128 or 1.3122.
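The 4.4 fixed-point scheme can be sketched in a few lines of Python. This is an illustrative toy rather than a real fixed-point library: the names `to_fixed` and `from_fixed` are invented here, and it handles only non-negative values.

```python
FRACTION_BITS = 4            # 4 fractional bits, so steps of 1/16
SCALE = 1 << FRACTION_BITS   # 16
MAX_RAW = 0xFF               # 8 bits in total: 4 whole + 4 fractional


def to_fixed(value: float) -> int:
    """Quantize a non-negative value to the nearest 1/16 and return the raw 8-bit pattern."""
    raw = round(value * SCALE)
    if not 0 <= raw <= MAX_RAW:
        raise OverflowError(f"{value} does not fit in 4.4 fixed point")
    return raw


def from_fixed(raw: int) -> float:
    """Convert a raw 8-bit pattern back to the value it represents."""
    return raw / SCALE


for original in (1.3125, 1.3128, 1.3122):
    raw = to_fixed(original)
    print(f"{original} -> {raw:08b} -> {from_fixed(raw)}")
```

All three inputs land on the same bit pattern, 00010101, and read back as 1.3125, which is exactly the loss of information described above.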

Fixed-point numbers are uncommon in contemporary use (in most languages they require add-on libraries, and they often offer significantly lower performance given the lack of hardware support). Instead we use floating-point numbers, in which, as the name declares, the point moves around depending upon the needs of the number stored.

Let Your Float Bit Slide 🎶

A floating-point number has, as the name states, a floating point. This entails storing a separate value — reserving some of the storage bits — that denotes the shift of the exponents, allowing the core value to vary in magnitude from larger numbers than the value space could traditionally hold, to much smaller fractional numbers.

The larger in magnitude the number stored gets, the coarser and less precise it will be.

There are two types of floating-point numbers in common use: single precision (binary32) and double precision (binary64).

More recently there has been rising use of half precision (binary16), which holds the exponent offset, fraction and sign in 16 bits.

Half-Precision Floating-Point Numbers

We'll start with the half-precision floating-point number. This type is seeing increasing use because its precision is satisfactory for some deep learning applications, and on appropriate hardware it can significantly improve performance and reduce memory usage.

Before we get into that, one basic about floating-point numbers: they have an implicit leading binary 1. If a floating-point value had only 3 value/fraction bits and they were set to 000, the actual value bits would be 1000 in binary, courtesy of this leading implicit bit.

To explain the structure of a floating-point number, a binary32 — aka single-precision — floating-point number has 23 mantissa bits (the actual value, sometimes called the fraction) and an implicit additional top bit of 1 as mentioned, ergo 24 bits defining the value. These are the bottom 23 bits in the value: bits 0-22.

The exponent shift of a single-precision value occupies 8 bits. Rather than storing it as a signed two's-complement integer, the IEEE 754 standard uses a biased encoding where 127 (i.e. 01111111) = 0, such that the exponent shift = stored value - 127 (so below 127 is incrementally negative, above is incrementally positive). A stored value of 127 indicates that the binary point (the separation between the 2^0 bit and the negative exponents) lies directly after the implicit leading 1, while values above 127 move it successively to the right, giving larger magnitudes, and values below 127 move it to the left, giving smaller ones. The exponent bits sit above the mantissa, occupying bits 23-30.

At the very top, in the most significant bit, lies a flag indicating whether the value is negative or not. Unlike the two's-complement values seen in pure integers, in a floating-point number this single bit simply negates the value, leaving the magnitude bits untouched. This is bit 31.
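As a rough illustration of the layout just described, the following Python sketch uses the standard `struct` module to reinterpret a single-precision value as a 32-bit integer and slice out the three fields. The helper name `float32_fields` is made up for this example.

```python
import struct


def float32_fields(value: float) -> tuple:
    """Return (sign, biased_exponent, fraction) for value stored as binary32."""
    # Pack as a 32-bit float, then reread the same 4 bytes as an unsigned integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    sign = raw >> 31                       # bit 31
    biased_exponent = (raw >> 23) & 0xFF   # bits 23-30
    fraction = raw & 0x7FFFFF              # bits 0-22 (implicit leading 1 not stored)
    return sign, biased_exponent, fraction


sign, biased_exponent, fraction = float32_fields(65535.0)
print(sign)                    # 0
print(biased_exponent - 127)   # 15, the actual exponent shift
print(f"{fraction:023b}")      # 11111111111111100000000
```

For 65535 the stored fraction holds fifteen 1 bits followed by eight 0 bits; the sixteenth 1 is the implicit leading bit and never appears in the stored pattern.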

“But how can a floating-point value hold 0 if the high bit of the value/mantissa/fraction is always 1?”

If all bits are set to 0 (the flag, the exponent shift and the value) it represents the value 0, and if just the flag is 1 it represents -0. If the exponent shift is all 1s, the value is Inf when the fraction bits are all 0 and NaN when any fraction bit is set. Those are the magic numbers of floating points.
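These special bit patterns can be checked directly. The sketch below goes the other way from a raw 32-bit pattern to the value it encodes; `float32_from_bits` is a name invented for this illustration, and the NaN shown is just one of many bit patterns that decode to NaN.

```python
import math
import struct


def float32_from_bits(raw: int) -> float:
    """Interpret a 32-bit unsigned integer as a binary32 value."""
    (value,) = struct.unpack(">f", struct.pack(">I", raw))
    return value


print(float32_from_bits(0x00000000))  # 0.0   (all bits clear)
print(float32_from_bits(0x80000000))  # -0.0  (only the sign flag set)
print(float32_from_bits(0x7F800000))  # inf   (exponent all 1s, fraction 0)
print(float32_from_bits(0xFF800000))  # -inf  (same, with the sign flag set)
print(float32_from_bits(0x7FC00000))  # nan   (exponent all 1s, fraction non-zero)
```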

Let's look at a floating-point number, starting with one holding the integer value 65535, with no fractional part.

[Interactive figure: a single-precision floating-point value holding 65535, with an adjustable exponent shift]

The figure works with the exponent shift directly, i.e. the 8-bit shift integer of the single-precision floating point. Note that if you were going to use this shift in an actual single-precision value, you would need to add 127 to it (e.g. 10 would become 137, and -10 would be 117).

The red bordered box indicates the implicit bit that isn't actually stored in the value. In the default state it's notable that with a magnitude of 65535, where the integer portion occupies 15 real bits plus the 1 implicit bit, only 8 bits remain for the fraction, so the max precision is 1/256th.

If instead we stored 255, the precision jumps to 1/65536. The precision is dictated by the magnitude of the value.

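To present an extreme example, what if we represented the population of the Earth?

[Interactive figure: a single-precision floating-point value holding the population of the Earth]

Precision has dropped to 2^9, or 512. Only increments of 512 can be stored.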
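The three precisions quoted above (1/65536, 1/256 and 512) fall out of the exponent alone: with 23 stored fraction bits, the gap between neighbouring single-precision values is 2^(exponent - 23). Here is a small Python sketch of that arithmetic. The function name `float32_step` is invented, it ignores zero, subnormal and non-finite inputs, and the figure of 8 billion is only a round stand-in for the population of the Earth.

```python
import math
import struct


def float32_step(value: float) -> float:
    """Gap between adjacent binary32 values near value (normal, finite, non-zero only)."""
    # Round to the nearest binary32 value first.
    (as_float32,) = struct.unpack(">f", struct.pack(">f", value))
    # frexp returns m, e with as_float32 == m * 2**e and 0.5 <= |m| < 1,
    # so the exponent shift in the 1.xxx form is e - 1.
    _, e = math.frexp(as_float32)
    return 2.0 ** ((e - 1) - 23)


print(float32_step(255.0) == 1 / 65536)   # True
print(float32_step(65535.0) == 1 / 256)   # True
print(float32_step(8_000_000_000.0))      # 512.0

# Adding 100 people to 8 billion is lost entirely in single precision.
(rounded,) = struct.unpack(">f", struct.pack(">f", 8_000_000_100.0))
print(rounded == 8_000_000_000.0)         # True
```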

More recently the industry has seen an interest in something called half-precision floating point values, particularly in compute and neural net calculations.

Half-precision floating-point values offer a very limited range (the largest finite value is 65504), but fit in just 16 bits.
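Python's `struct` module supports binary16 through the `e` format character, which makes it easy to see how coarse half precision gets. In binary16 the 16 bits are split into 1 sign bit, 5 exponent bits and 10 stored fraction bits. The helper name `to_half` below is made up for this sketch.

```python
import struct


def to_half(value: float) -> float:
    """Round value to the nearest binary16 value and return it as a Python float."""
    (rounded,) = struct.unpack("<e", struct.pack("<e", value))
    return rounded


print(to_half(2048.0))   # 2048.0, stored exactly
print(to_half(4097.0))   # 4096.0, because steps are 4 wide between 4096 and 8192
print(to_half(65504.0))  # 65504.0, the largest finite binary16 value
```

With only 11 value bits (10 stored plus the implicit 1), whole numbers stop being exactly representable above 2048.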

That's the basic operation of floating-point numbers. It's a set of value bits where the exponent range can be shifted. Double-precision (binary64) floating-point values up the value/fraction storage to 53 bits (52 bits stored, plus the 1 implicit bit) and the exponent shift to 11 bits, offering far greater precision and/or magnitude, and coming closer to the mathematician's ideal of unlimited precision when representing small-scale numbers. I am not going to simulate such a number here as it would exceed the bounds of reader screens.
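Returning to the accuracy of whole numbers: with 53 value bits, a double stores every integer up to 2^53 exactly, and only beyond that do gaps appear. The short Python check below assumes, as is the case on essentially all current platforms, that Python's `float` is a binary64 value.

```python
limit = 2 ** 53  # 53 value bits: 52 stored plus the implicit leading 1

# Every integer up to the limit survives the trip through a double.
print(float(limit - 1) == limit - 1)        # True

# Past the limit the step size is 2, so adding 1 is rounded away.
print(float(limit) + 1.0 == float(limit))   # True
print(float(limit) + 2.0 == limit + 2)      # True
```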

Hopefully this has helped.

Basic Things You Should Know About Floating-Point Numbers