Floating-Point Numbers / An Infinite Number of Mathematicians Enter A Bar

tl;dr: The bits of a floating-point number have shifting exponents, with negative exponents holding the fractional portion of real numbers.

The Basics

Floating-point numbers aren’t always a precise representation of a given real or integer value. This is CS101 material, repeated on programming boards regularly: 0.1 + 0.2 != 0.3, etc. Less well known: in single precision, 16,777,219 == 16,777,220 and 4,000,000,100 == 4,000,000,000.
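If you’d like to verify those on your own machine, here’s a minimal C sketch (assuming IEEE 754 floats, which every mainstream platform provides) that prints the three comparisons:

```c
#include <stdio.h>

int main(void) {
    /* The classic double-precision surprise: 0.1 + 0.2 is actually
       0.30000000000000004..., so the comparison is false. */
    printf("0.1 + 0.2 == 0.3               -> %s\n",
           (0.1 + 0.2 == 0.3) ? "true" : "false");

    /* Single precision: above 2^24 not every integer is representable,
       so these two distinct integers round to the same float. */
    printf("16,777,219 == 16,777,220       -> %s\n",
           (16777219.0f == 16777220.0f) ? "true" : "false");

    /* Around 4 billion the spacing between adjacent floats is 256,
       so the trailing 100 is rounded away. */
    printf("4,000,000,100 == 4,000,000,000 -> %s\n",
           (4000000100.0f == 4000000000.0f) ? "true" : "false");
    return 0;
}
```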

Most of us are aware of this limitation, yet many don’t understand why, whether they missed that class or have forgotten it since. Throughout my career I’ve been surprised at the number of very skilled, intelligent peers – brilliant devs – who have treated this as a black box, or held an incorrect mental model of how these numbers work. More than one has believed, for instance, that such numbers store a fraction as two integers.

So here’s a quick visual and interactive explanation of floating-point numbers, in this case the dominant IEEE 754 variants: binary32, binary64, and a quick mention of binary16. I decided to author this as a lead-in to an upcoming entry about financial calculations, where these concerns become critical and I’ll want something to point back to.

I’m going to start with a relevant joke (I’d give credit if I knew the origin)-

An infinite number of mathematicians walk into a bar. The first orders a beer, the second orders half a beer, the third orders a quarter of a beer, and so on. The bartender pours two beers and says “know your limits.”

In the traditional unsigned integer that we all know and love, each successively more significant bit is worth double the amount of the one before (powers of 2) when set, going from 2^0 (1) for the least significant bit, to 2^(bitsize-1) for the most significant bit (e.g. 2^31 for a 32-bit unsigned integer). If the integer were signed the top bit would be reserved for the negative flag (which entails a discussion about two’s-complement representations that I’m not going to enter into, so I’ll steer clear of negative integers).
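As a concrete (if less entertaining) companion to the toggleable version below, here’s a small C sketch of the same place-value arithmetic; the value 0xB5 is just an arbitrary example of mine:

```c
#include <stdio.h>

int main(void) {
    unsigned char value = 0xB5;   /* binary 1011 0101 = 181, an arbitrary example */

    /* Walk the bits from most to least significant and sum their weights. */
    unsigned int total = 0;
    for (int bit = 7; bit >= 0; bit--) {
        unsigned int weight = 1u << bit;        /* 2^bit */
        int set = (value >> bit) & 1u;
        if (set)
            total += weight;
        printf("bit %d (2^%d = %3u): %d\n", bit, bit, weight, set);
    }
    printf("total = %u\n", total);   /* 128 + 32 + 16 + 4 + 1 = 181 */
    return 0;
}
```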

An 8-bit integer might look like (the bits can be toggled for no particular reason beyond keeping the attention of readers)-

Why stop at 2^0 (1) for the LSB? What if we used 16 bits and reserved the bottom 8 bits for negative exponents? For those who remember basic math, a negative exponent n^-x is equal to 1/n^x, e.g. 2^-3 = 1/2^3 = 1/8.

Behold, a binary fixed-point number: Click on some of the negative exponents to yield a real number. The triangle marks the boundary between whole and fractional exponents. In this case the number has a precision of 1/256, and a max magnitude of 1/256th under 256.
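Here’s the same idea in code: a minimal sketch of a 16-bit “8.8” fixed-point value (8 whole bits, 8 fractional bits), where the stored integer is simply the real value scaled by 2^8 = 256. The example value 3.75 is my own choice:

```c
#include <stdio.h>

int main(void) {
    /* 8.8 fixed point: top 8 bits are the whole part, bottom 8 the fraction. */
    unsigned short raw = (3 << 8) | 0xC0;   /* 3 + 128/256 + 64/256 = 3.75 */

    double value = raw / 256.0;
    printf("raw = 0x%04X, value = %f\n", (unsigned)raw, value); /* 0x03C0, 3.750000 */

    /* The smallest representable step is 1/256... */
    printf("precision = %f\n", 1 / 256.0);                      /* 0.003906 */
    /* ...and the largest value is 1/256 under 256. */
    printf("max = %f\n", 0xFFFF / 256.0);                       /* 255.996094 */
    return 0;
}
```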

We’re halfway there.

A floating-point number has, as the name states, a floating point. This entails storing a separate value detailing the shift of the exponents (thus defining where the point lies — the separation between the exponent 0 and negative exponents, if any).

Before we get into that, one basic about floating-point numbers: They have an implicit leading binary 1. If a floating-point value had only 3 value/fraction bits and they were set to 000, the actual significand would be binary 1000 (the implicit 1 followed by the three stored bits), courtesy of this leading implicit bit.

To explain the structure of a floating-point number, a binary32 — aka single-precision — floating-point number has 23 mantissa bits (the actual value, sometimes called the fraction) and an implicit additional top bit of 1 as mentioned, ergo 24 bits defining the value. These stored bits are the bottom 23 bits of the value: bits 0-22.

The exponent shift of a single-precision value occupies 8 bits, stored with a bias of 127: an encoded value of 127 (01111111) means a shift of 0, so the exponent shift = stored value - 127 (below 127 is incrementally negative, above is incrementally positive). A shift of 0 indicates that the [binary point / separation between the 2^0 bit and negative exponents] lies directly after the implicit leading 1, while shifts below 0 move it successively to the left (smaller magnitudes) and shifts above 0 move it to the right (larger magnitudes). The exponent bits sit above the mantissa, occupying bits 23-30.

At the very top, in the most significant bit, lies a flag indicating whether the value is negative. Unlike with two’s-complement integers, with floating-point numbers this single bit simply flips the sign, leaving the magnitude untouched. This is bit 31.
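Putting the three fields together, here’s a small C sketch that pulls a binary32 apart into its sign, biased exponent, and mantissa (using memcpy to reinterpret the bits; the dissect helper and the example values are mine):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Pull apart the three fields of an IEEE 754 binary32 value. */
static void dissect(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 bits */

    uint32_t sign     = bits >> 31;          /* bit 31 */
    uint32_t exponent = (bits >> 23) & 0xFF; /* bits 23-30, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* bits 0-22 */

    printf("%14.6f  sign=%u  exponent=%3u (shift %+4d)  mantissa=0x%06X\n",
           f, sign, exponent, (int)exponent - 127, mantissa);
}

int main(void) {
    dissect(1.0f);      /* sign=0  exponent=127 (shift  +0)  mantissa=0x000000 */
    dissect(65535.0f);  /* sign=0  exponent=142 (shift +15)  mantissa=0x7FFF00 */
    dissect(-0.25f);    /* sign=1  exponent=125 (shift  -2)  mantissa=0x000000 */
    return 0;
}
```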

“But how can a floating-point value hold 0 if the high bit of the value/mantissa/fraction is always 1?”

If all bits are set to 0 (the flag, the exponent shift and the value) it represents 0, and if just the flag is 1 it represents -0. If the exponent shift is all 1s, the value is either Inf (when the fraction bits are all 0) or NaN (when any fraction bit is set). Those are the magic numbers of floating point.
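These magic patterns are easy to poke at directly. A minimal sketch that builds floats from raw 32-bit patterns (the exact textual output for inf/nan varies by C library):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Build a float directly from a 32-bit pattern. */
static float from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    printf("%f\n", from_bits(0x00000000)); /* all zeros                      -> 0.000000  */
    printf("%f\n", from_bits(0x80000000)); /* only the sign flag             -> -0.000000 */
    printf("%f\n", from_bits(0x7F800000)); /* exponent all 1s, fraction 0    -> inf       */
    printf("%f\n", from_bits(0xFF800000)); /* same, with the sign flag set   -> -inf      */
    printf("%f\n", from_bits(0x7FC00000)); /* exponent all 1s, fraction != 0 -> nan       */
    return 0;
}
```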

Let’s look at a floating-point number, starting with one holding the integer value 65535, with no fractional part.

With this sample you have the ability to change the exponent shift — the 8-bit shift integer of the single-precision floating point — to see the impact. Note that if you were going to use this shift in an actual single-precision value, you would need to add 127 to the value (e.g. 10 would become 137, and -10 would be 117).

The red-bordered box indicates the implicit bit that isn’t actually stored in the value. In the default state it’s notable that with a magnitude of 65535 — the integer portion occupying 15 stored bits plus the 1 implicit bit — the precision is 1/256th.

If instead we stored 255, the precision jumps to 1/65536. The precision is dictated by the magnitude of the value.

To present an extreme example, what if we represented the population of the Earth-

Precision has dropped to 2^9 (512). Only increments of 512 can be stored.
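Tying the last three examples together, nextafterf from math.h reports the gap to the next representable single-precision value, which is exactly the precision at a given magnitude (I’m taking roughly 7.5 billion for the Earth’s population):

```c
#include <stdio.h>
#include <math.h>

/* The gap between a single-precision value and the next representable
   value above it: the precision at that magnitude. */
static void gap(const char *label, float f) {
    printf("%-16s spacing = %.10f\n", label, nextafterf(f, INFINITY) - f);
}

int main(void) {
    gap("255",           255.0f);        /* 0.0000152588 = 1/65536 */
    gap("65535",         65535.0f);      /* 0.0039062500 = 1/256   */
    gap("7,500,000,000", 7500000000.0f); /* 512.0000000000         */
    return 0;
}
```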

More recently the industry has seen an interest in something called half-precision floating point values, particularly in compute and neural net calculations.

Half-precision floating point values offer a very limited range, but fit in just 16 bits.
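Standard C doesn’t portably expose a half-precision type, so rather than manipulating binary16 values directly, this sketch simply derives the format’s limits from its layout (1 sign bit, 5 exponent bits with a bias of 15, 10 fraction bits):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* binary16 layout: 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits. */
    const int fraction_bits = 10;
    const int max_exponent  = 15;   /* largest usable shift (11111 is reserved for Inf/NaN) */
    const int min_exponent  = -14;  /* smallest normal shift (the all-zeros exponent is reserved) */

    /* Largest value: all fraction bits set at the top exponent. */
    double max_half = ldexp(2.0 - ldexp(1.0, -fraction_bits), max_exponent);
    /* Smallest normal value: just the implicit 1 at the bottom exponent. */
    double min_half = ldexp(1.0, min_exponent);

    printf("largest binary16 value:      %g\n", max_half);                     /* 65504       */
    printf("smallest normal binary16:    %g\n", min_half);                     /* 6.10352e-05 */
    printf("spacing between 1.0 and 2.0: %g\n", ldexp(1.0, -fraction_bits));   /* 0.000976562 */
    return 0;
}
```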

That’s the basic operation of floating point numbers: a set of value bits whose exponent range can be shifted. Double-precision (binary64) floating points up the value/fraction storage to 53 bits (52 bits stored, plus the 1 implicit bit) and the exponent shift to 11 bits, offering far greater precision and/or magnitude, coming closer to the infinite number of mathematicians when it comes to representing small-scale numbers. I am not going to simulate such a number here as it would exceed the bounds of reader screens.
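The standard float.h header exposes these parameters directly, if you’d like to confirm them on your own platform:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* Mantissa width, counting the implicit leading 1, and the spacing just
       above 1.0 (epsilon = 2^-(mantissa bits - 1)). */
    printf("single: %d bits of mantissa, epsilon = %g\n", FLT_MANT_DIG, FLT_EPSILON);
    printf("double: %d bits of mantissa, epsilon = %g\n", DBL_MANT_DIG, DBL_EPSILON);

    /* single: 24 bits, epsilon = 1.19209e-07 */
    /* double: 53 bits, epsilon = 2.22045e-16 */
    return 0;
}
```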

Hopefully this has helped.

Basic Things You Should Know About Floating-Point Numbers

  • Single-precision floating point numbers can precisely hold whole numbers from -16,777,215 to 16,777,215, with zero ambiguity. Many users wrongly conclude from the fractional surprises that all numbers are approximations, but many values, including fractional ones that fit within the magnitude (e.g. 0.25, 0.75, 0.0042572021484375), can be stored precisely. The key is that the number is the decimal representation of a fraction whose denominator is a power of 2 and that lies within the precision band of its magnitude (see the sketch after this list).
  • Double-precision floating point numbers can precisely hold whole numbers from -9,007,199,254,740,991 to 9,007,199,254,740,991. You can very easily calculate the precision available at a given magnitude: if the magnitude is between 1 and 2, for example, the spacing is 1/4,503,599,627,370,496, which for the vast majority of uses is well within any reasonable bounds (again, see the sketch after this list).
  • Every number in JavaScript is a double-precision floating point. Your counters, consts, and other seeming integers are DPs. If you use bitwise operations the engine will temporarily treat the value as a 32-bit integer as a hack. Some decimal libraries layer atop this to provide very inefficient, but sometimes necessary, decimal representations.
  • Decimal types can be mandated in some domains, but represent a dramatic speed compromise (courtesy of the reality that our hardware is extremely optimized for floating-point math). With some analysis of the precision required for a given task, and intelligent rounding rules, double-precision is more than adequate for most purposes. There are scenarios where you can pursue a hybrid approach: in an extremely high Internal Rate of Return calculation I use SP to get to an approximate solution, and then decimal math to get an absolutely precise solution (the final, smallest leg).
  • On most modern processors double-precision calculations run at approximately half the speed of single-precision calculations (presuming that you’re using SIMD, where an AVX unit may do 8 DP calculations per cycle, or 16 SP calculations per cycle). Half-precision calculations, however, do not offer any speed advantage beyond reducing the memory throughput and storage required; the instructions to pack and unpack binary16 are a relatively new addition.
  • On most GPUs, double-precision calculations are dramatically slower than single-precision calculations. While most processors have floating point units that perform single-precision calculations on double-precision (or wider) hardware, most offering SIMD to do many calculations at once, GPUs were built for single-precision work and use entirely separate hardware for double-precision calculations, hardware that is often in short supply (most GPUs offer 1/24 to 1/32 the number of DP units). On the flip side, most GPUs use SIMD on single-precision hardware to do multiple half-precision calculations, offering the best performance of all.
  • Some very new compute-focused devices offer spectacular DP performance. The GP100 from NVIDIA offers roughly 5 TFLOPS of DP calculations, 10 TFLOPS of SP calculations, and 20 TFLOPS of half-precision calculations. These are incredible new heights.
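As a footnote to the first two points above, here’s a small C sketch that reports the precision band (the spacing between adjacent representable values) at a few magnitudes, in both single and double precision; the precision_band helper and the chosen magnitudes are mine:

```c
#include <stdio.h>
#include <math.h>

/* Spacing between adjacent representable values at a given magnitude,
   for both single and double precision. */
static void precision_band(double x) {
    float xf = (float)x;
    printf("near %22.1f: single spacing = %-14.10g double spacing = %.10g\n",
           x,
           nextafterf(xf, INFINITY) - xf,
           nextafter(x, INFINITY) - x);
}

int main(void) {
    precision_band(1.0);                /* ~1.19e-07 vs ~2.22e-16 */
    precision_band(16777215.0);         /* 1 vs ~1.86e-09: the edge of the exact integer range for singles */
    precision_band(9007199254740991.0); /* ~1.07e+09 vs 1: the edge of the exact integer range for doubles */
    return 0;
}
```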