# Understanding Floating-Point Numbers

Floating-point numbers and their strengths and weaknesses are widely misunderstood, even among seasoned vets in the industry. One common misconception concerns the accuracy of whole numbers when stored in floating-point formats.

Let's step back to CS101 for a moment, and then walk through how floating-point numbers are stored. Understanding the storage format clarifies their real weaknesses and strengths, and just as importantly dispels some beliefs about weaknesses that simply aren't true.

## CS101

Binary integer representations, as we all know, are a sequence of bits where the least significant bit is worth, in a decimal sense, either 0 or 1. The next more significant bit is worth either 0 or 2. The next 0 or 4. Etc.

Each bit is either off and worth 0, or on and worth 2^{position}, where the position count starts at 0.

An 8-bit integer, for instance, might look like 00101101: the bits at positions 0, 2, 3, and 5 are on, so the value is 1 + 4 + 8 + 32 = 45.

We typically display such values with the least significant bit on the right, just as the least significant
decimal value is on the right in base_{10}.

If the integer were signed — if it needs to have the ability to hold negative values — the top bit would be reserved for the negative flag, and would necessitate a separate discussion about two’s-complement, though that isn't used or necessary for floating-point types so we'll leave it at that.

Why stop at 2^{0} (1) for the least significant bit? What if we used some bits for *negative* exponents?
For those who remember basic math, a negative exponent 2^{-x} is equal to 1/2^{x}, e.g.
2^{-3} = 1/(2^{3}) = 1/8 (i.e. 0.125).

Behold, a fixed-point binary number! Assign the upper four bits the exponents 2^{3} through 2^{0} and the lower four bits the exponents 2^{-1} through 2^{-4}, with the binary point sitting between the whole and fractional components. Setting negative-exponent bits yields a decimal fraction. In this arrangement the number has a precision of 1/16, and a max magnitude of 1/16th under 16 (i.e. 15.9375).

Because the negative exponents represent a finite set of fractions, however, a given number is stored as its closest representation. In this case, with 4 whole bits and 4 fractional bits, we can store the integers 0 to 15 precisely, but can't store larger numbers. We can store the fractional portion of decimal fractions to the closest 1/16th (so if a number happens to be precisely equivalent to a 1/16th increment it is stored exactly — e.g. 1.3125 is perfectly stored in this scheme as 0001 0101 — but we wouldn't know whether the original number was 1.3125, 1.3128, or 1.3122).
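This closest-representation behavior is easy to sketch in Python. The 4.4 format helpers below are illustrative, not any standard library:

```python
# Sketch of a 4.4 fixed-point format: 4 whole bits, 4 fractional bits.
# Values are stored as an integer count of 1/16 steps.

def to_fixed_4_4(x: float) -> int:
    """Round x to the nearest 1/16 and return the raw 8-bit pattern."""
    raw = round(x * 16)  # number of 1/16 steps
    if not 0 <= raw <= 0xFF:
        raise OverflowError("out of range for 4.4 fixed point")
    return raw

def from_fixed_4_4(raw: int) -> float:
    return raw / 16

# 1.3125 is exactly 21/16, so it round-trips precisely as 0001 0101.
raw = to_fixed_4_4(1.3125)
assert format(raw, "08b") == "00010101"
assert from_fixed_4_4(raw) == 1.3125

# 1.3122 and 1.3128 both land on the same nearest 1/16 (1.3125),
# so the original distinction is lost after storage.
assert from_fixed_4_4(to_fixed_4_4(1.3122)) == 1.3125
assert from_fixed_4_4(to_fixed_4_4(1.3128)) == 1.3125
```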

Fixed-point numbers are uncommon in contemporary use (usually requiring add-on libraries in most languages, and often offering significantly lower performance given the lack of hardware support). Instead we use floating-point numbers, which, as the name declares, have a point that moves around depending upon the needs of the number stored.

## Let Your Float Bit Slide 🎶

A floating-point number has, as the name states, a *floating* point. This entails storing a separate
value — reserving some of the storage bits — that denotes the shift of the exponents, allowing the core
value to vary in magnitude from larger numbers than the value space could traditionally hold, to much
smaller fractional numbers.

The larger in magnitude the number stored gets, the coarser and less precise it will be.

There are two types of floating-point numbers in common use-

- **Double precision** (binary64) holds the exponent offset, the fraction, and the sign bit in 64 bits. Every number in standard JavaScript is a double-precision floating-point number.
- **Single precision** (binary32) holds the exponent offset, the fraction, and the sign bit in 32 bits. When performance is critical, single-precision floating-point numbers come to the fore: something like AVX2 can multiply and add eight single-precision numbers with eight other single-precision numbers in a single operation per AVX unit, and some CPUs have two AVX units per core.

Increasingly there is also a third: **half precision** (binary16), holding the exponent offset, fraction, and sign in 16 bits.
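As a quick sanity check, Python's standard `struct` module supports all three formats (the `e`, `f`, and `d` format codes), making their sizes easy to confirm:

```python
import struct

# Byte sizes of the three IEEE 754 formats discussed above.
assert struct.calcsize("e") == 2   # half precision   (binary16)
assert struct.calcsize("f") == 4   # single precision (binary32)
assert struct.calcsize("d") == 8   # double precision (binary64)
```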

### Half-Precision Floating-Point Numbers

We'll start with the half-precision floating-point number. This type is increasingly used because it offers satisfactory precision for some deep learning applications, and on appropriate hardware can significantly improve performance and reduce memory usage.

Before we get into that, one basic about floating-point numbers: they have an implicit leading binary 1. If a floating-point value had only 3 value/fraction bits and they were set to 000, the actual stored significand would be 1000 in binary, courtesy of this implicit leading bit.

To explain the structure of a floating-point number, a binary32 — aka single-precision —
floating-point number has 23 mantissa bits (the actual value, sometimes called the fraction) and an **implicit
additional top bit of 1** as mentioned, ergo 24 bits defining the value. These are the bottom 23
bits in the value: bits 0-22.

The exponent shift of a single-precision value occupies 8 bits, stored with a bias: an encoded value of 127 (01111111) means a shift of 0, such that the exponent shift = encoded value - 127 (below 127 is incrementally negative, above is incrementally positive). A shift of 0 indicates that the binary point lies directly after the implicit leading 1; positive shifts move it successively to the right (larger magnitudes), and negative shifts move it to the left (smaller, fractional magnitudes). The exponent bits sit above the mantissa, occupying bits 23-30.
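The sign/exponent/mantissa layout is easy to verify in Python by packing a float into its binary32 bits and slicing out the three fields (a sketch; the helper name is mine):

```python
import struct

def float32_fields(x: float):
    """Return (sign, biased_exponent, mantissa) of x as a binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31            # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 23-30
    mantissa = bits & 0x7FFFFF       # bits 0-22
    return sign, exponent, mantissa

# 1.0 = +1.0 x 2^0: sign 0, biased exponent 127 (shift of 0), mantissa 0
assert float32_fields(1.0) == (0, 127, 0)

# -0.75 = -1.5 x 2^-1: sign 1, biased exponent 126 (127 - 1)
sign, exp, mant = float32_fields(-0.75)
assert (sign, exp) == (1, 126)
assert mant == 1 << 22   # implicit 1 plus a top fraction bit of 1 gives 1.5
```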

At the very top, the most significant bit, lies a flag indicating whether the value is negative. Unlike the two's-complement encoding seen in pure integers, floating-point numbers use sign-magnitude: flipping this single bit simply negates the value. This is bit 31.

*“But how can a floating-point value hold 0 if the high bit of the value/mantissa/fraction is always
1?”*

If *all* bits are set to 0 (the flag, exponent shift, and the value) it represents 0, and if just the flag is 1 it represents -0. If the exponent shift is all 1s, it indicates either Inf (when the fractional portion is all zeros) or NaN (when it isn't). Those are the magic numbers of floating point. (An all-zeros exponent shift with a nonzero fraction denotes a subnormal number, where the implicit leading 1 is dropped; a corner case we'll leave aside.)
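Those special bit patterns can be checked directly in Python (a sketch; `f32_bits` is my own helper):

```python
import math
import struct

def f32_bits(x: float) -> int:
    """Return the raw 32-bit pattern of x as a binary32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

assert f32_bits(0.0)  == 0x00000000    # all bits zero: +0
assert f32_bits(-0.0) == 0x80000000    # only the sign flag set: -0

inf_bits = f32_bits(math.inf)
nan_bits = f32_bits(math.nan)
assert (inf_bits >> 23) & 0xFF == 0xFF   # Inf: exponent all 1s...
assert inf_bits & 0x7FFFFF == 0          # ...fraction all zeros
assert (nan_bits >> 23) & 0xFF == 0xFF   # NaN: exponent all 1s...
assert nan_bits & 0x7FFFFF != 0          # ...fraction nonzero
```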

Let’s look at a floating-point number, starting with one holding the integer value 65535, with no fractional component.

Stored as a single-precision value, 65535 uses the implicit leading 1 plus 15 stored mantissa bits for its integer portion, with an exponent shift of 15. Note that to use a shift in an actual single-precision value you need to add 127 to it (e.g. 10 would become 137, and -10 would be 117). With a magnitude of 65535 (the integer portion occupying 15 real bits and the 1 implicit bit, which isn't actually stored in the value) only 8 fractional mantissa bits remain, so the max precision is 1/256th.

If instead we stored 255, the precision jumps to 1/65536. The precision is dictated by the magnitude of the value.

To present an extreme example, what if we represented the population of the Earth?

Precision has dropped to 2^{9}, or 512. Only increments of 512 can be stored.
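These magnitude-dependent gaps can be measured directly by stepping to the next representable binary32 value (a Python sketch; `next_up_f32` is my own helper, not a library function):

```python
import struct

def next_up_f32(x: float) -> float:
    """The next representable binary32 value above a positive, finite x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

def spacing_f32(x: float) -> float:
    """Gap between x and the next binary32 value up."""
    return next_up_f32(x) - x

assert spacing_f32(255.0) == 1 / 65536    # 8 integer bits, 16 fraction bits
assert spacing_f32(65535.0) == 1 / 256    # 16 integer bits, 8 fraction bits
assert spacing_f32(8e9) == 512.0          # Earth-population scale: steps of 512
```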

More recently the industry has seen great interest in half-precision floating-point values, particularly in compute and neural-net calculations. Half-precision floating-point values offer a very limited range and precision, but fit in just 16 bits: 1 sign bit, 5 exponent bits, and 10 stored mantissa bits (plus the implicit bit).

That’s the basic operation of floating-point numbers: a set of value bits whose exponent range can be shifted. Double-precision (binary64) floating points up the value/fraction storage to 53 bits (52 stored, plus the 1 intrinsic bit) and the exponent shift to 11 bits, offering far greater precision and/or magnitude and coming much closer to the mathematician's ideal when representing small-scale numbers. I am not going to diagram such a number here as it would exceed the bounds of reader screens.
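For doubles, Python exposes the gap between adjacent values directly via `math.ulp` (available since Python 3.9):

```python
import math

# The gap between adjacent doubles just above 1.0 is 2^-52.
assert math.ulp(1.0) == 2 ** -52

# At 2^53 the gap between adjacent doubles reaches 2, which is why
# whole numbers are only guaranteed exact up to 2^53.
assert math.ulp(2.0 ** 53) == 2.0
```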

Hopefully this has helped.

## Basic Things You Should Know About Floating-Point Numbers

- Single-precision floating-point numbers can *precisely* hold whole numbers from -16,777,215 to 16,777,215, with zero ambiguity. Many users wrongly conclude from fractional variances that all numbers are approximations, but many (including fractional values that fit within the magnitude, e.g. 0.25, 0.75, 0.0042572021484375, etc.) can be stored precisely. The key is that the number be the decimal representation of a fraction whose denominator is a power of 2, lying within the precision band of the given magnitude.
- Double-precision floating-point numbers can precisely hold whole numbers from -9,007,199,254,740,991 to 9,007,199,254,740,991. You can easily calculate the precision available at a given magnitude (e.g. if the magnitude is between 1 and 2, the precision is within 1/4503599627370496), which for the vast majority of uses is well within any reasonable bounds.
- Every number in JavaScript is a double-precision floating point. Your counters, `const`s, and other seeming integers are DPs. If you use bitwise operations, the value is temporarily treated as a 32-bit integer as a hack. Some decimal libraries layer atop this to present very inefficient, but sometimes necessary, decimal representations.
- Decimal types can be mandated in some domains, but represent a *dramatic* speed compromise (courtesy of the reality that our hardware is extremely optimized for floating-point math). With some analysis of the precision needed for a given task, and intelligent rounding rules, double-precision is more than adequate for most purposes. There are scenarios where you can pursue a hybrid approach: in an extremely high Internal Rate of Return calculation I use SP to get to an approximate solution, and then decimal math (the final, smallest leg) to get an absolutely precise solution.
- On most modern processors, double-precision calculations run at approximately half the speed of single-precision calculations (presuming that you’re using SIMD, where an AVX unit may do 8 DP calculations per cycle, or 16 SP calculations per cycle). Half-precision calculations, however, offer no speed advantage beyond reducing the memory throughput and footprint necessary; the instructions to pack and unpack binary16 are a relatively recent addition.
- On most GPUs, double-precision calculations are dramatically slower than single-precision calculations. While most CPUs have floating-point units that perform single-precision calculations on double-precision (or wider) hardware, with SIMD to do many calculations at once, GPUs were built for single-precision calculations and use entirely separate hardware for double-precision. That DP hardware is often in short supply (most GPUs offer 1/24th to 1/32nd the number of DP units, though a very small number of expensive units offer good DP performance). On the flip side, most GPUs use SIMD on single-precision hardware to do multiple half-precision calculations at once, offering the best performance of all.
- Some very new compute-focused devices offer spectacular DP performance. Nvidia's GP100 offers 5 TFLOPS of DP calculations, about 10 TFLOPS of SP calculations, and 20 TFLOPS of half-precision calculations. These are incredible new heights.
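The whole-number limits in the first two bullets are easy to demonstrate. Python floats are binary64 natively; binary32 behavior can be simulated by round-tripping through `struct`:

```python
import struct

def to_f32(x: float) -> float:
    """Round-trip x through binary32 to see what a single actually stores."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Single precision: whole numbers are exact up to 2^24, then gaps appear.
assert to_f32(16_777_215.0) == 16_777_215.0   # 2^24 - 1: stored exactly
assert to_f32(16_777_217.0) == 16_777_216.0   # 2^24 + 1: rounds away

# Double precision: whole numbers are exact up to 2^53, then gaps appear.
assert 9_007_199_254_740_991.0 == 9_007_199_254_740_991   # 2^53 - 1: exact
assert 2.0 ** 53 + 1 == 2.0 ** 53                         # 2^53 + 1: rounds away
```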