If computers were to use base 10, they would have to work with ten different inputs: the digits 0 through 9. A major problem with using base 10 in computers is that each digit would have to be represented by a distinct voltage level, and raising voltage costs energy. Therefore, using only two voltage levels to represent 1 and 0 conserves far more energy than using ten different voltages.

Obviously, computers need a way of presenting binary numbers as decimal numbers, as not everyone can read binary. Therefore, a computer must internally convert a binary number to decimal form. For positive whole numbers this is rather simple: each binary digit is multiplied by 2 raised to the power of its index, and these products are all added together to get the decimal equivalent. Say you have the binary number 101111101000101. The rightmost digit sits at the “zeroth” index, while the leftmost digit sits at the fourteenth index in this case. When converting to decimal, we start at the lowest index, which is the rightmost index. So the first calculation would be 1 * 2^0, as this is the zeroth index and it holds a 1. Next, we move to the first index, whose value is 0, so the second calculation would be 0 * 2^1. This pattern continues for every binary digit.

After adding (1 * 2^14) + (0 * 2^13)...you end up getting 24,389. That means that 101111101000101 in base 2 is equivalent to 24,389 in base 10.
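This positional sum takes only a few lines of Python; a minimal sketch (the function name is my own, just for illustration):

```python
# Sum digit * 2**index over the binary digits, with index 0 at the rightmost digit.
def binary_to_decimal(bits: str) -> int:
    total = 0
    for index, digit in enumerate(reversed(bits)):
        total += int(digit) * 2 ** index
    return total

print(binary_to_decimal("101111101000101"))  # 24389
```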

Why is this important to floating point numbers? Well, binary to decimal conversion is not as pretty when dealing with non-whole numbers. The concept is exactly the same: the digits that come after the point in binary are essentially held at negative indices. So, if you have the binary number 101.011, then the 1 right before the point is at the zeroth index, the 0 right after the point is at the -1 index, the 1 to the right of that is at the -2 index, and so on. A problem occurs, however, when a base 10 decimal requires an infinite number of indices to represent in binary. For example, let’s say someone wants to do a calculation that involves the number 1.3. This number is very easy to represent in base 10, as it only requires two digits. To represent it in binary, however, requires an infinite number of indices. The reason lies in how non-whole numbers in base 10 are converted to base 2.
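Negative indices extend the same positional sum; a small sketch (helper name is mine) that evaluates a binary string containing a point:

```python
# Digits left of the point have indices 0, 1, 2, ... (right to left);
# digits right of the point have indices -1, -2, ... (left to right).
def binary_point_to_decimal(bits: str) -> float:
    whole, _, frac = bits.partition(".")
    total = 0.0
    for index, digit in enumerate(reversed(whole)):
        total += int(digit) * 2 ** index
    for index, digit in enumerate(frac, start=1):
        total += int(digit) * 2 ** -index
    return total

print(binary_point_to_decimal("101.011"))  # 5.375
```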

From the visual above, you can see that when converting a base 10 non-whole number to base 2, you take the fractional portion and multiply it by 2. If that product is 1 or above, then the next binary index is 1, and you subtract 1 from the product before continuing; if it is below 1, then that index is 0. So for 1.3 in base 2, the 1 is still at the zeroth index, then comes the point. For the -1 index, 0.3 is multiplied by 2 to get 0.6, which is below 1, so the -1 index is 0. Then 0.6 is multiplied by 2 to get 1.2, so the -2 index is 1 and the leftover 0.2 is carried forward, and this continues on and on. The only way this stops is if the product ends up being exactly 1. If it does, then that index is 1 and the pattern can stop, as every index that comes after is known to be 0. However, this never occurs for 1.3. The product never lands exactly on 1, causing an infinite loop. Since we don’t have an infinite number of bits to represent 1.3 in binary notation, it must be truncated at a certain index.
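The doubling procedure above can be sketched in Python; I use exact `Fraction` arithmetic here so the demonstration isn’t muddied by the very rounding this article is about:

```python
from fractions import Fraction

# Repeatedly double the fractional part; the whole-number carry of each
# doubling is the next bit. Stop early if the fraction reaches exactly 0.
def fraction_bits(value: Fraction, max_bits: int) -> str:
    bits = []
    for _ in range(max_bits):
        value *= 2
        if value >= 1:
            bits.append("1")
            value -= 1
        else:
            bits.append("0")
        if value == 0:  # expansion terminated; all later bits are 0
            break
    return "".join(bits)

print(fraction_bits(Fraction(3, 10), 24))  # 0.3 never terminates, so it is cut off
print(fraction_bits(Fraction(1, 4), 24))   # 0.25 terminates after two bits: "01"
```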

And that is where floating point numbers come into play. Obviously, computers have a set memory that cannot exceed a certain limit. Many modern computers are either 32-bit or 64-bit systems, and I will primarily be focusing on 32-bit. In the standard 32-bit floating point format, 24 bits are allocated for the number’s significant digits (the significand). If any number of bits were allowed to represent non-whole numbers, we would run out of memory quite quickly because, as shown by the example above, certain numbers must be represented with an infinite number of bits. So, what occurs with 1.3 within the computer is that only the indices up to -24 are actually calculated and anything after that is disregarded. A computer with a 32-bit system will therefore represent 1.3 as 1.010011001100110011001100 in binary. And that, my friends, is a floating point number.

Floating point numbers sacrifice accuracy for speed of calculation and range. The reason they are called “floating point” numbers is that the point can move arbitrarily. For example, while 1.3 may be represented as 1.010011001100110011001100, the same digit pattern 10100.11001100110011001100 is equivalent to 20.799999237060546875 in base 10.

Now, the reason I say that they sacrifice accuracy is that truncation reduces the number of bits that need to be allocated for the number but, in turn, leaves out indices that make up the actual number. If we were to convert 1.010011001100110011001100 back to decimal, we would actually get 1.2999999523162841796875, not 1.3. As you can see, the answer is essentially correct, but technically not exact. This is why floating point rounding errors can occur when doing basic addition: accuracy must be sacrificed for speed and storage. If you want to add 1.3 and 0.6, then both must first be converted into binary, so in a 32-bit system it would be 1.010011001100110011001100 + 0.100110011001100110011001. The result would be close to 1.9, but not exactly 1.9, since truncation had to occur.
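You can watch this truncation happen from ordinary Python (whose floats are 64-bit) by forcing values through the 32-bit format with the standard `struct` module; a minimal sketch:

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float through the 32-bit format and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = to_float32(1.3)
b = to_float32(0.6)
print(f"{a:.25f}")               # 1.2999999523162841796875000
print(to_float32(a + b) == 1.9)  # False: the 32-bit sum is not exactly 1.9
```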

As stated before, the major reason why floating point numbers are preferred over a system like fixed-point representation is processing speed and memory. With floating point numbers, it is possible to represent a far greater range of numbers before overflow occurs, because fewer bits are needed to represent each number.
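As a rough illustration of that range, the sketch below (the bit pattern and the comparison are my own) reads out the largest finite 32-bit float and compares it with the largest value a 32-bit unsigned integer can hold:

```python
import struct

# 0x7F7FFFFF is the bit pattern of the largest finite IEEE 754 single-precision
# value; a 32-bit unsigned integer tops out at 2**32 - 1. Same 32 bits of
# storage, vastly different range.
float32_max = struct.unpack("f", struct.pack("I", 0x7F7FFFFF))[0]
uint32_max = 2 ** 32 - 1

print(float32_max)  # about 3.4e38
print(uint32_max)   # 4294967295
```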

Hopefully, you now understand a bit more about floating point numbers and how computers deal with numbers internally. If you are looking to broaden your knowledge even further, I suggest watching this excellent video by Computerphile, which goes into more detail about floating point numbers and their applications.