Internal representation of floating point numbers (ANSI/IEEE 754)
This paper explains by example how floating point numbers are represented internally (e.g. the float data type in C). This representation is specified in the IEEE Standard for Floating-Point Arithmetic (IEEE 754). In my opinion it is a must-know for programmers. I was asked about this topic and decided to write a paper instead of explaining it via PM.
A 4-Byte floating point number, also known as single precision, has the following representation:
sign  | characteristic (c) | significand (s) |
1 bit | 8 bits             | 23 bits         |
where:
exponent = c - 127
(don't worry about this yet, it is just a definition we use later)
The following formula is used to get the number as we know it:
(-1)^(sign) * 2^(c-127) * ((1.s)(bin))
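To make the layout concrete, the three fields can be pulled out of a float with a few shifts and masks. This is only a minimal sketch, assuming that float is a 32-bit IEEE 754 type on your platform (true on practically all current systems, but not guaranteed by the C standard). The names float_fields and split_float are hypothetical helpers made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single precision number. */
typedef struct {
    uint32_t sign; /* 1 bit */
    uint32_t c;    /* 8 bit characteristic */
    uint32_t s;    /* 23 bit significand field */
} float_fields;

/* Split a float into sign, c and s.
   Assumption: float is a 32-bit IEEE 754 single precision number. */
static float_fields split_float(float x)
{
    uint32_t bits;
    float_fields f;

    /* memcpy copies the raw bits without violating aliasing rules */
    memcpy(&bits, &x, sizeof bits);

    f.sign = bits >> 31;           /* highest bit */
    f.c    = (bits >> 23) & 0xFFu; /* next 8 bits */
    f.s    = bits & 0x7FFFFFu;     /* lowest 23 bits */
    return f;
}
```

For instance, split_float(1.0f) gives sign = 0, c = 127 and s = 0, because 1.0 = (-1)^0 * 2^(127-127) * (1.0)(bin).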
Example for conversion from decimal to IEEE 754:
-19.625 (dec)
You set sign = 1, because it is negative
Now you need the exponent. First convert -19.625 to binary (19 = 16 + 2 + 1 and 0.625 = 1/2 + 1/8):
-19.625 (dec) = -10011.101(bin)
Now you can see that you need exponent = 4 to move the point behind the first digit:
-10011.101(bin) = -1.0011101(bin) * 2^4
Calculate the characteristic from the exponent:
4 = c - 127
c = 131 = (10000011)(bin)
Extract s from 1.0011101(bin)
1.s = 1.0011101
s = 0011101 ... fill the rest with 0 (s has 23 bits)
Thus our internal representation is:
sign + c + s
11000001100111010000000000000000
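The steps above can be retraced in C. The following is only a minimal sketch, assuming that float is a 32-bit IEEE 754 type and that the input is a normalised, finite, non-zero value (zero, infinity and NaN are not handled and would not terminate). encode_by_hand and bits_to_string are hypothetical helper names, not library functions.

```c
#include <stdint.h>

/* Build the 32 bit pattern by following the manual steps:
   sign, exponent, characteristic c, significand field s.
   Assumption: x is normalised, finite and non-zero. */
static uint32_t encode_by_hand(float x)
{
    uint32_t sign = (x < 0.0f) ? 1u : 0u;
    float m = (x < 0.0f) ? -x : x; /* work with the absolute value */
    int exponent = 0;

    /* move the binary point until m has the form 1.s (1 <= m < 2) */
    while (m >= 2.0f) { m /= 2.0f; exponent++; }
    while (m < 1.0f)  { m *= 2.0f; exponent--; }

    uint32_t c = (uint32_t)(exponent + 127);             /* exponent = c - 127 */
    uint32_t s = (uint32_t)((m - 1.0f) * 8388608.0f);    /* fraction * 2^23 */

    return (sign << 31) | (c << 23) | s;
}

/* Write the 32 bits as a string of '0' and '1', highest bit first. */
static void bits_to_string(uint32_t bits, char out[33])
{
    for (int i = 0; i < 32; i++)
        out[i] = ((bits >> (31 - i)) & 1u) ? '1' : '0';
    out[32] = '\0';
}
```

Calling encode_by_hand(-19.625f) and passing the result to bits_to_string reproduces the bit string 11000001100111010000000000000000 derived by hand.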
Use this website to practise and understand:
http://www.h-schmidt.net/FloatConverter/IEEE754.html
Example for conversion from IEEE 754 back to decimal:
Now let's work through an example that converts a number from IEEE 754 representation back to decimal:
01000001101110011000001100010010
You just have to use the formula given above
First bit is the sign = 0, so it is positive
Next 8 bits are c: c = 10000011(bin) = 131
The rest is s: s = 01110011000001100010010
Prepend a 1 (which is implicit): 1.s = 1.01110011000001100010010
Using the formula, the number is: (-1)^0 * 2^(131-127) * (1.01110011000001100010010)(bin) = 2^4 * (1.01110011000001100010010)(bin) ≈ 23.189
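The same decoding can be sketched in C. This is a minimal illustration for the normalised case only (0 < c < 255); parse_bits and decode_by_hand are hypothetical helper names. The result is computed in double so that no rounding happens along the way.

```c
#include <stdint.h>

/* Turn a string of 32 characters '0'/'1' into a 32 bit pattern. */
static uint32_t parse_bits(const char *str)
{
    uint32_t bits = 0;
    for (int i = 0; i < 32 && str[i] != '\0'; i++)
        bits = (bits << 1) | (uint32_t)(str[i] == '1');
    return bits;
}

/* Apply (-1)^sign * 2^(c-127) * (1.s)(bin).
   Assumption: normalised case only, i.e. 0 < c < 255. */
static double decode_by_hand(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    int c = (int)((bits >> 23) & 0xFFu);
    uint32_t s = bits & 0x7FFFFFu;

    double value = 1.0 + (double)s / 8388608.0; /* 1.s, with 8388608 = 2^23 */
    int exponent = c - 127;

    /* multiply or divide by 2 once per exponent step */
    for (; exponent > 0; exponent--) value *= 2.0;
    for (; exponent < 0; exponent++) value /= 2.0;

    return sign ? -value : value;
}
```

With the bit string from this example, decode_by_hand(parse_bits("01000001101110011000001100010010")) returns a value of about 23.189.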
Special cases:
The examples above cover only the normalised case.
For very small values (denormalised numbers) the representation changes, so that they can still be represented.
If c = 0 and s = 0: The value is 0
If c = 0 and s != 0: The formula is (-1)^(sign) * 2^(-126) * (0.s)(bin)
Example (sign = 0, c = 0, s = 00000000000000000000001):
2^-126 * 2^-23 = 2^-149 ≈ 1.4013 * 10^-45
This is the smallest possible positive number.
If c = 255 and s = 0: "(-1)^(sign) * ∞" (--> -∞ or ∞)
If c = 255 and s != 0: NaN (not a number, e.g. the result of sqrt(-2), 0/0 or 0 * ∞)
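The case distinction above can be written as a small classifier. Again this is only a sketch under the assumption that float is a 32-bit IEEE 754 type; float_kind, classify_bits and bits_to_float are hypothetical names chosen for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the cases listed above. */
typedef enum {
    KIND_ZERO,
    KIND_DENORMALISED,
    KIND_NORMALISED,
    KIND_INFINITY,
    KIND_NAN
} float_kind;

/* Decide the case from c and s alone; the sign bit does not matter here. */
static float_kind classify_bits(uint32_t bits)
{
    uint32_t c = (bits >> 23) & 0xFFu;
    uint32_t s = bits & 0x7FFFFFu;

    if (c == 0)
        return (s == 0) ? KIND_ZERO : KIND_DENORMALISED;
    if (c == 255)
        return (s == 0) ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMALISED;
}

/* Reinterpret a 32 bit pattern as a float.
   Assumption: float is a 32-bit IEEE 754 single precision number. */
static float bits_to_float(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, the pattern 0x00000001 (c = 0, s = 1) is classified as denormalised, and bits_to_float(0x00000001u) is the smallest positive number 2^-149 from the example above.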
Sources: Study notes from university