Author Topic: [Paper] Internal representation of floating point numbers (ANSI/IEEE 754) (Read 4869 times)

Deque · « **on:** February 16, 2013, 07:10:51 pm »

Internal representation of floating point numbers (ANSI/IEEE 754)

This paper explains by example how floating point numbers are represented internally (e.g. the float data type in C). This representation is specified in the IEEE Standard for Floating-Point Arithmetic (IEEE 754). In my opinion it is a must-know for programmers. I was asked about this topic and decided to write a paper instead of explaining it via PM.

A 4-byte (32-bit) floating point number, also known as single precision, has the following representation:

sign	characteristic (c)	significand bits (s)
1 bit	8 bits	23 bits

where:
exponent = c - 127
(don't worry about that for now, it is just a definition we use later)
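To look at the three fields of a real value, here is a small Python sketch (my own illustration, not taken from the standard). It stores a number as a 4-byte float with the struct module and cuts the 32 bits into sign, c and s. The function name float_fields is made up for this example.

```python
import struct

def float_fields(x):
    """Return (sign, c, s) for x stored as an IEEE 754 single precision float."""
    # Pack x as a big-endian 4-byte float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31          # top bit
    c = (bits >> 23) & 0xFF    # next 8 bits: characteristic
    s = bits & 0x7FFFFF        # low 23 bits: significand bits
    return sign, c, s

sign, c, s = float_fields(1.0)
print(sign, format(c, "08b"), format(s, "023b"), "exponent =", c - 127)
```

For 1.0 the characteristic is 127, so the exponent c - 127 is 0.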

The following formula is used to get the number as we know it:

(-1)^(sign) * 2^(c-127) * (1.s)(bin)

Example for conversion from decimal to IEEE 754:
-19.625 (dec)

You set sign = 1, because it is negative

Now you need the exponent. First convert -19.625 to binary:
-19.625 (dec) = -10011.101(bin)
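If you want to check the binary conversion yourself, this is a minimal Python sketch of the usual method: convert the whole part normally, then keep doubling the fractional part and note the digit that moves in front of the point. The helper name to_binary and the cap of 23 fractional digits are assumptions I picked for the example.

```python
def to_binary(x, max_frac_bits=23):
    """Return a binary string such as '-10011.101' for a finite number x."""
    sign = "-" if x < 0 else ""
    x = abs(x)
    whole = int(x)
    frac = x - whole
    digits = []
    # Doubling the fraction shifts the next binary digit in front of the point.
    while frac and len(digits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        digits.append(str(bit))
        frac -= bit
    result = sign + format(whole, "b")
    return result + "." + "".join(digits) if digits else result

print(to_binary(-19.625))
```

Fractions like 0.1 never terminate in binary, which is why the loop stops after a fixed number of digits.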

Now you can see that you need exponent = 4 to move the binary point directly behind the leading 1:
-10011.101(bin) = -1.0011101(bin) * 2^4

Calculate the characteristic from the exponent:

4 = c - 127
c = 131 = (10000011)(bin)

Extract s from 1.0011101(bin):
1.s = 1.0011101
s = 0011101, padded on the right with zeros (s has 23 bits)

Thus our internal representation is sign, c and s written one after another:
1 | 10000011 | 00111010000000000000000
11000001100111010000000000000000
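The steps of this example can be sketched in Python as follows. This is only an illustration under some assumptions: the name encode_single is made up, the input must be nonzero, finite and in the normalised range, and the significand is truncated instead of rounded, so it only matches the real encoding for values that fit exactly (like -19.625). The last line compares against what struct produces.

```python
import struct

def encode_single(x):
    """Hand-encode x as a 32-bit string (normalised, exactly representable values only)."""
    sign = 1 if x < 0 else 0
    m = abs(x)
    exponent = 0
    # Move the binary point until m lies in [1, 2).
    while m >= 2:
        m /= 2
        exponent += 1
    while m < 1:
        m *= 2
        exponent -= 1
    c = exponent + 127
    s = int((m - 1) * 2**23)  # the 23 bits behind the implicit leading 1 (truncated)
    return format(sign, "b") + format(c, "08b") + format(s, "023b")

by_hand = encode_single(-19.625)
(raw,) = struct.unpack(">I", struct.pack(">f", -19.625))
print(by_hand)
print(by_hand == format(raw, "032b"))
```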

Use this website to practise and to check your results: http://www.h-schmidt.net/FloatConverter/IEEE754.html

Example for conversion from IEEE 754 back to decimal:

Now let's convert a number from its IEEE 754 representation back to decimal:
01000001101110011000001100010010
You just have to use the formula given above.

First bit is the sign = 0, so it is positive

Next 8 bits are c: c = 10000011(bin) = 131

The rest is s: s = 01110011000001100010010
Prepend a 1 (which is implicit): 1.s = 1.01110011000001100010010

Using the formula, the number is: (-1)^0 * 2^(131-127) * (1.01110011000001100010010)(bin) = 2^4 * (1.01110011000001100010010)(bin) ≈ 23.189
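Here is the same decoding as a short Python sketch, again just my illustration of the formula for the normalised case. The function name decode_normalised is an assumption, and the function refuses c = 0 and c = 255 because those are handled differently. The result is cross-checked against struct.

```python
import struct

def decode_normalised(bits):
    """Apply (-1)^sign * 2^(c-127) * (1.s)(bin) to a 32-character bit string."""
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    sign = int(bits[0], 2)
    c = int(bits[1:9], 2)
    s = int(bits[9:], 2)
    if c in (0, 255):
        raise ValueError("not a normalised number")
    significand = 1 + s / 2**23  # prepend the implicit 1: 1.s
    return (-1) ** sign * 2.0 ** (c - 127) * significand

pattern = "01000001101110011000001100010010"
value = decode_normalised(pattern)
(check,) = struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))
print(round(value, 3), value == check)
```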

Special cases:

The examples above only cover the normalised case (c between 1 and 254).
For very small values the representation changes (denormalised numbers), so that values closer to zero can still be represented.

If c = 0 and s = 0: The value is 0
If c = 0 and s != 0: The formula is (-1)^(sign) * 2^(-126) * (0.s)(bin)

Example (sign = 0, c = 0, s = 00000000000000000000001):
2^-126 * 2^-23 = 2^-149 ≈ 1.4013 * 10^-45
This is the smallest positive number that single precision can represent.
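The denormalised formula can be tried out with a few lines of Python. This is a sketch under the assumption that c = 0. The name decode_subnormal is made up, and s is passed as an integer between 0 and 2^23 - 1. The bit pattern with only the last bit set is decoded by struct for comparison.

```python
import struct

def decode_subnormal(s, sign=0):
    """Apply (-1)^sign * 2^(-126) * (0.s)(bin) for c = 0."""
    return (-1) ** sign * 2.0 ** -126 * (s / 2**23)

smallest = decode_subnormal(1)
# Bit pattern 000...0001: sign = 0, c = 0, s = 1
(check,) = struct.unpack(">f", (1).to_bytes(4, "big"))
print(smallest, smallest == check)
```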

If c = 255 and s = 0: "(-1)^(sign) * ∞" (--> -∞ or ∞)
If c = 255 and s != 0: NaN (not a number, e.g. the result of sqrt(-2), 0/0 or 0 * ∞)
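All cases together can be told apart by looking only at c and s. The following Python sketch shows one way to do that. The function name classify and the returned labels are my own choice, and the input is the 32-bit pattern as an integer.

```python
import math
import struct

def classify(bits):
    """Name the IEEE 754 single precision case for a 32-bit pattern."""
    c = (bits >> 23) & 0xFF
    s = bits & 0x7FFFFF
    if c == 0:
        return "zero" if s == 0 else "denormalised"
    if c == 255:
        return "infinity" if s == 0 else "NaN"
    return "normalised"

for pattern in (0x00000000, 0x00000001, 0xC19D0000, 0xFF800000, 0x7FC00000):
    (value,) = struct.unpack(">f", pattern.to_bytes(4, "big"))
    print(format(pattern, "032b"), classify(pattern), value)
```

0xC19D0000 is the pattern for -19.625 from the first example, written in hex.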

Sources: Study notes from university
