Floating Point Examples and Problems

Here is how to convert numbers into IEEE floating point format. The 32-bit floating point format, called float in C, has 1 sign bit, 8 exponent bits, and 23 mantissa bits. The 64-bit format, double, has 1 sign bit, 11 exponent bits, and 52 mantissa bits.

Example 1: Convert to float

Convert 14564 to a float.
 14564 = 8192 + 6372 = 8192 + 4096 + 2276 = 8192 + 4096 + 2048 + 228
    228 = 128 + 64 + 32 + 4
14564 = 11100011100100 base 2
We move the binary point (the base-2 counterpart of the decimal point) to just after the first digit,
by dividing by 2^13 (moving it 13 places to the left)
   = 1.1100011100100 base 2   times 2^13
The mantissa is 1.1100011100100 base 2 and the exponent is 13. We encode these into the 23-bit mantissa field and the 8-bit exponent field. We create the 23-bit mantissa field by adding 0s to the right end of the mantissa until it has 24 digits, then dropping the 1 in front of the binary point.
1.11000111001000000000000, mantissa field is 11000111001000000000000
We create the exponent field by adding the bias 127 (which is 01111111 base 2) to the exponent, and writing the result in binary:
13 + 127 = 140 = 10001100 base 2
This can also be computed by adding 13-1 = 12 to 128 in binary:
10000000 + 1100 = 10001100 base 2, since 13 + 127 = 13 + 128 - 1 = 128 + (13-1)
The mantissa field always has 23 bits, and the exponent field has 8 bits. Leading zeros are always kept. Finally, a sign bit of 0, the exponent field, and the mantissa field are all concatenated into a single 32-bit binary number:
0  10001100  11000111001000000000000
01000110011000111001000000000000
0100 0110 0110 0011 1001 0000 0000 0000
0x46639000
The final answer is 0x46639000, the hexadecimal representation of the 32-bit floating point representation of the number 14564.
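
The steps of Example 1 can be sketched in Python. This is only an illustration under stated assumptions: the function name int_to_float_bits is made up for this sketch, and it handles only positive integers below 2^24, so that no rounding is needed. The last lines cross-check the result against the encoding produced by Python's standard struct module.

```python
import struct

def int_to_float_bits(n: int) -> int:
    """Encode a positive integer as a 32-bit IEEE float bit pattern.

    Hypothetical helper for illustration. Assumes 1 <= n < 2**24,
    so every significant bit fits in the mantissa without rounding.
    """
    if not 1 <= n < 2**24:
        raise ValueError("sketch only handles 1 <= n < 2**24")
    exponent = n.bit_length() - 1                 # position of the leading 1
    fraction = n - (1 << exponent)                # drop the leading 1
    mantissa_field = fraction << (23 - exponent)  # pad with 0s on the right to 23 bits
    exponent_field = exponent + 127               # add the bias
    sign = 0                                      # positive number
    return (sign << 31) | (exponent_field << 23) | mantissa_field

bits = int_to_float_bits(14564)
print(f"{bits:032b}")
print(f"0x{bits:08X}")  # 0x46639000

# Cross-check against the machine's own single-precision encoding.
(reference,) = struct.unpack(">I", struct.pack(">f", 14564.0))
assert bits == reference
```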

Example 2: A negative number as a double

Let's convert -201 to its 64-bit floating point representation. To do this we will represent 201 as a double, then simply flip the sign bit (floating point uses a separate sign bit, not two's complement notation). Doubles have an 11-bit exponent field and a 52-bit mantissa field. I use "exponent field" and "mantissa field" for the bit patterns that are stored, and "exponent" and "mantissa" for the actual mathematical numbers, before the bias is added or the leading 1 is stripped off.
201 = 128 + 73 = 128 + 64 + 9 = 128 + 64 + 8 + 1 = 11001001 base 2
11001001 base 2 = 1.1001001 times 10000000 base 2 = 1.1001001 base 2 times 2^7
exponent = 7, mantissa = 1.1001001
We add the bias, 01111111111 base 2, to the exponent, 111 base 2, to get the exponent field.  I will add 10000000000 to 111 and subtract 1, getting
10000000110 as the exponent field.
I add 0s to the end of the mantissa, getting
1.1001001000000000000000000000000000000000000000000000
and strip off the initial 1 to get the 52-bit mantissa field:
1001001000000000000000000000000000000000000000000000
I combine a sign bit of 1 (negative) with the exponent field and the mantissa field to get
1  10000000110  1001001000000000000000000000000000000000000000000000
= 1100000001101001001000000000000000000000000000000000000000000000
= 1100 0000 0110 1001 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
= 0xC069200000000000
The hex representation of the double representation of -201 is 0xC069200000000000.
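
One way to check a result like this is to go in the other direction and decode the bit pattern by hand. The following Python sketch does that; the function name decode_double is invented for this illustration, and it assumes a normalized number (it does not handle zero, denormals, infinities or NaN). It uses exact fractions so that no rounding hides a mistake.

```python
from fractions import Fraction

def decode_double(bits: int) -> Fraction:
    """Decode a normalized 64-bit IEEE double bit pattern into its exact value.

    Hypothetical helper for illustration only.
    """
    sign = bits >> 63                          # 1 sign bit
    exponent_field = (bits >> 52) & 0x7FF      # 11 exponent bits
    mantissa_field = bits & ((1 << 52) - 1)    # 52 mantissa bits
    if exponent_field in (0, 0x7FF):
        raise ValueError("sketch handles normalized numbers only")
    exponent = exponent_field - 1023                   # remove the bias
    mantissa = 1 + Fraction(mantissa_field, 1 << 52)   # restore the leading 1
    value = mantissa * Fraction(2) ** exponent
    return -value if sign else value

print(decode_double(0xC069200000000000))  # -201
```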

Example 3: A fractional number converted to floating point

Convert the number 65.65625 to a float.
Converting a fractional number to floating point representation is trickier. Most decimal fractions have no exact floating point representation. The fraction 1/3 is written 0.33333333... in decimal, and wherever you terminate the decimal, you have a number close to, but not exactly equal to, 1/3. In the same way, 0.1 (decimal) is 0.00011001100110011001100... (binary), which never terminates. I have picked a number that does have an exact binary representation, so the simplest approach is to find the floating point representation of 1024 times that number, which is an integer, and then subtract 10 from the exponent. This way, we find the floating point version of X times 2^10, then divide it by 2^10.
65.65625 * 1024
= 67232 = 65536 + 1696 = 2^16 + 1024 + 512 + 128 + 32
= 10000011010100000 base 2
= 1.0000011010100000 * 2^16
mantissa = 1.00000110101000000000000 exponent = 16
mantissa field = 00000110101000000000000  exponent field = 10001111 (=128 + 16 -1)
floating point of 67232 = 0 10001111 00000110101000000000000
Subtracting 10 from the exponent is the same as subtracting 10 from the exponent field (10001111 - 1010 = 10000101 base 2), which gives the
floating point representation of 65.65625:
 0  10000101 00000110101000000000000
0100 0010 1000 0011 0101 0000 0000 0000
0x42835000
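
The scaling trick of Example 3 can be sketched in Python as follows. The function name fraction_to_float_bits and its shift parameter are assumptions made for this illustration. The sketch only accepts positive numbers that become an integer below 2^24 after multiplying by 2^shift, and it rejects anything else rather than rounding.

```python
from fractions import Fraction

def fraction_to_float_bits(decimal_text: str, shift: int = 10) -> int:
    """Encode a positive decimal string as a 32-bit IEEE float bit pattern.

    Hypothetical helper: multiplies by 2**shift, encodes the resulting
    integer, then subtracts shift from the exponent.
    """
    scaled = Fraction(decimal_text) * 2**shift
    if scaled.denominator != 1 or not 1 <= scaled < 2**24:
        raise ValueError("value is not exactly representable by this sketch")
    n = int(scaled)
    exponent = n.bit_length() - 1                        # exponent of the scaled integer
    mantissa_field = (n - (1 << exponent)) << (23 - exponent)
    exponent_field = exponent + 127 - shift              # add the bias, undo the scaling
    return (exponent_field << 23) | mantissa_field

print(f"0x{fraction_to_float_bits('65.65625'):08X}")  # 0x42835000
```

Calling the function with "0.1" raises ValueError, because 0.1 times 1024 is not an integer; this matches the remark above that 0.1 has no exact binary representation.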

Some problems to work on

Convert the following to floats (32-bit IEEE floating point representation)
  1. 2648
  2. 23.25
  3. -135
  4. 4
  5. -0.01171875
  6. 1234567
Convert these to doubles (64-bit IEEE FP)
  1. 39
  2. 4639.75
  3. -456
  4. 859.0859375
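
To check your own answers, you can ask the machine for its encoding. The sketch below uses Python's standard struct module; the helper names float_hex and double_hex are made up for this illustration. It is shown here on the three worked examples above rather than on the problems.

```python
import struct

def float_hex(x: float) -> str:
    """Hex of the 32-bit IEEE float encoding of x (hypothetical helper)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"0x{bits:08X}"

def double_hex(x: float) -> str:
    """Hex of the 64-bit IEEE double encoding of x (hypothetical helper)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"0x{bits:016X}"

print(float_hex(14564.0))    # 0x46639000
print(double_hex(-201.0))    # 0xC069200000000000
print(float_hex(65.65625))   # 0x42835000
```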

Answers to problems