Floating Point Numbers

Real Numbers: pi = 3.14159265... e = 2.71828...

Scientific Notation: has a single digit to the left of the decimal point.

A number in Scientific Notation with no leading 0s is called a Normalised Number: 1.0 × 10^-8

Not in normalised form: 0.1 × 10^-7 or 10.0 × 10^-9

Can also represent binary numbers in scientific notation: 1.0 × 2^-3

Computer arithmetic that supports such numbers is called Floating Point.

The form is 1.xxxx… × 2^yy…

Using normalised scientific notation

  1. Simplifies the exchange of data that includes floating-point numbers
  2. Simplifies the arithmetic algorithms to know that the numbers will always be in this form
  3. Increases the accuracy of the numbers that can be stored in a word, since each unnecessary leading 0 is replaced by another significant digit to the right of the binary point
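As a rough illustration of binary normalisation, the sketch below rewrites a non-zero number in the form 1.xxxx × 2^y. The helper name normalise_binary is made up for this example; it relies on Python's math.frexp, which returns a significand in the range 0.5 to 1.0, so the result is shifted by one place.

```python
import math

def normalise_binary(x):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1.0 <= abs(significand) < 2.0 (x must be non-zero and finite)."""
    m, e = math.frexp(x)      # frexp gives 0.5 <= abs(m) < 1.0
    return m * 2.0, e - 1     # shift one place left to get the 1.xxxx form

print(normalise_binary(0.125))    # (1.0, -3), i.e. 1.0 × 2^-3
print(normalise_binary(0.4375))   # (1.75, -2), i.e. 1.110 (binary) × 2^-2
```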


Representation of Floating-Point numbers

(-1)^S × M × 2^E

Bit No   Size      Field Name
31       1 bit     Sign (S)
23-30    8 bits    Exponent (E)
0-22     23 bits   Mantissa (M)

A Single-Precision floating-point number occupies 32-bits, so there is a compromise between the size of the mantissa and the size of the exponent.

These chosen sizes provide a range of approx:

± 10^-38 ... 10^38
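One way to see this range in practice is to force a value through a 32-bit encoding and back. The sketch below uses Python's struct module for that; the helper name to_single is made up for this example, and the test values are arbitrary.

```python
import struct

def to_single(x):
    """Round a Python float to single precision and convert it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(to_single(1e38))     # close to 1e38: still inside the range
print(to_single(1e-50))    # 0.0: far too small, underflows to zero

try:
    to_single(1e39)        # too large for the single-precision format
except OverflowError as err:
    print("overflow:", err)
```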

To reduce the chances of underflow/overflow, can use 64-bit Double-Precision arithmetic

Bit No   Size      Field Name
63       1 bit     Sign (S)
52-62    11 bits   Exponent (E)
0-51     52 bits   Mantissa (M)

providing a range of approx

± 10^-308 ... 10^308
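The double-precision limits can be inspected directly, assuming (as is the case on common platforms) that Python's float type is a 64-bit double. This is only a quick check of the approximate range quoted above.

```python
import sys

print(sys.float_info.max)   # largest finite double, roughly 1.8 × 10^308
print(sys.float_info.min)   # smallest normalised double, roughly 2.2 × 10^-308
print(1e308 * 10)           # inf: the product overflows the exponent range
```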


These formats are called ...

IEEE 754 Floating-Point Standard

Since the mantissa is always 1.xxxxxxxxx in the normalised form, no need to represent the leading 1. So, effectively:

  • Single Precision: mantissa ===> 1 bit + 23 bits
  • Double Precision: mantissa ===> 1 bit + 52 bits
  • Since zero (0.0) has no leading 1, it is given a reserved bit pattern, all 0s in the exponent field (and in the mantissa), so that the hardware won't attach a leading 1 to it.

    If we number the mantissa bits from left to right m1, m2, m3, ...

    mantissa = m1 × 2^-1 + m2 × 2^-2 + m3 × 2^-3 + ....
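The weighted sum of mantissa bits can be written out as a short sketch. The function name fraction_value and the use of a string of 0s and 1s as input are choices made for this example only.

```python
def fraction_value(bits):
    """Value of the mantissa bits m1 m2 m3 ... given as a string of 0s and 1s."""
    # Bit number i + 1 (counting from the left) has weight 2^-(i + 1)
    return sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(bits))

print(fraction_value("110"))   # 0.75 = 1 × 2^-1 + 1 × 2^-2 + 0 × 2^-3
```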

    Negative exponents could pose a problem in comparisons.

    For example (with two's complement):

                  Sign  Exponent  Mantissa
    1.0 × 2^-1    0     11111111  0000000 00000000 00000000
    1.0 × 2^+1    0     00000001  0000000 00000000 00000000

    With this representation, the first exponent shows a "larger" binary number, making direct comparison more difficult.

    To avoid this, Biased Notation is used for exponents.

    If the real exponent of a number is X then it is represented as (X + bias)

    IEEE single-precision uses a bias of 127. Therefore, an exponent of

    -1 is represented as -1 + 127 = 126 = 01111110₂
     0 is represented as  0 + 127 = 127 = 01111111₂
    +1 is represented as +1 + 127 = 128 = 10000000₂
    +5 is represented as +5 + 127 = 132 = 10000100₂
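The sketch below prints the biased encoding next to an 8-bit two's complement encoding of the same exponents, to show why the biased form compares correctly as an unsigned bit pattern. The helper names biased and twos_complement are made up for this example.

```python
BIAS = 127  # IEEE single-precision bias

def biased(exponent):
    """Stored 8-bit exponent field for a real exponent, as a bit string."""
    return format(exponent + BIAS, "08b")

def twos_complement(exponent):
    """8-bit two's complement pattern of the same exponent, for comparison."""
    return format(exponent & 0xFF, "08b")

for x in (-1, 0, 1, 5):
    print(x, biased(x), twos_complement(x))

# Biased patterns sort in the same order as the exponents they stand for;
# two's complement patterns, read as unsigned values, do not.
print(biased(-1) < biased(1))                     # True
print(twos_complement(-1) < twos_complement(1))   # False
```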

    So the actual exponent is found by subtracting the bias from the stored exponent. Therefore, given S, E, and M fields, an IEEE floating-point number has the value:

    (-1)^S × (1.0 + 0.M) × 2^(E - bias)

    (Remember: it is (1.0 + 0.M) because, with normalised form, only the fractional part of the mantissa needs to be stored)
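As a minimal sketch of this formula, the function below splits a 32-bit pattern into its S, E and M fields and evaluates the value. The name decode_single is made up for this example, and it assumes a normalised number (stored exponent between 1 and 254), so zero and the other reserved patterns are not handled. The result is cross-checked against Python's struct module.

```python
import struct

def decode_single(bits):
    """Value of a normalised single-precision bit pattern given as a 32-bit int.

    Assumes 1 <= E <= 254; reserved exponent patterns are not handled."""
    s = bits >> 31              # bit 31
    e = (bits >> 23) & 0xFF     # bits 23-30
    m = bits & 0x7FFFFF         # bits 0-22
    return (-1) ** s * (1.0 + m / 2 ** 23) * 2.0 ** (e - 127)

bits = 0xBEE00000   # S = 1, E = 125, M = 110 0000 ... (binary)
print(decode_single(bits))                               # -0.4375
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])   # -0.4375 (cross-check)
```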


    Floating Point Addition

    Add the following two decimal numbers in scientific notation:

    8.70 × 10^-1 with 9.95 × 10^1

    1. Rewrite the smaller number such that its exponent matches with the exponent of the larger number.

      8.70 × 10^-1 = 0.087 × 10^1

    2. Add the mantissas

      9.95 + 0.087 = 10.037 and write the sum 10.037 × 10^1

    3. Put the result in Normalised Form

      10.037 × 10^1 = 1.0037 × 10^2 (shift mantissa, adjust exponent)

      check for overflow/underflow of the exponent after normalisation

    4. Round the result

      If the mantissa does not fit in the space reserved for it, it has to be rounded off.

      For Example: If only 4 digits are allowed for mantissa

      1.0037 × 10^2 ===> 1.004 × 10^2

    5. (No further step is needed here: only binary floating-point numbers have a hidden bit.)
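The four steps can be sketched with Python's decimal module. The function name add_decimal, the digits parameter and the round-half-up rounding mode are assumptions made for this illustration; the notes above do not say which rounding rule is intended.

```python
from decimal import Decimal, ROUND_HALF_UP

def add_decimal(m1, e1, m2, e2, digits=4):
    """Add m1 × 10^e1 and m2 × 10^e2 (Decimal mantissas, int exponents)."""
    # 1. Align: rewrite the number with the smaller exponent
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2.scaleb(e2 - e1)
    # 2. Add the mantissas
    m, e = m1 + m2, e1
    # 3. Normalise to a single non-zero digit left of the decimal point
    while abs(m) >= 10:
        m, e = m.scaleb(-1), e + 1
    while m != 0 and abs(m) < 1:
        m, e = m.scaleb(1), e - 1
    # 4. Round to the available digits (round-half-up is an assumption)
    m = m.quantize(Decimal(1).scaleb(-(digits - 1)), rounding=ROUND_HALF_UP)
    if abs(m) >= 10:   # rounding can carry into a new digit, e.g. 9.9996
        m, e = m.scaleb(-1), e + 1
    return m, e

print(add_decimal(Decimal("8.70"), -1, Decimal("9.95"), 1))
# (Decimal('1.004'), 2), i.e. 1.004 × 10^2
```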


    Example addition in binary

    Perform 0.5 + (-0.4375)

    0.5 = 0.1 × 2^0 = 1.000 × 2^-1 (normalised)

    -0.4375 = -0.0111 × 2^0 = -1.110 × 2^-2 (normalised)

    1. Rewrite the smaller number such that its exponent matches with the exponent of the larger number.

      -1.110 × 2^-2 = -0.1110 × 2^-1

    2. Add the mantissas:

      1.000 × 2^-1 + (-0.1110 × 2^-1) = 0.001 × 2^-1

    3. Normalise the sum, checking for overflow/underflow:

      0.001 × 2^-1 = 1.000 × 2^-4

      -126 <= -4 <= 127 ===> No overflow or underflow

    4. Round the sum:

      The sum fits in 4 bits so rounding is not required

      Check: 1.000 × 2^-4 = 0.0625 which is equal to 0.5 - 0.4375

    Correct!
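The same steps in binary can be sketched with exact fractions, so that no bits are lost while aligning. The function name add_binary and the frac_bits parameter are made up for this example, and rounding uses Python's built-in round, which is an assumption rather than something stated in the notes.

```python
from fractions import Fraction

def add_binary(m1, e1, m2, e2, frac_bits=3):
    """Add m1 × 2^e1 and m2 × 2^e2, with signed Fraction mantissas."""
    # 1. Align to the larger exponent
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2 / 2 ** (e1 - e2)
    # 2. Add the mantissas
    m, e = m1 + m2, e1
    # 3. Normalise so that 1 <= |m| < 2, then check the exponent range
    while abs(m) >= 2:
        m, e = m / 2, e + 1
    while m != 0 and abs(m) < 1:
        m, e = m * 2, e - 1
    if not -126 <= e <= 127:
        raise ArithmeticError("exponent overflow or underflow")
    # 4. Round to frac_bits bits after the binary point
    scale = 2 ** frac_bits
    m = Fraction(round(m * scale), scale)
    return m, e

# 0.5 + (-0.4375): 1.000 × 2^-1 plus -1.110 × 2^-2
print(add_binary(Fraction(1), -1, Fraction(-7, 4), -2))
# (Fraction(1, 1), -4), i.e. 1.000 × 2^-4
```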


    Floating Point Multiplication

    Multiply the following two numbers in scientific notation by hand:

    (1.110 × 10^10) × (9.200 × 10^-5)

    1. Add the exponents to find

      New Exponent = 10 + (-5) = 5

      If we add biased exponents, bias will be added twice. Therefore we need to subtract it once to compensate:

      (10 + 127) + (-5 + 127) = 259

      259 - 127 = 132 which is (5 + 127) = biased new exponent

    2. Multiply the mantissas

      1.110 × 9.200 = 10.212000

      Can only keep three digits to the right of the decimal point, so the result is

      10.212 × 10^5

    3. Normalise the result

      1.0212 × 10^6

    4. Round it

      1.021 × 10^6
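A small sketch of these steps, again using Python's decimal module. The names add_biased and multiply_decimal are made up for this example, the inputs are assumed to be normalised already, and round-half-up is an assumed rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

BIAS = 127

def add_biased(b1, b2):
    """Add two biased exponents, subtracting the bias once to compensate."""
    return b1 + b2 - BIAS

def multiply_decimal(m1, e1, m2, e2, digits=4):
    """Multiply m1 × 10^e1 by m2 × 10^e2 (normalised Decimal mantissas)."""
    e = e1 + e2                  # 1. add the exponents
    m = m1 * m2                  # 2. multiply the mantissas
    while abs(m) >= 10:          # 3. normalise
        m, e = m.scaleb(-1), e + 1
    # 4. round to the available digits (round-half-up is an assumption)
    m = m.quantize(Decimal(1).scaleb(-(digits - 1)), rounding=ROUND_HALF_UP)
    return m, e

print(add_biased(10 + BIAS, -5 + BIAS))   # 132, which is 5 + 127
print(multiply_decimal(Decimal("1.110"), 10, Decimal("9.200"), -5))
# (Decimal('1.021'), 6), i.e. 1.021 × 10^6
```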


    Example multiplication in binary:

    (1.000 × 2^-1) × (-1.110 × 2^-2)

    1. Add the biased exponents

      (-1 + 127) + (-2 + 127) - 127 = 124 ===> (-3 + 127)

    2. Multiply the mantissas

                    1.000
                 ×  1.110
                 -----------
                        0000
                       1000
                      1000
                 +   1000
                 -----------
                     1110000  ===> 1.110000
      
      The product is 1.110000 × 2^-3
      Need to keep it to 4 bits: 1.110 × 2^-3

    3. Normalise (already normalised)

      At this step check for overflow/underflow by making sure that

      -126 <= Exponent <= 127

      1 <= Biased Exponent <= 254

    4. Round the result (no change)
    5. Adjust the sign.

      Since the original signs are different, the result will be negative

      -1.110 × 2^-3
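The five steps can be sketched with integer mantissas, where 0b1110 stands for 1.110 in binary. The function name multiply_binary and its argument layout are made up for this example, and rounding is simplified to truncation, which happens to give the same answer here because the discarded bits are all zero.

```python
BIAS = 127

def multiply_binary(s1, be1, m1, s2, be2, m2, frac_bits=3):
    """Multiply two normalised values given as (sign, biased exponent, mantissa).

    Mantissas are ints holding 1.xxx with frac_bits bits after the point."""
    # 1. Add the biased exponents, subtracting the bias once
    be = be1 + be2 - BIAS
    # 2. Multiply the mantissas: the product has 2 * frac_bits fraction bits
    p = m1 * m2
    # 3. Normalise: the product of two 1.xxx values lies in [1, 4)
    shift = frac_bits
    if p >> (2 * frac_bits + 1):    # product is 2 or more
        be += 1
        shift += 1
    if not 1 <= be <= 254:
        raise ArithmeticError("exponent overflow or underflow")
    # 4. Round, here simply by truncating the extra bits (a simplification)
    m = p >> shift
    # 5. Adjust the sign: negative only if the input signs differ
    s = s1 ^ s2
    return s, be, m

# (1.000 × 2^-1) × (-1.110 × 2^-2)
print(multiply_binary(0, 126, 0b1000, 1, 125, 0b1110))
# (1, 124, 14): sign 1, biased exponent 124 = -3 + 127, mantissa 1.110
```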





    last updated: 2-Dec-04 Ian Harries <ih@doc.ic.ac.uk>