Floating Point Numbers
Real Numbers: pi = 3.14159265... e = 2.71828...
Scientific Notation: has a single digit to the left of the decimal point.
A number in Scientific Notation with no leading 0s is called a Normalised Number: 1.0 × 10^-8
Not in normalised form: 0.1 × 10^-7 or 10.0 × 10^-9
Binary numbers can also be represented in scientific notation: 1.0 × 2^-3
Computer arithmetic that supports such numbers is called Floating Point.
The form is 1.xxxx × 2^yy
Using normalised scientific notation
Representation of Floating-Point numbers
(-1)^S × M × 2^E
Bit No | Size | Field Name |
---|---|---|
31 | 1 bit | Sign (S) |
23-30 | 8 bits | Exponent (E) |
0-22 | 23 bits | Mantissa (M) |
A Single-Precision floating-point number occupies 32 bits, so there is a compromise between the size of the mantissa and the size of the exponent.
These chosen sizes provide a range of approx:
± 10^-38 ... 10^38
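The quoted range can be checked from the field widths alone. The sketch below is illustrative only: the function name `normal_range` is hypothetical, and it assumes the IEEE conventions that the bias is 2^(k-1) - 1 for a k-bit exponent field and that the all-0s and all-1s exponent patterns are reserved.

```python
import sys

def normal_range(exp_bits, man_bits):
    """Smallest and largest positive normalised values for the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1               # 127 for an 8-bit exponent (assumed IEEE rule)
    e_min = 1 - bias                             # stored exponent 1 (all 0s is reserved)
    e_max = bias                                 # stored exponent 2^k - 2 (all 1s is reserved)
    largest_mantissa = 2.0 - 2.0 ** -man_bits    # 1.111...1 in binary
    return 2.0 ** e_min, largest_mantissa * 2.0 ** e_max

lo, hi = normal_range(8, 23)
print(f"single precision: {lo:.4e} ... {hi:.4e}")   # 1.1755e-38 ... 3.4028e+38
```

So the "approx 10^-38 ... 10^38" figure corresponds to 2^-126 at the small end and just under 2^128 at the large end.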
Overflow: the exponent is too large to be represented in the Exponent field
Underflow: the number is too small (its negative exponent is too large in magnitude) to be represented in the Exponent field
To reduce the chances of underflow/overflow, 64-bit Double-Precision arithmetic can be used:
Bit No | Size | Field Name |
---|---|---|
63 | 1 bit | Sign (S) |
52-62 | 11 bits | Exponent (E) |
0-51 | 52 bits | Mantissa (M) |
providing a range of approx:
± 10^-308 ... 10^308
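To see the difference between the two ranges in practice, the following sketch forces a value into single precision with Python's standard `struct` module (Python's own `float` is a 64-bit double). The helper name `to_single` is made up for this illustration.

```python
import struct

def to_single(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_single(1e-50))        # underflow: too small for single precision, becomes 0.0
try:
    to_single(1e50)            # overflow: too large for single precision
except OverflowError as err:
    print("overflow:", err)
print(1e50, 1e-50)             # both values are representable as doubles
```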
These formats are called ...
IEEE 754 Floating-Point Standard
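Since the mantissa is always 1.xxxxxxxxx in the normalised form, there is no need to represent the leading 1. So, effectively, the mantissa has 1 + 23 = 24 bits in single precision and 1 + 52 = 53 bits in double precision.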
Since zero (0.0) has no leading 1, to distinguish it from other numbers it is given the reserved bit pattern of all 0s for the exponent, so that the hardware won't attach a leading 1 to it. Thus every other (normalised) number has the value (-1)^S × (1 + mantissa) × 2^E.
If we number the mantissa bits from left to right m1, m2, m3, ...
mantissa = m1 × 2^-1 + m2 × 2^-2 + m3 × 2^-3 + ...
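As a small illustration of this sum, the sketch below evaluates a string of mantissa bits. The function name `fraction_value` is hypothetical.

```python
def fraction_value(bits):
    """Value of m1 × 2^-1 + m2 × 2^-2 + ... for a string of mantissa bits."""
    return sum(int(b) * 2.0 ** -i for i, b in enumerate(bits, start=1))

print(fraction_value("110"))    # 0.5 + 0.25 = 0.75
```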
Negative exponents could pose a problem in comparisons.
For example (with two's complement):
Value | Sign | Exponent | Mantissa |
---|---|---|---|
1.0 × 2^-1 | 0 | 11111111 | 0000000 00000000 00000000 |
1.0 × 2^+1 | 0 | 00000001 | 0000000 00000000 00000000 |
With this representation, the exponent field of the smaller number (11111111) looks like the "larger" binary number, making direct comparison more difficult.
To avoid this, Biased Notation is used for exponents.
If the real exponent of a number is X then it is represented as (X + bias)
IEEE single-precision uses a bias of 127. Therefore, an exponent of
-1 is represented as -1 + 127 = 126 = 01111110₂
0 is represented as 0 + 127 = 127 = 01111111₂
+1 is represented as +1 + 127 = 128 = 10000000₂
+5 is represented as +5 + 127 = 132 = 10000100₂
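The encodings listed above, and the reason they compare more easily than two's complement, can be reproduced with a short sketch. The function names are made up for this illustration.

```python
BIAS = 127  # single-precision bias, as given in the text

def biased(exponent):
    """8-bit biased encoding of a real exponent."""
    return format(exponent + BIAS, "08b")

def twos_complement(exponent):
    """8-bit two's complement encoding, shown for comparison."""
    return format(exponent & 0xFF, "08b")

for x in (-1, 0, 1, 5):
    print(f"{x:+d}  biased {biased(x)}  two's complement {twos_complement(x)}")

# Equal-length bit strings compare like unsigned integers.
print(biased(-1) < biased(1))                      # True: order is preserved
print(twos_complement(-1) < twos_complement(1))    # False: -1 looks larger than +1
```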
So the actual exponent is found by subtracting the bias from the stored exponent. Therefore, given S, E, and M fields, an IEEE floating-point number has the value:
(-1)^S × (1.0 + 0.M) × 2^(E - bias)
(Remember: it is (1.0 + 0.M) because, with normalised form, only the fractional part of the mantissa needs to be stored)
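The formula can be tried out on a real bit pattern. The sketch below uses Python's standard `struct` module to obtain the 32 bits of a single-precision value and then applies the formula; `decode_single` and `value_of` are hypothetical helper names, and the reconstruction is only meant for normalised numbers (it ignores the reserved exponent patterns).

```python
import struct

def decode_single(x):
    """Split x (rounded to single precision) into its S, E and M fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31                  # bit 31
    exponent = (word >> 23) & 0xFF     # bits 23-30, still biased
    mantissa = word & 0x7FFFFF         # bits 0-22, the stored fraction
    return sign, exponent, mantissa

def value_of(sign, exponent, mantissa, bias=127):
    """(-1)^S × (1.0 + 0.M) × 2^(E - bias), for normalised numbers only."""
    return (-1) ** sign * (1.0 + mantissa / 2 ** 23) * 2.0 ** (exponent - bias)

s, e, m = decode_single(-0.4375)
print(s, format(e, "08b"), format(m, "023b"))   # 1 01111101 11000000000000000000000
print(value_of(s, e, m))                        # -0.4375
```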
Floating Point Addition
Add the following two decimal numbers in scientific notation:
8.70 × 10^-1 with 9.95 × 10^1
Align the number with the smaller exponent: 8.70 × 10^-1 = 0.087 × 10^1
Add the mantissas: 9.95 + 0.087 = 10.037 and write the sum 10.037 × 10^1
Normalise the sum: 10.037 × 10^1 = 1.0037 × 10^2 (shift mantissa, adjust exponent)
check for overflow/underflow of the exponent after normalisation
If the mantissa does not fit in the space reserved for it, it has to be rounded off.
For Example: If only 4 digits are allowed for mantissa
1.0037 × 10^2 ===> 1.004 × 10^2
(Note: only binary floating-point numbers have a hidden bit; in decimal every digit of the mantissa is written out)
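The decimal example can be checked with Python's standard `decimal` module, which rounds every result to a chosen number of significant digits. This is only a cross-check of the arithmetic, not a model of the hardware steps; the helper name `add_rounded` is hypothetical, and the module's default rounding mode (round half to even) is assumed.

```python
from decimal import Decimal, localcontext

def add_rounded(a, b, digits=4):
    """Add two decimal strings, keeping `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits              # 4 digits of mantissa, as in the example
        return Decimal(a) + Decimal(b)

print(add_rounded("8.70E-1", "9.95E1"))   # 100.4, i.e. 1.004 × 10^2
```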
Example addition in binary
Perform 0.5 + (-0.4375)
0.5 = 0.1 × 2^0 = 1.000 × 2^-1 (normalised)
-0.4375 = -0.0111 × 2^0 = -1.110 × 2^-2 (normalised)
-1.110 × 2^-2 = -0.1110 × 2^-1
1.000 × 2^-1 + -0.1110 × 2^-1 = 0.001 × 2^-1
0.001 × 2^-1 = 1.000 × 2^-4
-126 <= -4 <= 127 ===> No overflow or underflow
The sum fits in 4 bits so rounding is not required
Check: 1.000 × 2^-4 = 0.0625 which is equal to 0.5 - 0.4375
Correct!
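The same steps can be sketched in code for a toy format with a 4-bit mantissa (leading 1 included). Everything here is an assumption made for illustration: numbers are `(sign, mantissa, exponent)` triples with an unbiased exponent, the names `fp_add` and `to_fraction` are hypothetical, alignment is done by shifting the larger number left so that no bits are lost before rounding, and rounding is round-half-up.

```python
from fractions import Fraction

P = 4  # mantissa width in bits, including the leading 1 (toy format)

def fp_add(a, b):
    """Add two (sign, mantissa, exponent) triples; mantissa holds the bits 1.xxx."""
    (sa, ma, ea), (sb, mb, eb) = a, b
    e = min(ea, eb)                          # align both numbers to one exponent
    va = (-ma if sa else ma) << (ea - e)
    vb = (-mb if sb else mb) << (eb - e)
    total = va + vb                          # add the signed mantissas
    if total == 0:
        return (0, 0, 0)
    sign, mag = (1, -total) if total < 0 else (0, total)
    shift = mag.bit_length() - P             # normalise to exactly P bits
    if shift > 0:
        mag = (mag + (1 << (shift - 1))) >> shift   # round half up
        if mag.bit_length() > P:             # rounding carried into a new bit
            mag >>= 1
            shift += 1
    else:
        mag <<= -shift
    e += shift
    if not -126 <= e <= 127:                 # single-precision exponent range
        raise OverflowError("exponent out of range")
    return (sign, mag, e)

def to_fraction(x):
    """Exact value of a triple, for checking results."""
    s, m, e = x
    return (-1) ** s * Fraction(m, 2 ** (P - 1)) * Fraction(2) ** e

result = fp_add((0, 0b1000, -1), (1, 0b1110, -2))   # 0.5 + (-0.4375)
print(result, to_fraction(result))                  # (0, 8, -4) 1/16
```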
Floating Point Multiplication
Multiply the following two numbers in scientific notation by hand:
(1.110 × 10^10) × (9.200 × 10^-5)
New Exponent = 10 + (-5) = 5
If we add biased exponents, bias will be added twice. Therefore we need to subtract it once to compensate:
(10 + 127) + (-5 + 127) = 259
259 - 127 = 132, which is (5 + 127), the biased new exponent
1.110 × 9.200 = 10.212000
Can only keep three digits to the right of the decimal point, so the result is
10.212 × 10^5
1.0212 × 10^6 (normalise)
1.021 × 10^6 (round)
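Both parts of this example, the biased exponent adjustment and the rounded product, can be checked with a few lines. The helper names are hypothetical, and Python's standard `decimal` module (with its default round-half-to-even mode) stands in for the hand calculation.

```python
from decimal import Decimal, localcontext

BIAS = 127

def biased_product_exponent(e1, e2):
    """Add two biased exponents, then subtract the bias once."""
    return (e1 + BIAS) + (e2 + BIAS) - BIAS

def multiply_rounded(a, b, digits=4):
    """Multiply two decimal strings, keeping `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return Decimal(a) * Decimal(b)

print(biased_product_exponent(10, -5))            # 132, i.e. 5 + 127
print(multiply_rounded("1.110E10", "9.200E-5"))   # 1.021E+6
```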
Example multiplication in binary:
(1.000 × 2^-1) × (-1.110 × 2^-2)
(-1 + 127) + (-2 + 127) - 127 = 124 ===> (-3 + 127)

        1.000
      × 1.110
    ---------
         0000
        1000
       1000
    + 1000
    ---------
      1110000  ===>  1.110000

The product is 1.110000 × 2^-3
Need to keep it to 4 bits: 1.110 × 2^-3
At this step check for overflow/underflow by making sure that
-126 <= Exponent <= 127
1 <= Biased Exponent <= 254
Since the original signs are different, the result will be negative
-1.110 × 2^-3
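The multiplication steps can be sketched for the same toy 4-bit mantissa. As before this is an illustration under stated assumptions, not a description of real hardware: numbers are `(sign, mantissa, biased_exponent)` triples, the name `fp_mul` is hypothetical, and rounding is round-half-up.

```python
BIAS = 127
P = 4  # mantissa width in bits, including the leading 1 (toy format)

def fp_mul(a, b):
    """Multiply two (sign, mantissa, biased_exponent) triples."""
    (sa, ma, ea), (sb, mb, eb) = a, b
    exponent = ea + eb - BIAS              # add biased exponents, subtract bias once
    product = ma * mb                      # 2P-1 or 2P bits wide
    if product.bit_length() == 2 * P:      # result is 1x.xxxxxx: normalise
        exponent += 1
        drop = P
    else:                                  # result is already 1.xxxxxx
        drop = P - 1
    product = (product + (1 << (drop - 1))) >> drop   # keep P bits, round half up
    if product.bit_length() > P:           # rounding carried into a new bit
        product >>= 1
        exponent += 1
    if not 1 <= exponent <= 254:           # overflow/underflow check from the text
        raise OverflowError("biased exponent out of range")
    sign = sa ^ sb                         # negative when the signs differ
    return (sign, product, exponent)

# (1.000 × 2^-1) × (-1.110 × 2^-2)
print(fp_mul((0, 0b1000, 126), (1, 0b1110, 125)))   # (1, 14, 124): -1.110 × 2^-3
```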
Further Reading
IEEE-754 References and Conversion and another Converter
last updated: 2-Dec-04 Ian Harries <ih@doc.ic.ac.uk>