Integers are stored precisely. However, if a value is too large for the space allowed by its type, it overflows; the compiler may not tell you, and you will simply get strange output.
If you are working with numbers that have a fractional component (or, on the rare occasion that an integer value doesn't fit even a long data type), we can use floating point numbers.
For floating point numbers, if the number of significant digits is larger than the number of bits available, the number is simply rounded to the closest value that can be represented.
This means the number may not be stored exactly. However, this loss of precision is often not a problem. We refer to this error as roundoff error.
Floating point numbers (i.e. numbers with a decimal point) are represented using what is called the IEEE 754 format. The C++ float data type stores a single precision number in 32 bits as follows:
The IEEE standard partitions the bits into a sign bit s, an 8-bit stored exponent e, and a 23-bit fraction m, and computes the floating point value using the formula
(-1)^sign * mantissa * 2^exponent
where exponent and mantissa are computed as
exponent = e - 127
mantissa = 1.m
What floating point number is stored as the following bit pattern?
0001 0001 0100 1000 0000 0000 0000 0000
Answer:
1.5777218E-28
There are a few floating point converters available online. For example, see the IEEE 754 Converter by Harald Schmidt.
C/C++ floating point types accommodate numbers in the following ranges:
float:  32 bits, range +/- 3.4 * 10^38
double: 64 bits, range +/- 1.8 * 10^308
Convert to IEEE single precision floating point format (stored in 32 bits):
(a) -.125 (b) 783 (c) .0390625
We need to fill in the sign, exponent, and mantissa in
(-1)^sign * mantissa * 2^exponent
where exponent and mantissa are computed as
exponent = e - 127   ( 8 bits )
mantissa = 1.m       ( 23 bits; the leading 1 is implicit and not stored )
The dot is called the radix point. From left to right, the binary digits of the mantissa represent combinations of the following base-2 fractions:
1/2   = 0.5
1/4   = 0.25
1/8   = 0.125
1/16  = 0.0625
1/32  = 0.03125
1/64  = 0.015625
1/128 = 0.0078125
...
For example,
3.75 = 3 + 0.5 + 0.25
     = 11.11 (binary)
     = 1.111 * (2^1)

sign     = 0
exponent = 10000000    ( 8 bits )
mantissa = 111000...00 ( 23 bits )
Note: fractions that can be fully assembled from the base-2 fractions listed above are called dyadic fractions.
To compute the mantissa (the binary equivalent of the fractional part after the radix point) of a non-dyadic fraction, such as 3.14, repeatedly multiply the decimal fractional part .14 by 2, up to 23 times. The integral part of each resulting number is the next digit of the binary mantissa, from left to right:
.14 * 2 = 0.28  ->  0
.28 * 2 = 0.56  ->  0
.56 * 2 = 1.12  ->  1
.12 * 2 = 0.24  ->  0
.24 * 2 = 0.48  ->  0
.48 * 2 = 0.96  ->  0
.96 * 2 = 1.92  ->  1
.92 * 2 = 1.84  ->  1
.84 * 2 = 1.68  ->  1
.68 * 2 = 1.36  ->  1
.36 * 2 = 0.72  ->  0
.72 * 2 = 1.44  ->  1
...

3.14 = 11.001000111101... (binary), where the digits after the radix point form the mantissa m (up to 23 bits).
A binary fraction such as 11.001000111101 needs to be normalized: in the normalized form, the integral part is always 1.
11.001000111101 = 1.1001000111101 * (2^1)
sign     = 0 (+)
exponent = 127 + 1 = 10000000 ( 8 bits )
mantissa = 1001000111101...   ( 23 bits )
When normalizing, we adjust the radix point to the left or to the right, using a positive or negative exponent, respectively:
 ...
2^-2 : e = 127 - 2 = 125
2^-1 : e = 127 - 1 = 126   <- negative power: radix point moved to the right
2^0  : e = 127 + 0 = 127
2^+1 : e = 127 + 1 = 128   <- positive power: radix point moved to the left
2^+2 : e = 127 + 2 = 129
 ...
(a) -.125    = 1 01111100 00000000000000000000000
(b) 783      = 0 10001000 10000111100000000000000
(c) .0390625 = 0 01111010 01000000000000000000000