A real number is represented by a whole number and a fractional part; the fractional part of the number is called the mantissa. The term floating point means that there is no fixed number of digits before and after the decimal point; the decimal point can float, in contrast to fixed-point numbers, in which the number of digits before and after the decimal point is set. Computers represent most floating-point numbers as approximations using complex codes, so floating-point calculations are generally slower and less accurate than fixed-point calculations, although floating-point formats can represent a larger range of numbers. In fact, one of the challenges in developing computers that handle floating-point values is ensuring that the approximations lead to reasonable results. If the programmer is not careful, small discrepancies in the approximations can snowball to the point where the final results are absurd.
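The snowballing of small discrepancies can be illustrated with a short sketch. It assumes Python's built-in float, which is a double-precision value; the helper name `naive_sum` is hypothetical and chosen only for this illustration.

```python
import math

def naive_sum(value, count):
    """Add `value` to a running total `count` times, rounding after every step."""
    total = 0.0
    for _ in range(count):
        total += value
    return total

small = naive_sum(0.1, 10)           # ideal result: 1.0
large = naive_sum(0.1, 1_000_000)    # ideal result: 100000.0

print(small == 1.0)                  # False: 0.1 has no exact binary representation
print(abs(small - 1.0))              # discrepancy after ten additions
print(abs(large - 100000.0))         # discrepancy after a million additions is far larger
print(math.fsum([0.1] * 10) == 1.0)  # True: fsum keeps track of the lost low-order bits
```

The absolute error grows with the number of operations, which is why a careful program either bounds the number of accumulated roundings or uses a compensated summation such as `math.fsum`.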
Calculating floating-point numbers requires a great deal of computing power and most microprocessors come with a chip, called a floating point unit (FPU), specialized for performing floating-point arithmetic. FPUs are also called math coprocessors and numeric coprocessors. The most popular code for representing real numbers is called the IEEE Floating-Point Standard. An IEEE-754 floating point number, also just called a “float”, has three components: a sign bit telling whether the number is positive or negative, an exponent giving its order of magnitude, and a mantissa specifying the actual digits of the number; there is also an analogous 96-bit extended-precision format under IEEE-854. Using single-precision floats as an example, here is the layout of the 32 bits of a four byte word:
sxxxxxxxxmmmmmmmmmmmmmmmmmmmmmmm
31                             0

where s = sign bit, x = exponent, and m = mantissa. The IEEE long real or double precision format has one bit for the sign, eleven bits for the exponent, and 52 bits for the mantissa. The value of the number is the mantissa times 2^x, where x is the exponent. There is a potential problem with storing both a mantissa and an exponent because, recall, with a floating point number the decimal point can float, so the number 0.2 can be represented as 2×10^−1 = 0.2×10^0 = 0.02×10^1, etc. In the IEEE standard, the problem is circumvented by normalizing the whole mantissa as being to the right of the decimal point, with an implied “1” always present to the left of the decimal point. For example, decimal 1234.567 is normalized as 1.234567×10^3 by moving the decimal point so that only one digit appears before it. A special case exists when the number is zero; then every exponent bit and every mantissa bit is zero, and the sign bit may take either value because positive and negative zero compare as equal. Zero is not the only special case: there are also representations for positive and negative infinity, and for a not-a-number (NaN) value for results that do not make sense, such as the result of a zero-divided-by-zero or an infinity-times-zero operation. A number is infinite if every bit of the exponent is set and no mantissa bit is set, and is NaN if every bit of the exponent is set and at least one mantissa bit is set. The sign bit still distinguishes +/−infinity and +/−NaN. The table below presents certain characteristics of both single- and double-precision IEEE floating point numbers:
Property                            Value for float     Value for double
Largest representable number        3.402823466e+38     1.7976931348623157e+308
Smallest number with full precision 1.175494351e−38     2.2250738585072014e−308
Smallest representable number       1.401298464e−45     5e−324
Mantissa bits                       23                  52
Exponent bits                       8                   11
Epsilon                             1.1920929e−7        2.220446049250313e−16

Epsilon is the smallest x such that 1 + x > 1; it is the place value of the least significant bit when the exponent is zero.
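The field layout and the table entries can be checked with a minimal sketch using Python's standard `struct` module. The function names `float32_fields`, `float32_from_bits`, and `classify` are hypothetical helpers introduced here. The stored exponent carries a bias of 127 in single precision, a detail of IEEE-754 that the text above does not spell out.

```python
import struct
import sys

def float32_fields(value):
    """Split a number, stored as a 32-bit float, into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30..23, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # bits 22..0; the implied leading 1 is not stored
    return sign, exponent, mantissa

def float32_from_bits(bits):
    """Interpret a 32-bit integer pattern as a single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

def classify(value):
    """Name the special case, if any, that the bit pattern of `value` encodes."""
    _, exponent, mantissa = float32_fields(value)
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "NaN"
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    return "normalized"

print(float32_fields(1.0))                     # (0, 127, 0)
print(float32_fields(0.2))                     # the mantissa holds the rounded digits of 0.2
for sample in (0.0, 1234.567, float("inf"), float("nan")):
    print(sample, classify(sample))

# Single-precision table entries, rebuilt from their bit patterns.
print(float32_from_bits(0x7F7FFFFF))           # largest representable number
print(float32_from_bits(0x00800000))           # smallest number with full precision
print(float32_from_bits(0x00000001))           # smallest representable number
print(float32_from_bits(0x3F800001) - 1.0)     # epsilon: gap between 1.0 and the next float

# Double-precision entries as reported by the interpreter.
print(sys.float_info.max, sys.float_info.min, sys.float_info.epsilon)
```

Big-endian packing (`">f"`, `">I"`) is used only so that the shifts match the bit numbering of the layout; the choice does not depend on the byte order of the host machine.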
Floating point arithmetic uses sophisticated numerical algorithms. Because of the number of bits used to represent a floating point number, it is difficult to ascertain when two floating point numbers are equal, i.e., it is relatively rare for the numbers to match bit for bit, especially when some of the bits might be approximations and/or incorrect. For example, in the table above, epsilon gives a measure of the tolerance within which two numbers may be treated as equal. The precision of a floating point number is measured in significant digits and determined by the number of correct bits. Although a 32-bit integer can represent any nine-digit decimal number, a 32-bit floating point number offers only about seven digits of precision.
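One common way to apply epsilon when testing equality is sketched below. The function `nearly_equal` and its `ulps` parameter are hypothetical, and the tolerance of four epsilon-sized steps is an arbitrary assumption that a real application would tune to its own error budget.

```python
import math
import sys

def nearly_equal(a, b, ulps=4):
    """Treat a and b as equal when they differ by at most `ulps` epsilon-sized
    steps, scaled to the magnitude of the larger operand."""
    scale = max(abs(a), abs(b))
    return abs(a - b) <= ulps * sys.float_info.epsilon * scale

total = 0.1 + 0.2
print(total == 0.3)               # False: the two values differ in the last bit
print(nearly_equal(total, 0.3))   # True: the difference is well under four steps
print(math.isclose(total, 0.3))   # True: standard-library relative comparison
print(nearly_equal(1.0, 1.0001))  # False: a genuinely different value
```

A purely relative test like this one becomes very strict when both operands are close to zero, so an absolute tolerance is often added for that case; `math.isclose` exposes one through its `abs_tol` argument.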
Uncertain equality is only the tip of the iceberg of problems caused by limited accuracy and precision. Loss of significance refers to a class of situations wherein precision and information have been inadvertently lost with the potential of ending up with laughably bad results. To simplify things, one way to consider the problem of loss of precision is that a float gradually gets “corrupted” as more and more operations are performed. For example, a fraction such as ⅕ or 0.2 must be represented by a sum of fractions whose denominators are powers of 2. Below is the output from a program that subtracts each successive fraction from 0.2 and shows each remainder. As can be seen below, an exact value is not realized even after creating the 23 mantissa bits. The result, however, is accurate to only seven digits.
starting:    0.200000000000
subtracting  0.125000000000
remainder =  0.075000000000
subtracting  0.062500000000
remainder =  0.012500000000
subtracting  0.007812500000
remainder =  0.004687500000
subtracting  0.003906250000
remainder =  0.000781250000
subtracting  0.000488281250
remainder =  0.000292968750
subtracting  0.000244140625
remainder =  0.000048828125
subtracting  0.000030517578
remainder =  0.000018310547
subtracting  0.000015258789
remainder =  0.000003051758
subtracting  0.000001907349
remainder =  0.000001144409
subtracting  0.000000953674
remainder =  0.000000190735
subtracting  0.000000119209
remainder =  0.000000071526
Mantissa: .00110011001100110011001
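The program that produced this output is not given; the following is a minimal sketch of one that yields the same sequence. It uses exact rational arithmetic from Python's `fractions` module so that the remainders are not themselves disturbed by rounding, and the name `binary_expansion` is a hypothetical choice.

```python
from fractions import Fraction

def binary_expansion(value, bits=23):
    """Greedily subtract 1/2, 1/4, 1/8, ... from `value`, recording a 1 bit for
    every fraction that fits and a 0 bit for every fraction that does not."""
    remainder = Fraction(value)
    digits = []
    steps = []                                  # (fraction subtracted, remainder left)
    for position in range(1, bits + 1):
        term = Fraction(1, 2 ** position)
        if term <= remainder:
            remainder -= term
            digits.append("1")
            steps.append((term, remainder))
        else:
            digits.append("0")
    return "".join(digits), steps, remainder

digits, steps, remainder = binary_expansion(Fraction(1, 5))

print(f"starting:    {0.2:.12f}")
for term, left in steps:
    print(f"subtracting  {float(term):.12f}")
    print(f"remainder =  {float(left):.12f}")
print(f"Mantissa: .{digits}")
```

The final remainder is exactly 3/(5×2^23), about 7.15×10^−8, so the repeating pattern 0011 never terminates and 0.2 cannot be stored exactly in any finite number of binary digits.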
In some computer hardware architectures, such as the PowerPC, floating point operations for arithmetic, rounding, and conversion may produce an intermediate result that may be regarded as being infinitely precise. This infinitely precise result, however, must be written into a floating point register having a finite number of bits. After normalization or denormalization, if the infinitely precise intermediate result is not representable in the precision required by the instruction, then it is rounded before being placed into the target floating point register.
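The idea of an exact intermediate result followed by a single rounding can be imitated in software. The sketch below is an illustration only, not a model of any particular processor: it relies on the fact that the product of two 24-bit single-precision significands needs at most 48 bits and therefore fits exactly in a double, which plays the role of the wide intermediate. The helper `to_float32` is hypothetical.

```python
from fractions import Fraction
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    (rounded,) = struct.unpack(">f", struct.pack(">f", x))
    return rounded

a = to_float32(0.1)
b = to_float32(0.2)

exact = Fraction(a) * Fraction(b)   # exact rational product, no rounding at all
wide = a * b                        # double product: 24 x 24 bits fit in 53 bits
rounded = to_float32(wide)          # one rounding into the 32-bit destination

print(Fraction(wide) == exact)      # True: the wide intermediate is exact
print(Fraction(rounded) == exact)   # False: the destination cannot hold it
print(float(abs(Fraction(rounded) - exact)))   # size of the rounding error
```

Because the intermediate is exact, the error of the stored result is at most half the place value of its least significant bit, which is the behavior the rounding rules are meant to guarantee.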
Hardware implementations of floating point divide and square root instructions may use a Newton-Raphson algorithm or a power series approximation in a fused multiply-add arithmetic pipeline. A series of multiply-add instructions implements the algorithm, and the result is rounded up or down in order to conform to the IEEE-754 floating point standard. Correct rounding usually requires extra precision in the arithmetic dataflow, mandating extra hardware, e.g., increased gate counts, power, and logic delay. For instance, one method of ensuring that the confidence level of floating point calculations is good enough has been to make the whole hardware pipeline one bit wider and then to calculate the remainder. There is thus a need for a more efficient means to accomplish the required rounding.
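A Newton-Raphson divide can be sketched in software as follows. This is an illustration under stated assumptions rather than a description of any hardware pipeline: the function names are hypothetical, the linear first guess 48/17 − (32/17)·m is one textbook choice for divisors scaled into [0.5, 1), and ordinary Python arithmetic rounds after each multiply and each subtract instead of fusing them.

```python
import math

def nr_reciprocal(m, iterations=4):
    """Approximate 1/m for 0.5 <= m < 1 by Newton-Raphson iteration."""
    x = 48.0 / 17.0 - (32.0 / 17.0) * m   # linear first guess from fixed constants
    for _ in range(iterations):
        x = x * (2.0 - m * x)             # each step roughly doubles the correct bits
    return x

def nr_divide(n, d, iterations=4):
    """Approximate n / d using multiplies and subtracts on a scaled divisor."""
    if not (d > 0.0) or math.isinf(d):
        raise ValueError("this sketch handles finite, positive divisors only")
    m, e = math.frexp(d)                  # d == m * 2**e with 0.5 <= m < 1
    return n * math.ldexp(nr_reciprocal(m, iterations), -e)

for n, d in ((1.0, 3.0), (355.0, 113.0), (2.0, 0.75)):
    approx = nr_divide(n, d)
    print(n, d, approx, approx - n / d)   # difference from the built-in division
```

The iteration converges to within a few units of the last place, but nothing in it guarantees that the final bit matches the correctly rounded quotient. Deciding that last bit is the step for which the text describes widening the pipeline and computing the remainder.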