Floating-point arithmetic has become indispensable to modern computing. It is used in fields as diverse as gaming, financial analysis, high-performance scientific computing, and numerous others. However, despite this widespread usage, few developers take into account the approximate nature of such computations.
The current standard in floating-point arithmetic is the IEEE-754-2008 standard (see “IEEE Standard for Floating-point Arithmetic”, IEEE 754-2008, pp. 1-58, August 2008). It uses a finite number of bits to describe an infinite space by dividing the bits into three parts: the sign bit, the exponent, and the mantissa. The number of bits in the exponent controls the range of values that may be represented, and the number of bits in the mantissa controls the precision. The IEEE-754 standard guarantees the accuracy of a single operation: the result of an elementary arithmetic operation (addition, subtraction, multiplication, division, and square root) must be within 0.5 Units of Least Precision (ULP) of the mathematically exact result under the default round-to-nearest mode. However, one of its shortcomings is the lack of accuracy guarantees on the result of a sequence of operations.
The IEEE-754 floating-point standard defines several levels of precision for binary floating-point numbers: a single precision floating-point number has an 8-bit exponent, a 24 (23+1) bit mantissa, and a sign bit; a double precision floating-point number has an 11-bit exponent, a 53 (52+1) bit mantissa, and a sign bit; and a quad precision floating-point number has a 15-bit exponent, a 113 (112+1) bit mantissa, and a sign bit. In each case the "+1" is the implicit leading bit, which is not stored. The precision of a floating-point number has a direct impact on its accuracy.
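The double precision layout described above can be inspected directly. The following is a minimal Python sketch, not part of the standard: the helper name `decompose_double` is hypothetical, and the sketch assumes Python's `float` is an IEEE-754 binary64 value, which holds on common platforms.

```python
import struct

def decompose_double(x: float):
    """Split a binary64 value into its sign, biased exponent, and stored fraction fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 sign bit
    exponent = (bits >> 52) & 0x7FF      # 11-bit exponent, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)    # 52 stored mantissa bits (leading 1 is implicit)
    return sign, exponent, fraction

# -6.5 is -1.625 * 2**2, so the unbiased exponent is 2 and the stored fraction encodes 0.625.
sign, exponent, fraction = decompose_double(-6.5)
unbiased_exponent = exponent - 1023
```

The 1 + 11 + 52 stored bits account for all 64 bits of the format; the 53rd mantissa bit is the implicit leading one.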
There is a rounding error associated with each floating-point operation; however, the error of a single operation is easily quantified. It is the rounding that may occur after the operation has been executed. This rounding, whatever the rounding mode, affects only the least significant bit of the mantissa and is at most 1 Unit of Least Precision (ULP). This level of error is acceptable in most applications for a single floating-point operation, since the error is limited to the least significant bit.
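The size of a single rounding error can be measured in ULPs by comparing the computed result with an exact rational computation. The sketch below is illustrative only: the helper name `rounding_error_in_ulps` is hypothetical, it assumes Python 3.9 or later for `math.ulp`, and it assumes the platform uses the default round-to-nearest mode.

```python
import math
from fractions import Fraction

def rounding_error_in_ulps(a: float, b: float) -> Fraction:
    """Error of the floating-point sum a + b, in ULPs of the computed result."""
    computed = a + b
    # Fraction(float) is exact, so this is the true sum of the two stored operands.
    exact = Fraction(a) + Fraction(b)
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(computed))

# Under round-to-nearest, a single addition is off by no more than half a ULP.
err = rounding_error_in_ulps(0.1, 0.2)
within_half_ulp = err <= Fraction(1, 2)
```

Note that this measures only the error introduced by the addition itself; the error already present in the operands (0.1 and 0.2 are not exactly representable in binary) is not counted.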
However, a greater problem arises from error propagation, when the error that was limited to the least significant bit of the mantissa is promoted to a more significant bit, resulting in a greater error. This can happen in two different scenarios: gradual error and catastrophic error. Gradual error, or “creeping crud”, is caused by a series of operations that gradually promote the bit containing the initial rounding error to a more significant place. Catastrophic error, or left digit destruction, occurs when the bit that contains the rounding error is propagated to a much more significant position in the mantissa in a single operation. A simple example is the subtraction of two numbers that should be the same but which, due to rounding errors, differ by a single bit in the least significant position of the mantissa. The leading bits cancel, and the one remaining bit, which is pure rounding error, is moved to the most significant position of the mantissa when the result is normalized. The problem is exacerbated if the numbers being subtracted are large, because in that case the erroneous mantissa is scaled by a large exponent, as illustrated in the example below.
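Both scenarios can be reproduced in a few lines. The following Python sketch is an illustration under stated assumptions, not the document's own example: the helper name `one_bit_off_difference` is hypothetical, `math.nextafter` requires Python 3.9 or later, and the neighbouring value stands in for a number that "should" equal `x` but picked up a one-bit rounding error earlier.

```python
import math

def one_bit_off_difference(x: float) -> float:
    """Subtract x from the next representable double above it (one-bit difference in the mantissa)."""
    neighbour = math.nextafter(x, math.inf)
    # The subtraction itself is exact; the whole result is the earlier one-bit discrepancy.
    return neighbour - x

# Catastrophic error: the ideal answer is 0 in both cases, but the surviving bit
# is scaled by the exponent of the operands, so large operands give a large error.
small = one_bit_off_difference(1.0)     # one ULP at magnitude 1
large = one_bit_off_difference(1e15)    # one ULP at magnitude 1e15

# Gradual error: each addition of 0.1 rounds, and the roundings accumulate.
total = 0.0
for _ in range(10):
    total += 0.1
drift = abs(total - 1.0)                # nonzero, although ten tenths should equal 1
```

The same one-bit discrepancy yields an absolute error many orders of magnitude larger at 1e15 than at 1.0, which is the effect of the large exponent described above.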