The present invention relates generally to the field of floating-point number computations and more particularly to the precision of a calculation involving the addition of multiple floating-point numbers.
The value of a number in a computer (a computer number) is encoded in one or more adjacent patterns of bits. The format of the encoding depends on the type of the number and is defined in the architecture of an instruction set. Fixed point, binary coded decimal, and floating-point are examples of types of numbers defined in instruction sets. A computer interprets the bits of a number, and computer instructions manipulate the number, according to the number's type and its base. A base assigns a value to a bit that depends on the position of the bit in the number relative to a radix point, an implicit location in the number defined by the architecture or by a standard to which the architecture adheres. In a “floating-point” number, the location of the radix point within the stored format is defined by the architecture, but its position in the value that the number represents is not fixed, i.e., it can “float” across the various values that a floating-point number can represent.
Floating-point numbers can represent extremely large and extremely small numbers with a fixed number of bits. This is accomplished, in essence, by representing the value of a number as a product of two values held in separate bit-fields within the number. A base-2 (binary) floating-point number, for example, is expressed with a sign bit (indicating positive or negative), a fraction (sometimes called a mantissa), and an exponent. An implied bit (implicit bit) that is associated with the fraction, and that is deduced rather than physically present, is often used in binary floating-point numbers to increase the number of bits available to express the fraction or exponent. The fraction becomes a significand when the implied bit is prepended to the fraction as its leading bit. The exponent stored in a computer number is often not the real-exponent that is used to evaluate the number. A value called a bias is usually added to the real-exponent when the floating-point number is formed. The real-exponent is then the stored exponent minus the bias.
A floating-point number is evaluated by multiplying the significand by the base raised to the power of the real-exponent. Let x be the value of a floating-point number; then $x = \pm 1 \times \text{significand} \times \text{base}^{\text{exponent} - \text{bias}}$. For example, if the base is 2 and the bias is 127, as in the single-precision binary format of the IEEE floating-point standard, then $x = \pm 1 \times \text{significand} \times 2^{\text{exponent} - 127}$. If the base is 16 and the bias is 64, as in one type of floating-point number in a floating-point hexadecimal format, then $x = \pm 1 \times \text{significand} \times 16^{\text{exponent} - 64}$.
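The evaluation described above can be illustrated with a minimal Python sketch. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits, an implied leading 1 bit, and the bias of 127 that accompanies a significand in the range [1, 2)). The function names `decode_binary32` and `evaluate_binary32` are hypothetical and chosen only for illustration, and only normal numbers are handled.

```python
import struct


def decode_binary32(value):
    """Split a number, stored as IEEE 754 single precision, into its bit-fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF   # 8 bits: stored (biased) exponent
    fraction = bits & 0x7FFFFF       # 23 bits: stored fraction
    return sign, exponent, fraction


def evaluate_binary32(sign, exponent, fraction, bias=127):
    """Evaluate (+/-1) x significand x 2**(exponent - bias) for a normal number."""
    if not 0 < exponent < 255:
        # Zero, subnormals, infinities, and NaNs are outside this sketch.
        raise ValueError("only normal numbers are handled in this sketch")
    significand = 1 + fraction / 2**23   # prepend the implied leading 1 bit
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)


sign, exponent, fraction = decode_binary32(-6.5)
print(sign, exponent, fraction)                      # bit-fields of -6.5
print(evaluate_binary32(sign, exponent, fraction))   # value rebuilt from the fields
```

Here -6.5 equals -1.625 × 2², so the stored exponent is 2 + 127 = 129 and the real-exponent recovered by subtracting the bias is 2.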
Due to the limited number of bits available in a floating-point number to express the fraction and exponent, an effect called “catastrophic cancellation” can occur when multiple floating-point numbers that differ significantly in magnitude are added. A small value added to a large value can require more bits than are available in the fraction of the result, which causes some or all of the small value to be lost. When two floating-point numbers are added, the exponent of the number with the smaller exponent is increased to equal that of the number with the larger exponent, and the fraction of the smaller number is bit-shifted right to compensate for the increased exponent (effectively dividing the fraction to offset the multiplication caused by increasing the exponent). Sometimes the fraction of the smaller number is bit-shifted to the right far enough to cause bits on the right end of the fraction to be lost. For example, given two floating-point numbers, a and b, where a is much smaller in magnitude than b, the sequence of additions b+a−b can incorrectly produce a result that differs from a. A re-ordered sequence of additions, b−b+a, will produce the correct result a. However, the value of a floating-point variable in a program can be unknown in advance of a calculation, so altering a sequence of floating-point operations to increase the precision of a result is often impractical.
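The loss described above can be reproduced with a short Python sketch. It assumes that Python floats are IEEE 754 double-precision numbers with a 53-bit significand, as on common platforms; the specific values of a and b are chosen only for illustration.

```python
# Total loss: a is entirely shifted out of the fraction of the sum b + a.
b = 1.0e16
a = 1.0
lost = (b + a) - b        # a is absorbed when added to b, so a is not recovered
kept = (b - b) + a        # the re-ordered sequence preserves a

# Partial loss: only some of the low-order bits of a2 survive the addition.
b2 = 2.0 ** 52            # at this magnitude only whole numbers are representable
a2 = 0.75
partial = (b2 + a2) - b2  # a2 is rounded during the addition

print(lost, kept)
print(partial)
```

In the first case the sequence b+a−b returns 0.0 rather than a, while b−b+a returns a. In the second case the result is 1.0 rather than 0.75, because the sum is rounded to the nearest representable value before b2 is subtracted.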