In the design of floating point arithmetic systems for use in a floating point processing device, it is desirable that results be consistent, so that calculations and solutions to problems conform even when the problems are solved using different computer systems.
An American national standard has been developed to provide a uniform system of rules governing the implementation of floating point arithmetic systems. This standard is identified as IEEE Standard No. 754-2008 and international standard ISO/IEC/IEEE 60559:2011, which are both incorporated by reference herein. The standard specifies basic and extended floating point number formats, arithmetic operations, conversions between integer and floating point formats, conversions between different floating point formats, conversions between basic format floating point numbers and decimal strings, and the handling of certain floating point exceptions.
The typical floating point arithmetic operation may be accomplished using formats of various (usually standard) widths (for example, 32-bit, 64-bit, etc.). Each of these formats utilizes a sign, exponent and fraction field (or significand), where the respective fields occupy predefined portions of the floating point number. For example, in the case of a 32-bit single precision number the sign field is a single bit occupying the most significant bit position; the exponent field is an 8-bit quantity occupying the next-most significant bit positions; the fraction field occupies the least significant 23-bit positions. Similarly, in the case of a 64-bit double precision number the sign field is a single bit, the exponent field is 11 bits, and the fraction field is 52 bits. Additional formats provide the same information, but with varied field widths, with larger field widths providing the potential for greater accuracy and value range.
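By way of illustration only, the field layout described above can be sketched in Python. The function names `decode_binary32` and `decode_binary64` are hypothetical and are introduced here solely for the example; the sketch assumes the host's `struct` module packs `f` and `d` as the 32-bit and 64-bit formats described above.

```python
import struct

def decode_binary32(x):
    """Return (sign, biased_exponent, fraction) of x stored in 32-bit single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit, most significant position
    exponent = (bits >> 23) & 0xFF    # next 8 bits
    fraction = bits & 0x7FFFFF        # least significant 23 bits
    return sign, exponent, fraction

def decode_binary64(x):
    """Return (sign, biased_exponent, fraction) of x stored in 64-bit double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits
    fraction = bits & ((1 << 52) - 1)     # 52 bits
    return sign, exponent, fraction

print(decode_binary32(-2.5))
print(decode_binary64(1.0))
```

The exponent values returned are the stored (biased) values, not the mathematical exponents.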
After each floating point result is developed, it must be normalized and then rounded. When the result is normalized, the number of leading zeros in the fraction field is counted. This number is then subtracted from the exponent, and the fraction is shifted left until a “1” resides in the most significant bit position of the fraction field. Certain floating point answers cannot be normalized because the exponent is already at its lowest possible value and the most significant bit of the fraction field is not a “1.” This is a “subnormal number” with fewer significant digits than a normalized number.
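The normalization step described above might be sketched as follows. This is a simplified model, not a description of any particular hardware: the function name `normalize`, the 24-bit field width (23 fraction bits plus the leading bit position), and the lowest exponent value of 1 are all assumptions made for the example.

```python
def normalize(fraction, exponent, width=24, min_exponent=1):
    """Shift `fraction` left until its top bit is 1, lowering `exponent` to match.

    `fraction` is a non-negative integer that fits in `width` bits.
    Returns (fraction, exponent, is_subnormal).
    """
    if fraction == 0:
        return 0, exponent, False          # zero has no leading 1 to align
    leading_zeros = width - fraction.bit_length()
    # The exponent cannot drop below its lowest value, so the shift may be cut short.
    shift = max(0, min(leading_zeros, exponent - min_exponent))
    shifted = fraction << shift
    is_subnormal = shifted.bit_length() < width   # top bit is still 0
    return shifted, exponent - shift, is_subnormal

print(normalize(1 << 20, 10))   # enough exponent range: fully normalized
print(normalize(1 << 20, 2))    # exponent bottoms out: result stays subnormal
```

In the second call the exponent reaches its lowest value after a single shift, so the result retains two leading zeros and therefore fewer significant bits.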
In designing the hardware and logic for performing floating point arithmetic operations in conformance with this standard, it is necessary and desirable to incorporate certain additional indicator bits into the floating point hardware operations. These indicator bits are injected into the fraction field of the floating point number, and are used by the arithmetic control logic to indicate when certain conditions exist in the floating point operation. In non-subnormal (normalized) numbers, for example, an “implicit” bit (generally referred to as the “hidden bit”) is created by the arithmetic control logic when the exponent of the floating point number has a nonzero value. This “hidden bit” is not represented in the storage format, but is assumed. It is inserted at the time a floating point number is loaded into the arithmetic registers and occupies the most significant bit position of the fraction field of the number. During addition, a single “guard” bit is set by the floating point control logic during certain arithmetic operations, as an indicator of the loss of significant bits of the floating point number being processed. The guard bit is set when a right shift, required for normalization, shifts a bit off the right side of the fraction field. The guard bit occupies a portion of the fraction field. Finally, a “sticky” bit is set in certain floating point arithmetic operations as an indicator that the floating point number has lost some significant bits.
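As an illustration of the hidden bit, the following Python sketch models loading a 32-bit single precision number into a wider working significand. The names `load_significand` and `magnitude_of` are hypothetical, and the sketch assumes finite inputs only (it does not treat infinities or NaNs).

```python
import struct
from fractions import Fraction

def load_significand(x):
    """Return (biased_exponent, 24-bit significand) for x as 32-bit single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    hidden = 1 if exponent != 0 else 0     # implied, never stored
    return exponent, (hidden << 23) | fraction

def magnitude_of(x):
    """Rebuild the exact magnitude of a finite single precision value from its fields."""
    exponent, significand = load_significand(x)
    effective = exponent if exponent != 0 else 1   # subnormals use the lowest exponent
    return Fraction(significand) * Fraction(2) ** (effective - 127 - 23)

print(load_significand(1.0))
print(magnitude_of(1.5))
```

For a subnormal input the stored exponent is zero, no hidden bit is inserted, and the leading bit of the working significand remains 0.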
These extra bits in the fraction field are used exclusively for rounding operations, after the result has been normalized. The guard bit is treated as if it is a part of the fraction and is shifted with the rest of the fraction during normalization and exponent alignment and is utilized by the arithmetic. The sticky bit is not shifted with the fraction, but is utilized by the arithmetic. It acts as a “catcher” for bits shifted off the right of the fraction; when a 1 is shifted off the right side of the fraction, the sticky bit will remain a 1 until normalization and rounding are finished.
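One common way to model the guard and sticky bits in software is sketched below. The function name `shift_right_with_guard_sticky` is hypothetical, and the convention used (the guard bit is the last bit shifted off, and the sticky bit is the logical OR of every bit shifted off before it) is an assumption of this sketch rather than a description of any specific device.

```python
def shift_right_with_guard_sticky(significand, shift):
    """Shift a non-negative integer right; return (kept, guard, sticky)."""
    if shift == 0:
        return significand, 0, 0
    kept = significand >> shift
    guard = (significand >> (shift - 1)) & 1          # most significant lost bit
    lower_lost = significand & ((1 << (shift - 1)) - 1)
    sticky = 1 if lower_lost else 0                   # "catches" any lower 1 bit
    return kept, guard, sticky

# 0b101101 shifted right by 3 loses the bits 101: guard = 1, sticky = 1.
print(shift_right_with_guard_sticky(0b101101, 3))
```

Once set, the sticky bit records only that something nonzero was lost, not how much.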
There are typically four modes of rounding, as follows: (1.) round to nearest; (2.) round to positive infinity; (3.) round to negative infinity; and (4.) round to zero. Each of these may introduce error into the calculation.
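The four rounding modes can be illustrated with a short Python sketch that operates on the magnitude of a significand together with a guard bit and a sticky bit. The function name `round_significand` and the mode strings are hypothetical. The sketch assumes that ties in round to nearest are broken toward an even significand, and it does not handle the renormalization needed when the increment carries out of the field.

```python
def round_significand(kept, guard, sticky, negative, mode):
    """Round the magnitude `kept` using the guard and sticky bits; return the new magnitude."""
    inexact = bool(guard or sticky)
    if mode == "nearest":
        # Round up when more than halfway, or exactly halfway and `kept` is odd.
        increment = bool(guard and (sticky or (kept & 1)))
    elif mode == "toward_positive":
        increment = inexact and not negative
    elif mode == "toward_negative":
        increment = inexact and negative
    elif mode == "toward_zero":
        increment = False                  # simply truncate
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return kept + 1 if increment else kept

for mode in ("nearest", "toward_positive", "toward_negative", "toward_zero"):
    print(mode, round_significand(0b101, 0, 1, False, mode))
```

In every mode except the exact case (guard and sticky both 0), the returned value differs from the true result, which is the error referred to above.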
Though this standard is widely used and is useful for many operations, this standard defines “precision” as the maximum number of digits available for the significand of the real number representation and does not define precision as the number of correct digits in a real number representation. Neither does this standard provide for the calculation and storage of error information and therefore permits propagation of error including the potential loss of all significant bits. These problems in the current standard can lead to substantial accumulated rounding error and catastrophic cancellation error. Cancellation occurs when closely similar values are subtracted, and it injects significant error without a corresponding indication of this error in the result.
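A small Python example, offered only as an illustration, shows cancellation in 64-bit double precision. The helper `relative_error` is hypothetical; exact rational arithmetic from the `fractions` module is used as the reference.

```python
from fractions import Fraction

def relative_error(computed, exact):
    """Exact relative error of a float `computed` against a rational `exact`."""
    return abs(Fraction(computed) - exact) / abs(exact)

x = 1e-15
computed = (1.0 + x) - 1.0          # mathematically this is just x
exact = Fraction(1, 10**15)
err = relative_error(computed, exact)

print(computed)        # 1.1102230246251565e-15
print(float(err))      # roughly 0.11, i.e. about an 11% error
```

The addition rounds `1.0 + x` to the nearest representable value, and the subtraction then exposes that rounding error as a large relative error. The result is an ordinary double with a full 53-bit significand and nothing to indicate that only about one decimal digit is correct.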
Various authors have contributed to the standard or noted these significant problems, but the problems persist.
U.S. Pat. No. 3,037,701 to Sierra issued in 1962 establishes the basis for hardware to perform fixed word length floating point arithmetic including normalization, rounding, and zero conversion. The Sierra patent describes the potential for introducing error in floating point operations including total loss of useful information. No method is described for calculating or retaining error information of any type.
In 2010, in their book Handbook of Floating-Point Arithmetic, Muller et al. describe the state of the art in the application of floating point, including ISO/IEC/IEEE 60559:2011, and describe error problems. They state, “Sometimes, even with a correctly implemented floating-point arithmetic, the result of a computation is far from what could be expected.”
In 1991, David Goldberg, in “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” provides a detailed description and mathematical analysis of floating point error. This paper describes rounding error (p. 6), relative error and error units in the last place (Ulps) (p. 8), the use of guard digits (p. 9), and cancellation error types, both catastrophic and benign (p. 10). Recommended error mitigation is limited to extending precision (again defined as digits available for real number representation) requiring additional storage space for computational results (p. 17) and numerical error analysis of a given problem to determine the method of computation to minimize and limit the error introduced by the computation.
Thus, many authors have acknowledged the existence of these types of errors in the current standard for floating point operations. In response, numerous attempts to address these significant problems have been made.
In 2012 in the article “Floating-Point Numbers with Error Estimates,” Glauco Masotti describes adding a data structure to standard floating point format to contain statistical estimates of the accumulated floating point error. This technique increases required storage space, adds computation time, and does not provide bounds for the error.
In 2008 in “The Pitfalls of Verifying Floating-Point Computations,” David Monniaux presents the limitations of static program analysis in determining the expected error generated by code that performs a sequence of floating point operations. However, static error analysis is itself prone to error and relies on a lengthy and expensive error analysis of the algorithm to ensure that the algorithm will provide sufficiently accurate results.
In summary, the current state-of-the-art does not retain error information within the associated floating point data structure. At present, any retention of bounds on floating point error requires significantly more memory space and computation time (or correspondingly more hardware) to perform error interval computations.
Further, in the current standard, when two values are compared by subtraction and cancellation occurs, program flow decisions based on the erroneous comparison can be incorrect. The standard provides no indication of the validity of the resulting comparison.
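The following Python sketch illustrates such a comparison in 64-bit double precision. The function name `compare_by_subtraction` is hypothetical, and the decimal values 0.1, 0.2, and 0.3 are chosen only because their sum relationship is exact in decimal but not in binary.

```python
from fractions import Fraction

def compare_by_subtraction(a, b):
    """Return 1, 0, or -1 according to the sign of the floating point difference a - b."""
    difference = a - b
    return (difference > 0) - (difference < 0)

# Mathematically 0.1 + 0.2 equals 0.3, so the comparison should report equality.
float_result = compare_by_subtraction(0.1 + 0.2, 0.3)
exact_difference = (Fraction(1, 10) + Fraction(2, 10)) - Fraction(3, 10)

print(float_result)        # 1: the floating point comparison reports "greater than"
print(exact_difference)    # 0: the exact values are equal
```

A branch taken on `float_result` would follow the "greater than" path, and nothing in the difference itself signals that its sign is an artifact of rounding.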
Importantly, the standard provides no indication when the result of a computation no longer provides a sufficient number of significant digits.
Additionally, conversion from external to internal format or conversion between floating point formats may inject an error in the initial representation of a real number without recording that error.
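As an illustration, the decimal value 0.1 has no exact binary representation, so an error is introduced as soon as it is converted to internal format, and a larger one when it is converted to a narrower format. The helper `to_binary32` in this Python sketch is hypothetical; it rounds a double to 32-bit single precision by packing and unpacking it with the `struct` module.

```python
import struct
from fractions import Fraction

def to_binary32(x):
    """Round a Python float (double) to 32-bit single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

decimal_input = Fraction(1, 10)      # the value the user actually wrote
as_double = 0.1                      # conversion from external to internal format
as_single = to_binary32(as_double)   # conversion between floating point formats

double_error = Fraction(as_double) - decimal_input
single_error = Fraction(as_single) - decimal_input

print(float(double_error))   # about 5.6e-18
print(float(single_error))   # about 1.5e-9
```

Neither `as_double` nor `as_single` carries any record of these initial errors.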
Further, floating point values are converted to external representation without indication of loss of significant bits even if no significant bits remain in the output data.
Notably, current technology does not allow programmers to specify the number of significant digits that must be retained.
Thus, the various methods provided by the current art for floating point error mitigation have unresolved problems. Accordingly, there is a need for an apparatus and method for calculating and retaining a bound on error during floating point operations.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.