The present invention generally relates to performing floating point operations in a floating point unit of a microprocessor, and more specifically to a method for computing correctly rounded function values f(x) of double-precision floating-point numbers within the floating point unit of a microprocessor.
Much software is written using numbers which are represented in a floating-point arithmetic format. FIG. 1 illustrates a format 10 for representing real numbers defined by the standard referred to as IEEE 754-1985. The representation 10 includes a sign bit 11, an 11-bit exponent field 12, and a 52-bit mantissa field 13. Let s, e and n respectively represent the values of the sign, exponent, and mantissa fields, where each of these values is treated as an unsigned binary number. The format 10 represents a number if e is not 2047. If the exponent e is equal to 2047, the format represents an infinity or a Not-a-Number (NaN) as defined by the IEEE standard. If e and n are both 0, the number represented is the value 0. If e is 0 and n is not zero, the number represented is (−1)^s·2^(−1074)·n. If e is not zero, the number represented is (−1)^s·2^(e−1023)·(1+2^(−52)·n).
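The decoding rules above can be sketched in a short illustrative routine. This is not part of the invention; the helper name decode_double and the use of Python are assumptions chosen only to make the field rules concrete:

```python
import struct

def decode_double(x):
    """Decode a float into the (s, e, n) fields of the FIG. 1 format
    and reconstruct its value from the rules stated in the text."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    s = bits >> 63                 # 1-bit sign field 11
    e = (bits >> 52) & 0x7FF       # 11-bit exponent field 12
    n = bits & ((1 << 52) - 1)     # 52-bit mantissa field 13
    if e == 2047:
        # infinity or NaN, as defined by the IEEE standard
        if n == 0:
            return float("-inf") if s else float("inf")
        return float("nan")
    if e == 0:
        # zero or subnormal: (-1)^s * 2^(-1074) * n
        return (-1) ** s * 2 ** -1074 * n
    # normal: (-1)^s * 2^(e-1023) * (1 + 2^(-52) * n)
    return (-1) ** s * 2 ** (e - 1023) * (1 + 2 ** -52 * n)
```

Round-tripping a few values through the bit fields reproduces them exactly, including a subnormal such as 2^(−1074).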
While many real numbers can be represented using the format 10, many arithmetic operations on two representable numbers can produce a result which is not any of the representable numbers (i.e., the number of bits in the exact result exceeds the number of bits in the double precision format of FIG. 1). For example, the fraction ⅓ cannot be represented to infinite accuracy in the finite number of bits present in the format of FIG. 1. In such cases, a floating-point arithmetic operation usually returns a representable number which differs from the exact result by performing a rounding operation so that the result may be represented in the finite-sized format of FIG. 1. The IEEE 754-1985 standard defines four rounding modes (round to nearest, round toward zero, round toward plus infinity and round toward minus infinity) and requires that a standard result be obtained in rounding each possible exact real number to a representable real number for the operations of addition, subtraction, multiplication, division and square root. Most computer processor manufacturers implement the standard for double precision, and as a result, arithmetic results from these operations should be identical across a wide range of software platforms, offering interoperability. In addition, in round-to-nearest mode, the rounding produces the representable number closest to the exact result, maintaining a high degree of accuracy. For the operations of addition, subtraction, multiplication, division and square root, the ability to obtain correctly rounded results offers both portability and accuracy.
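As an illustrative sketch (assuming a Python environment, with exact rational arithmetic supplied by the standard fractions module), the following verifies that the double nearest to ⅓ is not exactly ⅓, yet in round-to-nearest mode differs from it by at most half a unit in the last place:

```python
from fractions import Fraction

# 1/3 has no finite binary expansion, so the stored double differs
# from the exact value; round-to-nearest picks the closest
# representable number of the FIG. 1 format.
exact = Fraction(1, 3)
rounded = Fraction(1 / 3)      # the double produced by the division
assert rounded != exact        # the stored double is not exactly 1/3

# for doubles in [1/4, 1/2) the spacing (ulp) is 2^-54, so a
# correctly rounded result is within half of that of the exact value
error = abs(rounded - exact)
assert error <= Fraction(1, 2 ** 55)
```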
There is at present no standard ensuring portability and accuracy for the evaluation of correctly rounded transcendental functions of numbers which are represented in double precision format. Nobody has been able to determine, for a given function f(x), how accurately an approximate value must be computed so that it is possible to round it correctly in all double precision cases for that function. There are too many double precision numbers to check each one individually by sequential brute force computation, whereas it is possible to check all single precision arguments by brute force computation to ensure proper rounding in all cases for each f(x). For example, to process all double precision results f(x) in a brute force manner so as to fix the precision of an algorithm at the optimal point that ensures correct rounding with optimal performance for all double precision operands x, one would need to perform about 2^63 evaluations. Using current computing power, these computations would take roughly 292,000 years for each f(x) implemented by a CPU architecture, assuming an optimistic rate of one f(x) evaluation per microsecond. Clearly, there is no viable method, using today's technology, to brute force process all double precision numbers for every f(x) of a CPU to ensure algorithm optimization that rounds correctly for all x processed by f(x).
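The time estimate above follows from simple arithmetic; the short calculation below merely reproduces the back-of-envelope figure quoted in the text:

```python
# Brute-force cost: 2^63 evaluations at one evaluation per microsecond.
evaluations = 2 ** 63
seconds = evaluations / 1e6                # 1 microsecond each
years = seconds / (365.25 * 24 * 3600)
assert 2.9e5 < years < 3.0e5               # roughly 292,000 years per f(x)
```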
To at least guarantee portability, the Java language requires that certain functions return values computed using certain algorithms that are known to return incorrectly rounded results for a large number of arguments. This requirement virtually requires that a compliant implementation of the function use exactly that algorithm, since there is no other practical way to get the specific wrong values on each of the arguments that returns a wrong value. The problem of determining the set of arguments for which it is necessary to get the wrong value is essentially as hard as the problem of finding how much accuracy is needed in order to ensure proper rounding in all cases.
Interval arithmetic is a form of calculation in which the operands of a function are typically intervals between representable numbers and the result of a function is an interval between representable numbers which is guaranteed to be large enough to contain the exact result if each operand is independently drawn from anywhere inside its interval. A long series of interval arithmetic operations usually results in continually growing intervals, as uncertainties in one result lead to even larger uncertainties in results based on them. The interval sizes grow more rapidly than necessary if functions cannot be rounded correctly, since the resulting interval must be large enough to also include the possibility that rounding was performed incorrectly. Correct rounding is therefore important to interval arithmetic.
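A minimal sketch of one interval operation illustrates the growth described above. This is an assumed toy implementation, not any particular interval library or the invention itself; it conservatively widens each result by one representable step on each side to cover possible rounding error (math.nextafter requires Python 3.9+):

```python
import math

def interval_add(a, b):
    """Add two intervals (lo, hi), widening the result outward by one
    representable step on each side to account for rounding error."""
    lo = math.nextafter(a[0] + b[0], -math.inf)  # widen downward
    hi = math.nextafter(a[1] + b[1], math.inf)   # widen upward
    return (lo, hi)

x = (1.0, 1.0)              # an exactly known starting value
for _ in range(3):
    x = interval_add(x, x)  # repeated operations grow the interval

# the exact result 8 is contained, but the interval is now wider
# than a single point: uncertainty has accumulated
assert x[0] < 8.0 < x[1]
```

If a function inside such a chain cannot be rounded correctly, each resulting interval must be widened further still, which is why correct rounding matters to interval arithmetic.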
Shmuel Gal and Boris Bachelis teach a method which is more likely to return a correctly rounded value than other prior art methods that do not address the rounding problem. FIG. 2 illustrates the fundamental principle taught by Gal and Bachelis. It is well known that all but a few arguments of one of the transcendental functions will have a function value which is not representable exactly in a double precision format. In the Gal and Bachelis method, a function value is computed as the sum of two parts, such that the sum carries more precision than is available in a single double precision number. FIG. 2 illustrates the Gal and Bachelis method for the case of a round-to-nearest mode of operation. Almost all computed function values will lie between a representable number 20 and the next higher representable number 21. The real number 22 (not representable in the format of FIG. 1 since it requires an additional ½ least significant bit (LSB)) lying halfway between the representable numbers 20 and 21 is the critical number for rounding properly in the round-to-nearest mode. If the computed function value f(x) for a value x lies above the line 22, f(x) is rounded to the representable number 21, whereas if the computed f(x) lies below the line, f(x) is rounded to the representable number 20 for the round-to-nearest mode. Each computed function value (e.g., 23 or 24 in FIG. 2) has an error associated with it, and FIG. 2 illustrates a number of error bars for various function values f(x), including the specific values labeled 23 and 24 in FIG. 2. The central value of each error bar is the computed value, and the error bars indicate a guaranteed bound on the error of the computation that is a function of the precision of the f(x) computational algorithm used. FIG. 2 is not meant to suggest that all values f(x) lie between the same two representable binary numbers. Rather, FIG. 2 illustrates the superposition of each function value f(x) between the two different representable numbers which surround it.
Gal and Bachelis teach that if the entire error bar lies above or below the line 22, then the rounding operation described above is guaranteed to be correct. As an example, the error bar specifically labeled as bar 23 in FIG. 2 will give the correctly rounded function value 21, since the error bar 23 does not overlie or cross over the midpoint 22. However, the error bar 24, which crosses the line 22 in FIG. 2, might give an incorrectly rounded function value, since if the exact value were truly below the line 22 yet still within the error bar, the result would be incorrectly rounded up because the computed value is above the line 22 in FIG. 2 (opposite to the true location of the exact value). In essence, a computed value close to the midpoint or rounding point 22, coupled with a finite range of possible error in the precision of the computation, could result in improper rounding. Therefore, if the error bar is detected to be crossing the critical rounding line 22, a more accurate computation algorithm must be used in order to attempt to get accurate rounding. Gal and Bachelis describe computing a more accurate approximation in the case where the error bar crosses the rounding point 22, which is likely, but not guaranteed, to give a correctly rounded answer.
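The midpoint test described above can be sketched as follows. This is a hypothetical illustration, not Gal and Bachelis's actual code: the function value is assumed to be carried as a high part hi (taken here to be the representable number 20 lying below the value) plus a small nonnegative low part lo, with a guaranteed error bound err_bound on the computation:

```python
import math

def needs_recomputation(hi, lo, err_bound):
    """Return True if the error bar around the computed value hi + lo
    touches or crosses the round-to-nearest midpoint (line 22 of FIG. 2),
    so that a more accurate recomputation would be required."""
    # half the gap between hi and the next representable double
    midpoint_offset = 0.5 * (math.nextafter(hi, math.inf) - hi)
    # distance of the computed value from the midpoint
    distance = abs(lo - midpoint_offset)
    return distance <= err_bound

# an error bar well clear of the midpoint (like bar 23) is safe
assert not needs_recomputation(1.0, 2 ** -60, 2 ** -70)
# an error bar straddling the midpoint (like bar 24) is not
assert needs_recomputation(1.0, 2 ** -53, 2 ** -55)
```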
While Gal and Bachelis can reduce the probability of getting an incorrectly rounded f(x), it is never known in advance how many bits of precision will ultimately be needed to get the correctly rounded value of f(x) for all x within the entire double precision range (e.g., Gal and Bachelis cannot determine for each f(x) whether 10, 100, 1000 or more bits of precision are truly needed in order to resolve the incorrect rounding problem for each f(x)). Gal and Bachelis decrease the probability of an incorrect result by arbitrarily adding some number of extra bits of precision, which may or may not actually fix the incorrect rounding problem. The probability of an incorrect rounding is decreased by roughly a factor of ½ for each additional bit of precision added to the algorithm. Thus, reducing the probability of incorrect rounding by a factor of one billion would require approximately 30 extra bits of precision. There are two difficulties which result from the arbitrary precision extension approach of Gal and Bachelis.
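The factor-of-one-billion figure follows directly from halving the probability once per added bit, as a quick check shows:

```python
# Each extra bit of working precision roughly halves the probability
# that the error bar still crosses the rounding line, so k extra bits
# reduce the failure probability by about a factor of 2^k.
extra_bits = 30
reduction = 2 ** extra_bits
assert reduction > 10 ** 9   # 2^30 is about 1.07 billion
```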
First, the Gal and Bachelis method usually requires many additional bits beyond those actually needed to ensure correct rounding. In other words, Gal and Bachelis may predict that a rounding operation is too close to the rounding point for the current precision and increase the algorithm's precision fivefold where a simple increase of a few bits would have sufficed to solve the rounding problem. This overcompensation of precision is needed just to reduce the probability that they guessed wrong about how many bits were actually needed for the second, more accurate computation. Therefore, even when Gal and Bachelis do arrive at correct rounding, they may have wasted significant computing power in overkill operations in order to get to that result.
Second, even after adding many additional bits to the precision of the algorithm, there is a remaining probability that not enough bits were added, whereby incorrect rounding results in spite of the significant resources expended to avoid such an outcome. The first difficulty discussed above means that either a hardware or software implementation will likely be more expensive than it needs to be to obtain proper rounding, while the second difficulty means that their "solution" may not be sufficient to get the correct result for all arguments x in the double precision range. In effect, the Gal and Bachelis solution is guesswork and is not guaranteed to produce correct rounding with the minimal possible overhead.
Similar arguments can be made for the Gal and Bachelis solution when the other rounding modes are used (e.g., round toward plus infinity, round toward zero, etc.). For these rounding modes, the only difference is that the rounding line would then be at a representable number, like 20 or 21, instead of at the midway ½ LSB point 22.
Therefore, the prior art is likely to do too much work and still have a chance of producing the wrong answer, whereby a need exists in the industry for a method that guarantees that each floating point result f(x), for all x in the double precision range, will always round properly without undue overhead or overcompensation.