1. Technical Field
The invention relates generally to arithmetic processing systems, and more particularly relates to look-up tables used to obtain function values, such as seed values for iterative refinement division or square root.
In an exemplary embodiment, the invention is used to provide table look-up of reciprocal seed values for a conventional multiplicative division implementation.
2. Related Art
Floating point units (and other arithmetic processors) commonly use multiplier based algorithms for division. These division algorithms initially employ a seed reciprocal of the divisor.
The seed reciprocals have a selected number of bits of accuracy. Iterative multiplies are performed to iteratively increase the accuracy of the reciprocal approximation until a final quotient value of predetermined accuracy can be obtained.
The seed reciprocals are typically obtained from a ROM reciprocal look-up table, or equivalent PLA (programmed logic array). The number of table input index bits and table output bits of the seed reciprocals determines the size of the look-up table. More input bits allow more bits of accuracy in the seed reciprocals, which reduces the necessary number of iterative multiply cycles and hence the division time, albeit at the cost of exponential growth in the reciprocal table size.
Without limiting the scope of the invention, this background information is provided in the context of a specific problem to which the invention has application: for a floating point unit design, achieving reciprocal table compression to improve the design trade-off between division time (i.e., the number of necessary iterative multiply cycles) and reciprocal table size. A collateral design problem is establishing the minimum table size required for a desired accuracy in reciprocal table output.
Current day designs of IEEE standard floating point units for PC's and workstations generally have substantial design effort and chip area devoted to providing a multiplier with at most a couple of cycles latency. In addition to special function computations, this multiplier resource is typically exploited to obtain faster division. Moreover, if multiplicative division is used, it is common to also use a multiplicative square root algorithm.
Newton-Raphson, convergence, prescaled, and short reciprocal division are multiplier based iterative division algorithms that have been employed in recent floating point unit implementations. These multiplicative division algorithms each provide speedups by factors of 2 to 4 over the traditional shift-and-subtract iterative division algorithms such as non-restoring and SRT. However, for these multiplicative division algorithms, note that division time relates to multiplier latency, not throughput, because pipelining through the multiplier cannot be employed for the iterative dependent multiplications required for reciprocal refinement.
Each of the multiplicative division algorithms initially employs a seed (approximate) reciprocal of the divisor obtained (such as by table look-up) with a certain precision measured by the number of bits of accuracy. In general, the precision of an approximate reciprocal of an input argument measured in bits of accuracy is given by the negative base 2 logarithm of the relative error between the approximate reciprocal and the infinitely precise reciprocal of the input argument.
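As a small illustration of this measure, the following Python sketch evaluates the bits of accuracy of an approximate reciprocal as the negative base 2 logarithm of its relative error. The function name `precision_bits` and its parameters are hypothetical and are not taken from the source.

```python
import math

def precision_bits(approx, x):
    """Bits of accuracy of `approx` as an estimate of 1/x (hypothetical helper)."""
    exact = 1.0 / x
    rel_err = abs(approx - exact) / exact
    if rel_err == 0.0:
        return math.inf  # an exact reciprocal has unbounded precision
    return -math.log2(rel_err)

# 0.75 approximates 1/1.25 = 0.8 with relative error 1/16, i.e. about 4 bits.
print(precision_bits(0.75, 1.25))
```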
To illustrate the effect of the accuracy of the seed reciprocal on the division time of the multiplicative division algorithm, consider Newton-Raphson division, which uses a reciprocal refinement algorithm that converges quadratically. Initially a seed (approximate) reciprocal of the divisor is obtained with a certain precision measured by the number of bits of accuracy. Iterative multiplier operations are used to compute successive refined reciprocal approximations of the divisor to a desired precision. Specifically, each Newton-Raphson iteration effectively doubles the precision of the reciprocal approximation, i.e., doubles the number of accurate bits. Thus, the precision of the seed reciprocal directly determines the number of such iterations needed to obtain a final reciprocal approximation with a desired precision.
The final reciprocal approximation is then multiplied by the dividend to produce an approximate quotient. The quotient is multiplied by the divisor and subtracted from the dividend to obtain a corresponding remainder. Note that no part of the quotient is generated until the multiplication of the dividend by the refined reciprocal in the last step.
The Newton-Raphson division algorithm for $Q = N/D$ is:
1. Initialize: $x_0 \approx 1/D$
2. Iterate: $x_{i+1} = x_i \times (2 - D \times x_i)$
3. Final: $Q = x_{last} \times N$, $R = N - (Q \times D)$
Each $x_i \approx 1/D$ can be written as $x_i = (1/D)(1 + \epsilon)$, such that $D \times x_i = 1 + \epsilon$, where $\epsilon$ is the relative error in the approximation (assumed to be small). Thus, in the iterative step, $2 - D \times x_i = 1 - \epsilon$, so that, for the next iteration,
$$x_{i+1} = \frac{1}{D}(1 + \epsilon)(1 - \epsilon) = \frac{1}{D}(1 - \epsilon^2).$$
That is, each iteration squares the relative error, so the precision as measured by the number of accurate bits doubles.
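The three steps of the algorithm can be sketched in Python as follows. This is a minimal floating point illustration, not a hardware implementation; the function name `newton_raphson_divide` and its parameter names are assumptions made for the example.

```python
def newton_raphson_divide(n, d, x0, iterations):
    """Approximate Q = n/d and R = n - Q*d by reciprocal refinement.

    x0 is the seed reciprocal (x0 ~ 1/d); `iterations` is the number of
    refinement steps. All names here are hypothetical.
    """
    x = x0
    for _ in range(iterations):
        # Two dependent multiplications per iteration: d*x, then x*(2 - d*x).
        x = x * (2.0 - d * x)
    q = x * n        # quotient appears only in this final multiplication
    r = n - q * d    # corresponding remainder
    return q, r

# Usage: divide 1.5 by 1.25 starting from the single-value seed 2/3.
q, r = newton_raphson_divide(1.5, 1.25, 2.0 / 3.0, 6)
print(q, r)
```

Note that in double precision floating point the computed remainder reflects rounding in the multiplications as well as the residual error of the reciprocal.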
Consider the application of this algorithm to obtain the double precision (53 bits) quotient $Q$, where $N$ and $D$ are the double precision dividend (numerator) and divisor, respectively. $N$ and $D$ are normalized ($1 \le N, D < 2$), such that the reciprocal of $D$ must fall in the interval $(1/2, 1]$, where "( )" indicates an exclusive bound and "[ ]" indicates an inclusive bound.
If a single value of the seed reciprocal $x_0 \approx 1/D$ is used for any $1 \le D < 2$, then $x_0 = 2/3$ is the most accurate seed reciprocal, accurate to about 1.585 bits. With $x_0 = 2/3$, the Newton-Raphson division requires 6 iterations (12 multiplications) to attain the desired number of bits of accuracy for a double precision quotient (1.5, 3, 6, 12, 24, 48, 53+).
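The 1.585-bit figure can be checked with a short sketch. For a constant seed $x_0$, the relative error against $1/D$ is $|x_0 D - 1|$, whose largest value over $1 \le D < 2$ occurs at an endpoint of the interval. The helper name `constant_seed_precision` is hypothetical.

```python
import math

def constant_seed_precision(x0):
    """Worst-case bits of accuracy of a single seed x0 over 1 <= D < 2."""
    # |x0*D - 1| is piecewise linear and convex in D, so the supremum over
    # the interval is reached at D = 1 or in the limit D -> 2.
    worst = max(abs(x0 * 1.0 - 1.0), abs(x0 * 2.0 - 1.0))
    return -math.log2(worst)

# The two endpoint errors balance at x0 = 2/3, giving log2(3) bits.
print(constant_seed_precision(2.0 / 3.0))
```

Seeds on either side of 2/3 increase one of the two endpoint errors, which is why 2/3 is the best single value under this measure.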
Notice that it takes three iterations (6 multiplications) to increase the accuracy to over 7 bits. These initial iterations can be conveniently replaced by a look-up table to provide a seed reciprocal accurate to 7 bits, specifically requiring 7 leading bits of $D$ for table look-up to provide a seed reciprocal $x_0$ accurate to 7.484 bits. With this small reciprocal table having 128 entries, the algorithm requires only 3 iterations (6 multiplications) for a double precision quotient (7, 14, 28, 53+). Thus, using a seed reciprocal table of $2^7 \times 7$ bits = 896 bits, the number of multiplications is cut in half.
More bits of accuracy in the seed reciprocal further reduce the number of necessary multiply cycles. Consider the seed reciprocal $x_0$ to be accurate to 14 bits. Then the above algorithm requires only 2 iterations (4 multiply cycles) for a double precision quotient (14, 28, 53+). However, at least 14 leading bits of $D$ are needed for input to a conventional reciprocal table to provide a seed reciprocal accurate to 14 bits. Such a table requires $2^{14} \times 14$ bits = 230 Kbits, a size that is prohibitive for current technology.
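The trade-off in the preceding examples can be tabulated with a short sketch. It assumes an idealized model in which each iteration exactly doubles the number of accurate bits and ignores rounding in the multiplier; the function names `iterations_needed` and `table_size_bits` are hypothetical.

```python
def iterations_needed(seed_bits, target_bits=53):
    """Refinement iterations required under ideal precision doubling."""
    bits, count = seed_bits, 0
    while bits < target_bits:
        bits *= 2      # quadratic convergence: accurate bits double
        count += 1
    return count

def table_size_bits(k, m):
    """Size in bits of a conventional k-bits-in, m-bits-out table."""
    return (1 << k) * m

# (seed precision in bits, table index bits, table output bits)
for seed_bits, k, m in [(1.585, 0, 0), (7.484, 7, 7), (14.0, 14, 14)]:
    n_iter = iterations_needed(seed_bits)
    print(seed_bits, n_iter, 2 * n_iter, table_size_bits(k, m))
```

Each iteration costs two dependent multiplications, so the printed multiplication counts are twice the iteration counts.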
The limitation of conventional reciprocal tables is that increasing the accuracy of the seed reciprocal by one bit results in more than doubling the reciprocal table size. Because there are no obvious efficient techniques for improving the rate of convergence beyond the current quadratically converging multiplicative division algorithms, such as Newton-Raphson, the critical design trade-off is between table size (and therefore area) and division cycle time.
Table compression can be obtained by applying conventional interpolation techniques to the table output. However, interpolation has the disadvantage of requiring the added cost of a multiplication and/or addition to effect the interpolation [Fa 81, Fe 67, Na 87].
A collateral issue to table size is to specifically define the accuracy that can be obtained from a table of a given size --stated another way, for a desired accuracy of the seed reciprocal, the design problem is to determine what is the minimum table size. For current reciprocal table designs, rather than pursue the exhaustive investigation of minimum table size at the bit level, the design approach has often been to employ oversized tables.
The proper accuracy measure of a reciprocal table to be optimized depends on the division algorithm being implemented and the size and architecture of the multiplier employed. In general, two principal accuracy measures have been used for reciprocal tables: precision and units in the last place (ulps). In particular, if table output is guaranteed accurate to one ulp for all inputs, then the table is termed faithful. A third approach to measuring the accuracy of a reciprocal table is the percentage of inputs that yield round-to-nearest output.
Reciprocal tables are typically constructed by assuming that the argument is normalized, $1 \le x < 2$, and truncated to $k$ bits to the right of the radix point: $1.b_1 b_2 \ldots b_k$. These $k$ bits are used to index a reciprocal table providing $m$ output bits, which are taken as the $m$ bits after the leading bit in the $(m+1)$-bit fraction reciprocal approximation $0.1 b'_2 b'_3 \ldots b'_{m+1}$. Such a table will be termed a k-bits-in m-bits-out reciprocal table of size $2^k \times m$ bits.
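The indexing and output conventions just described can be sketched as follows. The helpers `table_index` and `decode_entry` are hypothetical names, and the use of Python floats for the argument is an assumption made only for illustration.

```python
def table_index(x, k):
    """Index formed by the k leading fraction bits b1..bk of x, 1 <= x < 2."""
    if not 1.0 <= x < 2.0:
        raise ValueError("argument must be normalized to 1 <= x < 2")
    # Truncation (not rounding) keeps exactly the bits b1..bk.
    return int((x - 1.0) * (1 << k))

def decode_entry(entry, m):
    """Value 0.1 b'2 ... b'(m+1) represented by an m-bit stored entry."""
    # The leading fraction bit is implicitly 1, so only m bits are stored.
    return ((1 << m) + entry) / (1 << (m + 1))

print(table_index(1.5, 3))        # x = 1.100 (binary) selects entry 4
print(decode_entry(0b101, 3))     # stored bits 101 denote 0.1101 (binary)
```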
Regarding the precision measure of table accuracy, the maximum relative error for any k-bits-in m-bits-out reciprocal table denotes the supremum of the relative errors obtained between $1/x$ and the table value for the reciprocal of $x$ for $1 \le x < 2$. The precision in bits of the table is the negative base two logarithm of this supremum. A table precision of $\alpha$ bits (with $\alpha$ not necessarily an integer) then simply denotes that the approximation of $1/x$ by the table value will always yield a relative error of at most $1/2^{\alpha}$. For Newton-Raphson (and other convergence division algorithms), the precision of the table determines the number of dependent (i.e., non-pipelined) multiplications needed to obtain a quotient of the desired accuracy.
The following Table gives the precision in bits of the k-bits-in m-bits-out reciprocal table for the most useful cases $3 \le k, m \le 12$, facilitating evaluating tradeoffs between table size and the number of reciprocal refinement iterations to achieve a desired final precision. This Table appears in [DM 94].
__________________________________________________________________________________
bits in/
bits out       3       4       5       6       7       8       9      10      11      12
__________________________________________________________________________________
    3      3.540   4.000   4.000   4.000   4.081   4.081   4.081   4.081   4.087   4.087
    4      4.000   4.678   4.752   5.000   5.000   5.000   5.042   5.042   5.042   5.042
    5      4.000   4.752   5.573   5.850   5.891   6.000   6.000   6.000   6.022   6.022
    6      4.000   5.000   5.850   6.476   6.790   6.907   6.950   7.000   7.000   7.000
    7      4.081   5.000   5.891   6.790   7.484   7.775   7.888   7.948   7.976   8.000
    8      4.081   5.000   6.000   6.907   7.775   8.453   8.719   8.886   8.944   8.974
    9      4.081   5.042   6.000   6.950   7.888   8.719   9.430   9.725   9.852   9.942
   10      4.081   5.042   6.000   7.000   7.948   8.886   9.725  10.443  10.693  10.858
   11      4.087   5.042   6.022   7.000   7.976   8.944   9.852  10.693  11.429  11.701
   12      4.087   5.042   6.022   7.000   8.000   8.974   9.942  10.858  11.701  12.428
__________________________________________________________________________________
Regarding the faithfulness measure of table accuracy, reciprocal table output is faithful if it is accurate to one ulp (unit in the last place), i.e. the table output always has less than one ulp deviation from the infinitely precise reciprocal of the infinitely precise input argument. The general measure of accuracy is the determination of the worst case error in ulps --although a sufficiently large number of input guard bits allows a worst case error bound approaching one half ulp, the useful and computationally tractable threshold of one ulp accuracy is a conventional standard for transcendental functions where infinitely precise evaluation is not tractable.
Regarding the faithfulness measure of accuracy, for both the prescale and short reciprocal division algorithms, the size (length in bits) of the reciprocal affects the size of the circuitry employing the reciprocal [BM 93, EL 94, Na 87, WF 91]. Many compelling arguments can be made in favor of providing that the final results of function approximation should both (a) satisfy a one ulp bound (faithfulness), and (b) uniformly attempt to maximize the percentage of input arguments that are rounded to nearest [AC 86, BM 93, FB 91, Ta 89, Ta 90, Ta 91]. One approach is to have the table result itself be the round-to-nearest value of the infinitely precise reciprocal, providing a useful metric for those platforms where a reciprocal instruction is provided in hardware. This requires that the table input be the full argument precision, which is currently prohibitive in table size even for single precision arguments (23 bits).
A robust reciprocal table construction algorithm that is appropriately optimal for each of the two principal accuracy measures, precision and faithfulness (ulp), is the midpoint reciprocal algorithm described in [DM 94]. The midpoint reciprocal methodology generates tables such that the relative error for each table entry is minimized, thereby uniformly maximizing table output precision. This table design methodology further generates minimum sized tables to guarantee faithful reciprocals for each table entry, and for faithful tables maximizes the percentage of input values obtaining round-to-nearest output.
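The source does not spell out the midpoint reciprocal construction, so the following sketch rests on an assumption: each entry is taken to be the reciprocal of the midpoint of its input interval, rounded to nearest on the $2^{-(m+1)}$ output grid. Under that assumption the table precision can be computed exactly from the interval endpoints. The names `midpoint_table` and `table_precision` are hypothetical.

```python
from fractions import Fraction
import math

def midpoint_table(k, m):
    """Assumed construction: entry n = RN(1 / midpoint of interval n)."""
    entries = []
    for n in range(1 << k):
        den = (1 << (k + 1)) + 2 * n + 1   # midpoint scaled by 2^(k+1)
        num = 1 << (k + m + 2)             # 2^(m+1) * 2^(k+1)
        q = (2 * num + den) // (2 * den)   # round to nearest; den is odd, so no ties
        # Simplification: if q reaches 2^(m+1) the value is 1.0, which the
        # m-bit stored format cannot hold; this sketch does not handle that.
        entries.append(Fraction(q, 1 << (m + 1)))
    return entries

def table_precision(k, m):
    """Precision in bits: -log2 of the supremum of |t*x - 1| over 1 <= x < 2."""
    worst = Fraction(0)
    for n, t in enumerate(midpoint_table(k, m)):
        lo = 1 + Fraction(n, 1 << k)       # left end of input interval
        hi = 1 + Fraction(n + 1, 1 << k)   # right end (supremum, not attained)
        worst = max(worst, abs(t * lo - 1), abs(t * hi - 1))
    return -math.log2(float(worst))

print(table_precision(3, 3), table_precision(3, 4))
```

For the 3-bits-in 3-bits-out and 3-bits-in 4-bits-out cases this reproduces the 3.540 and 4.000 entries of the precision table, which suggests the assumed construction is at least consistent with the published figures for those sizes.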
The midpoint reciprocal design methodology generates tables that have maximum table precision. For such k-bits-in m-bits-out tables, the design methodology generates a k-bits-in, k-bits-out table with precision at least $k + 0.415$ bits for any $k$. More generally, with $g$ guard bits, the $m = (k+g)$-bits-out table has precision at least $k + 1 - \log_2(1 + 1/2^{g+1})$ for any $k$. To determine extreme-case test data, and to compute the precision of a reciprocal table without prior construction of the full reciprocal table, the midpoint reciprocal design methodology only requires generation and inspection of a small portion of such a table to identify input values guaranteed to include the worst case relative errors in the table.
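The lower bound on precision can be evaluated directly. The sketch below uses a hypothetical helper, `precision_lower_bound`, and shows that the $g = 0$ case reduces to $k + 0.415$ and that the bound approaches $k + 1$ as guard bits are added.

```python
import math

def precision_lower_bound(k, g=0):
    """Lower bound k + 1 - log2(1 + 1/2^(g+1)) for a k-in, (k+g)-out table."""
    return k + 1 - math.log2(1 + 1 / 2 ** (g + 1))

# g = 0 gives k + 1 - log2(1.5), i.e. about k + 0.415.
for g in range(4):
    print(g, precision_lower_bound(7, g))
```

For comparison, the precision table lists 7.484 bits for 7 bits in, 7 bits out and 7.775 bits for the 7 and 8 bit case, both above the corresponding bounds of about 7.415 and 7.678.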
The precision and faithfulness (ulp) measures of lookup table quality, and the midpoint reciprocal algorithm for generating optimal conventional lookup tables regarding these metrics, establish a benchmark for the size and accuracy of conventional tables. This benchmark can be used in assessing the quality of any table compression methodology in terms of accuracy versus table size.