1. Field of the Invention
The preferred embodiments of the present invention are directed to microprocessors. More particularly, the preferred embodiments of the present invention are directed to floating point dividers in microprocessors.
2. Background of the Invention
Most modern microprocessors have the ability to perform floating point operations in hardware. Particularly difficult among these floating point operations are division and square root operations. There are two commonly used techniques to perform the division operation: the Sweeney-Robertson-Tocher (SRT) technique, and the Newton-Raphson technique. Newton-Raphson operations have the attractive property of quadratic convergence, meaning that the precision of the quotient result approximately doubles with each iteration. The SRT algorithm is a non-restoring method which iteratively subtracts divisor multiples from a working partial remainder to determine the quotient, with a fixed number of quotient digits determined in each iteration. This specification addresses floating point division using SRT type techniques.
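For purposes of illustration only, the quadratic convergence attributed to the Newton-Raphson technique can be sketched as follows. The function name, the starting estimate, and the use of exact rational arithmetic are assumptions made for this example; the sketch refines a reciprocal estimate and records the relative error after each iteration.

```python
from fractions import Fraction

def newton_raphson_reciprocal(d, x0, iterations):
    """Refine an estimate x of 1/d; return x and the relative errors 1 - d*x."""
    d = Fraction(d)
    x = Fraction(x0)
    errors = [1 - d * x]
    for _ in range(iterations):
        # Newton-Raphson update for f(x) = 1/x - d
        x = x * (2 - d * x)
        errors.append(1 - d * x)
    return x, errors

# Hypothetical operands: d = 1.5 with a crude starting estimate of 0.5
x, errors = newton_raphson_reciprocal(Fraction(3, 2), Fraction(1, 2), 4)
for e in errors:
    print(float(e))
```

Each recorded error is the square of the one before it, which is another way of saying that the number of correct bits roughly doubles per iteration.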
Floating point numbers in computers are generally represented in scientific notation. In more precise mathematical terms, the operands of a binary floating-point division are a dividend, represented by $(-1)^{S_a} \times F_a \times 2^{E_a - E_{bias}}$, and a divisor, given by $(-1)^{S_b} \times F_b \times 2^{E_b - E_{bias}}$, which produce a quotient given by $(-1)^{S_r} \times F_r \times 2^{E_r - E_{bias}}$, where $S$ is the sign bit, $F$ is the mantissa (with $1 \le F < 2$), and $E$ is the biased exponent, assuming Institute of Electrical and Electronics Engineers (IEEE) 754 standard normalized numbers. In order to accommodate both positive and negative exponents without requiring a sign bit for the exponent, each exponent is biased based on the precision of the variable. For single precision numbers, the IEEE 754 standard requires an $E_{bias}$ of 127; for double precision numbers, an $E_{bias}$ of 1,023.
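As a minimal sketch of this representation, the following splits a double precision number into its sign bit, biased exponent, and mantissa, then rebuilds the value from those fields. The function name is hypothetical, and the sketch assumes a normalized, nonzero operand (zeros, denormals, infinities, and NaNs are not handled).

```python
import struct

E_BIAS = 1023  # double precision bias

def decode_double(x):
    """Return (sign, biased_exponent, mantissa) for a normalized IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    # Normalized numbers carry an implicit leading 1, so 1 <= mantissa < 2
    mantissa = 1.0 + fraction / (1 << 52)
    return sign, biased_exponent, mantissa

s, e, f = decode_double(-6.5)
value = (-1) ** s * f * 2.0 ** (e - E_BIAS)
print(s, e, f, value)
```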
In broad terms, performing a floating-point division operation comprises the following steps:
TABLE 1
Step   Operation
1      $E_r = E_a - E_b + E_{bias}$ (find the power of the quotient)
2      $S_r = S_a \oplus S_b$ (exclusive-OR the sign bits of each mantissa to determine the sign of the division operation)
3      $F_r = F_a \div F_b$ (In SRT, performed by iterative subtraction of multiples of $F_b$ from a working partial remainder, which is then shifted up. The working partial remainder is initialized with $F_a$.)
4      Rounding of the fractional result to the most significant digit.
5      Normalize $F_r$ (that is, $1 \le F_r < 2$)
6      Detect underflow (if $E_r$ is less than one) or overflow (if $E_r$ is greater than 1,023 (for double precision) or 127 (for single precision))

While all these steps are required for each floating-point division operation, it is step three that is the primary concern of this specification.
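The steps of Table 1 can be sketched, under stated assumptions, as follows. The function names and the (sign, biased exponent, mantissa) triple are hypothetical conveniences; step 3 uses an ordinary division as a stand-in for the SRT iteration, and rounding and underflow/overflow detection (steps 4 and 6) are omitted.

```python
E_BIAS = 1023  # double precision bias

def divide_fields(a, b):
    """Divide two operands given as (sign, biased_exponent, mantissa) triples."""
    sa, ea, fa = a
    sb, eb, fb = b
    er = ea - eb + E_BIAS   # step 1: exponent of the quotient
    sr = sa ^ sb            # step 2: sign of the quotient
    fr = fa / fb            # step 3: stand-in for the SRT iteration
    # step 5: with 1 <= fa, fb < 2 the ratio lies in (1/2, 2),
    # so at most one left shift is needed to normalize
    if fr < 1.0:
        fr *= 2.0
        er -= 1
    return sr, er, fr

def to_value(fields):
    """Rebuild the numeric value from a (sign, biased_exponent, mantissa) triple."""
    s, e, f = fields
    return (-1) ** s * f * 2.0 ** (e - E_BIAS)

# 6.0 / -4.0, expressed as fields: 1.5 x 2^2 and -(1.0 x 2^2)
result = divide_fields((0, 1025, 1.5), (1, 1025, 1.0))
print(result, to_value(result))
```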
The SRT type algorithms iteratively subtract divisor multiples from a partial remainder to determine the quotient, with a fixed number of quotient bits determined in each iteration or cell, according to the following equations:

$$P_{i+1} = R\,(P_i - Q_{i+1} D) \qquad (1)$$

$$P_i^* = \mathrm{trunc}(P_i) \qquad (2)$$

where $P_{i+1}$ is the calculated partial remainder within the cell (in carry-save form), $P_i$ is the partial remainder from the previous iteration (with the initial partial remainder set equal to the dividend), $R$ is the radix of a quotient digit, $Q_{i+1}$ is the quotient digit at iteration $i+1$, $D$ is the divisor, and $P_i^*$ is an estimated partial remainder where $P_i^* \le P_i$.
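The recurrence of equation (1) can be illustrated with a short sketch. The digit set {-2, ..., 2}, the function names, and the idealized selection rule (exact comparisons against half-multiples of the divisor, rather than a look-up on an estimated partial remainder) are assumptions made for this example and are not taken from any particular implementation.

```python
from fractions import Fraction

RADIX = 4

def select_digit(p, d):
    """Idealized Radix-4 selection: nearest digit to p/d within {-2, ..., 2}."""
    for q in (2, 1, 0, -1):
        if 2 * p >= (2 * q - 1) * d:
            return q
    return -2

def srt_divide(dividend, divisor, cells):
    """Apply equation (1) for the given number of cells; return (quotient, remainder)."""
    p = Fraction(dividend)   # initial partial remainder is the dividend
    d = Fraction(divisor)
    quotient = Fraction(0)
    weight = Fraction(1)
    for _ in range(cells):
        q = select_digit(p, d)
        p = RADIX * (p - q * d)   # equation (1)
        quotient += q * weight
        weight /= RADIX
    return quotient, p

# Hypothetical mantissas in [1, 2)
quotient, remainder = srt_divide(Fraction(3, 2), Fraction(5, 4), 10)
print(float(quotient))
```

Because the quotient digits may be negative, an overestimate in one cell is corrected by later cells; after n cells the quotient differs from the exact ratio by the final partial remainder divided by the divisor times 4^n.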
In the SRT type algorithms, the precision of the quotient is dependent upon the number of iterations performed (the number of subtractive division cells the calculation propagates through) and the radix of the calculation. In a Radix-4 calculation, two quotient bits are calculated with each cell or stage. With Radix-8, three quotient bits are calculated with each cell, and so on. The more cells that can complete within a given time, the more precisely the quotient can be calculated, or the less time is required to carry the calculation to full precision. Thus, speed of the cells is of paramount importance.
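The relationship between radix and cell count can be made concrete with a small calculation. The function name is hypothetical, and the sketch counts only the cells needed to cover a given number of quotient bits; any extra guard or rounding bits a particular design may require are ignored.

```python
def cells_needed(quotient_bits, radix):
    """Number of cells needed to produce quotient_bits at log2(radix) bits per cell."""
    bits_per_cell = radix.bit_length() - 1
    if radix < 2 or (1 << bits_per_cell) != radix:
        raise ValueError("radix must be a power of two")
    return -(-quotient_bits // bits_per_cell)   # ceiling division

# 53 bits is the mantissa width of an IEEE 754 double (including the implicit 1)
for radix in (2, 4, 8):
    print(radix, cells_needed(53, radix))
```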
Related art SRT type algorithms implement time saving features in an attempt to speed the calculation time through each cell. For example, the partial remainders $P_i$ are in carry-save form and thus may use more efficient carry-save adders to perform the subtraction required of equation (1) (as opposed to full carry-propagate adders). Further, the multiplication by the radix $R$ (where the radix is a power of two: 2, 4, 8, 16, etc.) is a simple shift operation in binary systems. Moreover, multiplication of the quotient digit $Q$ times the divisor $D$ is a simple shift operation (for powers of two), and/or is calculated simultaneously with prescaling of the dividend and the divisor prior to the parameters entering the cells.
A major limitation of related art SRT type cells, however, is determining the quotient digit (in a Radix-4 system, each quotient digit reveals two quotient bits) from the calculated partial remainder. Understanding this limitation requires a closer look at the cells themselves. In defining the parameters of a subtractive division cell, it was mentioned that the partial remainder $P_i$ is in carry-save form. Subtraction of the quotient multiples (the $Q_{i+1} D$ portion of equation (1)) from the partial remainder $P_i$ therefore takes place with carry-save adders, which are faster than carry-propagate adders. Consider for purposes of explanation the addition of two binary numbers:
TABLE 2
AUGEND                                   1010101
ADDEND                                   0001101
Carry Propagate Add Result               1100010
Carry-Save Add Sum                       1011000
Carry-Save Add Carry                     0000101
Verification Add of Carry-Save Result    1100010

The full carry-propagate add produces the result 1100010. However, calculating this result requires the carry from each bit addition to propagate to the next stage before the final result is achieved. If each adder requires two gate delays to complete, the time it takes to complete the full carry-propagate add of the exemplary set of numbers in Table 2 is at least 14 gate delays. The carry-save add, however, does not combine the carries from each bit addition; it leaves the result in the redundant carry-save form and may therefore be completed in only two gate delays, regardless of the number of bits. Considering that the mantissa of each floating point number may be 50 bits long or more, it is easily seen why the related art SRT cells use this addition method and number form. A verification that the carry-save form is equal to the result of the carry-propagate addition is included in Table 2, which is simply the full carry-propagate add of the sum and carry results (with each carry bit weighted one position to the left of the bit pair that generated it). Thus, in SRT cells, partial remainders are calculated using carry-save adders, leaving the resultant in the carry-save form.
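The numbers of Table 2 can be reproduced with a few lines of code. The function names are hypothetical; the optional third operand is an addition for this example, showing the three-input form a cell would use when subtracting a divisor multiple from a partial remainder that is already held as a sum vector and a carry vector.

```python
def carry_save_add(x, y, z=0):
    """Bitwise add with no carry propagation: returns (sum bits, carry bits)."""
    sum_bits = x ^ y ^ z
    carry_bits = (x & y) | (x & z) | (y & z)   # carry generated at each bit position
    return sum_bits, carry_bits

def resolve(sum_bits, carry_bits):
    """Full carry-propagate add; each carry is weighted one position to the left."""
    return sum_bits + (carry_bits << 1)

augend = 0b1010101
addend = 0b0001101
sum_bits, carry_bits = carry_save_add(augend, addend)
print(format(sum_bits, "07b"), format(carry_bits, "07b"))
print(format(resolve(sum_bits, carry_bits), "07b"))
```

Each output bit of `carry_save_add` depends only on the input bits at the same position, which is why the delay does not grow with operand width.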
A quotient digit determined by each subtractive division cell is selected based on the value of the partial remainder calculated by that cell. The most accurate selection of the quotient digit would be based on the complete partial remainder calculated by the cell, but calculating the complete partial remainder requires a full carry-propagate add of the carry-save resultant, thus negating its benefits. Rather than perform the full carry-propagate add, SRT type cells rely upon an estimated partial remainder $P_i^*$ as indicated in equation (2). In this way, only a portion of the partial remainder calculated by the SRT type cell needs to be converted from carry-save form.
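A minimal sketch of the estimate of equation (2) follows. It assumes, for simplicity, non-negative sum and carry vectors of a fixed width whose plain sum is the true partial remainder (that is, the carry vector is already aligned); the function name and the example bit patterns are hypothetical.

```python
def estimate_partial_remainder(sum_vec, carry_vec, width, kept_bits):
    """Resolve only the top kept_bits of a carry-save pair; lower bits are dropped."""
    dropped = width - kept_bits
    # Short carry-propagate add over the most significant bits only
    top = (sum_vec >> dropped) + (carry_vec >> dropped)
    return top << dropped

sum_vec = 0b10110111
carry_vec = 0b01001110
true_value = sum_vec + carry_vec
estimate = estimate_partial_remainder(sum_vec, carry_vec, width=8, kept_bits=4)
print(true_value, estimate)
```

Because the discarded low-order bits of each vector are non-negative, the estimate never exceeds the true value, and it falls short by less than two units of the least significant kept bit position.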
To obtain the estimated partial remainder, related art SRT type cells perform a full carry-propagate add of the most significant bits of the calculated partial remainder. U.S. Pat. No. 5,954,789 to Yu et al. (hereinafter the '789 patent) describes in the Background section that conventional wisdom prescribes at least four bits of the calculated partial remainder should be used to determine the next quotient digit for Radix-2 SRT type systems. Similarly, related art Radix-4 implementations need six bits of the partial remainder $P_i$ to determine the next quotient digit. The related art devices apply the estimated partial remainder to a look-up table, or a hard-coded set of logic, to predict the quotient digit. The '789 patent exemplifies just such a look-up table at Table I (spanning columns 4 and 5) for the Radix-2 implementation therein. Thus, related art SRT type cells perform the carry-save add to obtain the partial remainder in carry-save form, perform a full carry-propagate add of the most significant six bits of the partial remainder (for a Radix-4 system) to obtain the estimated partial remainder, and finally apply the estimated partial remainder to a look-up table or hard-coded decision tree to obtain the next quotient digit. Since divisor multiples are calculated in advance, and the subtraction of the divisor multiples from the partial remainder can be accomplished in approximately two gate delays, the limitation of the speed of SRT type cells is calculating the estimated partial remainder (six bit full carry-propagate add for Radix-4 implementations) and applying the estimated partial remainder to the look-up table to determine the next quotient digit.
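The sequence described above (carry-save subtract, short carry-propagate add of the top bits, digit selection) can be sketched for the simpler Radix-2 case. This is an illustration under stated assumptions, not the '789 patent's Table I: the fixed-point scaling, the function names, and the selection thresholds (a commonly taught Radix-2 rule, scaled here for divisors in [1, 2)) are all choices made for this example, and Python's unbounded integers stand in for fixed-width two's complement hardware registers.

```python
FRAC_BITS = 16   # assumed fixed-point scaling: value = integer / 2**FRAC_BITS

def csa(x, y, z):
    """3-input carry-save add; the carry vector is returned already shifted left."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def select_digit(estimate):
    """Assumed Radix-2 rule on the integer-truncated estimate, digit set {-1, 0, 1}."""
    if estimate >= 0:
        return 1
    if estimate == -1:
        return 0
    return -1

def srt_radix2(dividend, divisor, cells):
    """Return (accumulated quotient digits, final partial remainder) in fixed point."""
    s, c = dividend, 0            # partial remainder held in carry-save form
    q_acc = 0
    for _ in range(cells):
        # Estimated partial remainder: truncate each vector, then a short add
        estimate = (s >> FRAC_BITS) + (c >> FRAC_BITS)
        q = select_digit(estimate)
        s, c = csa(s, c, -q * divisor)   # subtract the divisor multiple, no carry ripple
        s, c = s << 1, c << 1            # multiply by the radix (R = 2)
        q_acc = 2 * q_acc + q
    return q_acc, s + c

ONE = 1 << FRAC_BITS
dividend = 3 * ONE // 2    # 1.5
divisor = 5 * ONE // 4     # 1.25
CELLS = 20
q_acc, p = srt_radix2(dividend, divisor, CELLS)
quotient = q_acc / 2 ** (CELLS - 1)
print(quotient)
```

In this sketch the only carry-propagate addition inside the loop is the short one that forms `estimate`, which corresponds to the step identified above as limiting the speed of the cell.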
What is needed in the art is a way to increase the speed of determining the quotient digits of an SRT type floating-point division cell.