1. Field of the Invention
The present invention relates to the field of SRT division and square root mantissa units suitable for use in floating point units of microprocessors. Specifically, the present invention relates to SRT hardware dividers and square root units that produce multiple quotient digits per clock cycle.
2. Discussion of the Related Art
The SRT algorithm provides one way of performing non-restoring division. See J. E. Robertson, "A new class of digital division methods," IEEE Trans. Comput., vol. C-7, pp. 218-222, September 1958, and K. D. Tocher, "Techniques of multiplication and division for automatic binary computers," Quart. J. Mech. Appl. Math., vol. 11, pt. 3, pp. 364-384, 1958. Digital division takes a divisor and a dividend as operands and generates a quotient as output. The quotient digits are calculated iteratively, producing the most significant quotient digits first. In SRT division, unlike other division algorithms, each successive quotient digit is formulated based only on a few of the most significant partial remainder digits, rather than by looking at the entire partial remainder, which may have a very large number of digits. Since it is not possible to ensure correct quotient digit selection without considering the entire partial remainder in any given iteration, the SRT algorithm occasionally produces incorrect quotient digit results. However, the SRT algorithm provides positive, zero, and negative quotient digit possibilities. If the quotient digit in one iteration is overestimated, then that error is corrected in the next iteration by selecting a negative quotient digit. In SRT division, quotient digits must never be underestimated; quotient digits must always be overestimated or correctly estimated. By never underestimating any quotient digits, the partial remainder is kept within prescribed bounds so as to allow the correct final quotient to be computed. Because the SRT algorithm allows negative quotient digits, the computation of the final quotient output usually involves weighted adding and subtracting of the quotient digits, rather than merely concatenating all the quotient digits as in normal division.
The higher the radix, the more quotient bits are developed per iteration, but at the cost of greater complexity. A radix-2 implementation produces one quotient bit per iteration, whereas a radix-4 implementation produces two quotient bits per iteration. FIG. 1 illustrates a simple SRT radix-2 floating point implementation. The simple SRT radix-2 floating point implementation shown in FIG. 1 requires that the divisor and dividend both be positive and normalized; therefore, $1/2 \le D, \text{Dividend} < 1$. The initial shifted partial remainder, 2PR[0], is the dividend. Before beginning the first quotient digit calculation iteration, the dividend is loaded into the partial remainder register 100; thus, the initial partial remainder is the dividend. Subsequently, the partial remainders produced by iteration are developed according to the following equation.
$$PR_{i+1} = 2PR_i - q_{i+1} D \qquad (1)$$
In Equation 1, $q_{i+1}$ is the quotient digit, and has possible values of -1, 0, or +1. This quotient digit $q_{i+1}$ is solely determined by the value of the previous partial remainder and is independent of the divisor. The quotient selection logic 102 takes only the most significant four bits of the partial remainder as input, and produces the quotient digit. In division calculations, the divisor remains constant throughout all iterations. However, square root calculations typically involve adjustments to the divisor stored in the divisor register 101 after each iteration. Therefore, the independence of the quotient digit selection on the divisor is an attractive feature for square root calculations.
The partial remainder is typically kept in redundant carry save form so that calculations of the next partial remainder can be performed by carry-save adders instead of slower and larger carry-propagate adders. The partial remainder is converted into non-redundant form after all iterations have been performed and the desired precision has been reached. Because the SRT algorithm allows overestimation of quotient digits resulting in a negative subsequent partial remainder, it is possible that the last quotient digit is overestimated, so that the final partial remainder is negative. In that case, since it is impossible to correct for the overestimation, it is necessary to maintain Q and Q-1, so that if the final partial remainder is negative, Q-1 is selected instead of Q. The quotient digits are normally also kept in redundant form and converted to non-redundant form at the end of all iterations. Alternatively, the quotient and quotient minus one (Q and Q-1) can be generated on the fly according to rules developed in M. D. Ercegovac and T. Lang, "On-the-fly rounding," IEEE Trans. Comput., vol. 41, no. 12, pp. 1497-1503, December 1992.
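The on-the-fly generation of Q and Q-1 mentioned above can be illustrated with a short Python sketch. The update rules below are not quoted from the cited paper; they are an assumption derived directly from the identity Q_new = 2Q + q for a radix-2 digit q, and the function name `on_the_fly` is hypothetical. The point of the sketch is that each step only appends one bit to either Q or Q-1, so no carry-propagate subtraction is needed when a negative digit arrives.

```python
def on_the_fly(digits):
    """Accumulate Q and Q-1 from radix-2 signed digits, most significant first.

    Both values are returned as integers in which the last digit has
    weight 1. Each update shifts one of the two running values left
    and appends a single bit.
    """
    q, qm = 0, -1  # qm always holds q - 1
    for d in digits:
        if d == 1:
            q, qm = (q << 1) | 1, (q << 1)
        elif d == 0:
            q, qm = (q << 1), (qm << 1) | 1
        elif d == -1:
            q, qm = (qm << 1) | 1, (qm << 1)
        else:
            raise ValueError("digit must be -1, 0, or +1")
    return q, qm


print(on_the_fly([1, 1, 1, -1]))  # (13, 12)
```

If the final partial remainder is negative, the second value of the returned pair would be taken as the quotient; otherwise the first.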
The SRT algorithm has been extended to square root calculations allowing the utilization of existing division hardware. The simplified square root equation looks surprisingly similar to that of division. See M. D. Ercegovac and T. Lang, "Radix-4 square root without initial PLA," IEEE Trans. Comput., vol. 39, no. 8, pp. 1016-1024, August 1990. The iteration equation for square root calculations is as follows.
$$PR_{i+1} = 2PR_i - q_{i+1}\left(2Q_i + q_{i+1} 2^{-(i+1)}\right) \qquad (2)$$
In Equation 2, the terms in parentheses are the effective divisor. For square root calculations, the so-called divisor is a function of $Q_i$, which is a function of all the previous root digits $q_1$ through $q_i$. The root digits will be referred to as "quotient digits" to maintain consistency in terminology. Therefore, in order to support square root calculation using the same hardware as used for division, on-the-fly quotient generation is required in order to update the divisor after each iteration.
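The square root recurrence of Equation 2 can be exercised with exact rational arithmetic, as in the following Python sketch. Several details are assumptions made only for illustration and are not stated in the text above: the radicand is taken to lie in [1/4, 1), the iteration starts from $Q_0 = 0$ and $PR_0$ equal to the radicand, and the hypothetical `select_digit` rule examines the exact shifted partial remainder rather than a few bits of a carry-save value. Under these assumptions the partial remainder satisfies $PR_i = 2^i (x - Q_i^2)$ after every step, which the usage check below confirms.

```python
from fractions import Fraction


def select_digit(shifted_pr):
    """Hypothetical three-way digit selection on the exact value of 2*PR."""
    if shifted_pr >= 0:
        return 1
    if shifted_pr >= Fraction(-1, 2):
        return 0
    return -1


def srt_sqrt(x, n):
    """Run n iterations of Equation 2; return (Q_n, PR_n) as exact fractions."""
    x = Fraction(x)
    if not Fraction(1, 4) <= x < 1:
        raise ValueError("radicand assumed to lie in [1/4, 1)")
    root = Fraction(0)  # Q_0 (assumed starting value)
    pr = x              # PR_0 (assumed starting value)
    for i in range(n):
        d = select_digit(2 * pr)
        weight = Fraction(1, 2 ** (i + 1))
        # Equation 2: the parenthesized term plays the role of the divisor.
        pr = 2 * pr - d * (2 * root + d * weight)
        root += d * weight
    return root, pr


root, pr = srt_sqrt(Fraction(9, 16), 12)
print(float(root), pr == 2 ** 12 * (Fraction(9, 16) - root * root))
```

Because the effective divisor depends on the running root, the root must be available in usable form at every step, which is the reason on-the-fly quotient generation is needed for square root.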
Binary division algorithms are analogous to standard base 10 long division which is taught in grammar school. In $R \div D = Q$, each quotient digit for Q is guessed. In order to determine the first quotient digit, a guess for the proper quotient digit is multiplied by the divisor, and that product is subtracted from the dividend to produce a remainder. If the remainder is greater than or equal to the divisor, the guess for the quotient digit was too small; if the remainder is negative, the guess for the quotient digit was too large. In either case, when the guess for the quotient digit is incorrect, the guess must be changed so that the correct quotient digit is derived before proceeding to the next digit. The quotient digit is correct when the following relation is true: $0 \le PR < D$, in which PR stands for the partial remainder after subtraction of the quotient digit multiplied by the divisor.
The key to the SRT division algorithm is that negative quotient digits are permitted. For example, in base 10, in addition to the standard digits 0 through 9, quotient digits may take on values of -1 through -9. Consider the division operation $600 \div 40$. If the correct quotient digits are selected for each iteration, the correct result is 15. However, assume for the moment that during the first iteration, a quotient digit of 2 was incorrectly guessed instead of the correct digit of 1. The partial remainder after 2 has been selected as the first quotient digit is $600 - (2 \times 40 \times 10^1) = -200$. According to SRT division, this error can be corrected in subsequent iterations, rather than having to back up and perform the first iteration again. Assume that the second quotient digit is then correctly guessed to be -5. The partial remainder after that iteration will be $-200 - (-5 \times 40 \times 10^0) = 0$. When the partial remainder after an iteration is zero, the correct values for all the remaining digits are zeros. Thus, the computed result is $2 \times 10^1 + (-5) \times 10^0 = 15$, which is the correct result. The SRT algorithm thus allows an overestimation of any given quotient digit to be corrected by the subsequent selection of one or more negative quotient digits. It is worth noting that the estimated quotient digit must not be more than one greater than the correct quotient digit in order to subsequently reduce the partial remainder to zero, thus computing the correct result. If errors greater than positive one were allowed in estimating quotient digits, then quotient digits less than -9 (for example -10, -11, etc.) would be required in base 10.
Similarly, since the range of quotient digits is not expanded in the positive direction at all according to the SRT algorithm, underestimation of the correct quotient digit is fatal, because the resulting partial remainder will be greater than the divisor multiplied by the base, and a subsequent quotient digit higher than 9 (for example 10, 11, etc.) in base 10 would be required. Therefore, in order to keep the partial remainder within prescribed bounds, the quotient digit selection must never underestimate the correct quotient digit, and if it overestimates the quotient digit, it must do so by no more than one.
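The base 10 example above can be replayed with a few lines of Python. The helper names are hypothetical and the sketch only reproduces the arithmetic of the example: it applies a given list of signed quotient digits to the dividend and reports the partial remainder after each digit, then evaluates the signed digits as a number.

```python
def signed_digit_division_trace(dividend, divisor, digits, base=10):
    """Return the partial remainder after each quotient digit is applied.

    digits are listed most significant first; the last digit has weight 1.
    """
    remainder = dividend
    trace = []
    for position, digit in enumerate(digits):
        weight = base ** (len(digits) - 1 - position)
        remainder -= digit * divisor * weight
        trace.append(remainder)
    return trace


def signed_digit_value(digits, base=10):
    """Evaluate a most-significant-first list of signed digits."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


print(signed_digit_division_trace(600, 40, [2, -5]))  # [-200, 0]
print(signed_digit_value([2, -5]))                    # 15
print(signed_digit_division_trace(600, 40, [1, 5]))   # [200, 0]
```

Both digit sequences end with a zero remainder and evaluate to 15, which is why the overestimated first digit does no harm as long as a negative digit is available in the next position.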
It is possible to guarantee that the above criteria for keeping the partial remainder within prescribed bounds will be satisfied without considering all the partial remainder digits. Only a few of the most significant digits of the partial remainder must be considered in order to choose a quotient digit which will allow the correct result to be computed. SRT division requires a final addition after all quotient digits have been selected to reduce the redundant quotient representation into standard non-redundant form having only non-negative digits. In binary (base 2), which is utilized in modern electrical computation circuits, SRT division provides quotient digits of +1, 0, or -1. The logic 102 which generates quotient selection digits is the central element of an SRT division implementation.
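One common way to carry out the final reduction to non-redundant form is to split the signed digits into a word of positive digits and a word of negative digits and subtract one from the other. The Python sketch below illustrates that approach for radix 2; it is an assumed, generic technique and is not taken from FIG. 1, and the function name `to_nonredundant` is hypothetical.

```python
def to_nonredundant(digits):
    """Convert radix-2 signed digits (most significant first) to an integer.

    The +1 digits form one binary word and the -1 digits form another;
    a single carry-propagate subtraction yields the conventional value.
    """
    positive = 0
    negative = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("digit must be -1, 0, or +1")
        positive = (positive << 1) | (1 if d == 1 else 0)
        negative = (negative << 1) | (1 if d == -1 else 0)
    return positive - negative


print(to_nonredundant([1, 1, 1, -1]))  # 13, i.e. 0b1110 - 0b0001
```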
Early research indicated that only the most significant three bits of redundant partial remainder are necessary inputs for a radix-2 quotient digit selection function. (See S. Majerski, "Square root algorithms for high-speed digital circuits," Proc. Sixth IEEE Symp. Comput. Arithmetic, pp. 99-102, 1983; and D. Zuras and W. McAllister, "Balanced delay trees and combinatorial division in VLSI," IEEE J. Solid-State Circuits, vol. SC-21, no. 5, pp. 814-819, October 1986.) However, more recent studies have shown that four bits are required to correctly select quotient digits and keep the partial remainder within prescribed bounds. (See M. D. Ercegovac and T. Lang, Division and Square Root: Digit-recurrence Algorithms and Implementations, Kluwer Academic Publishers, 1994, ch. 3; S. Majerski, "Square-rooting algorithms for high-speed digital circuits," IEEE Trans. Comput., vol. C-34, no. 8, pp. 724-733, August 1985; P. Montuschi and L. Ciminiera, "Simple radix 2 division and square root with skipping of some addition steps," Proc. Tenth IEEE Symp. Comput. Arithmetic, pp. 202-209, 1991; and V. Peng, S. Samudrala, and M. Gavrielov, "On the implementation of shifters, multipliers, and dividers in floating point units," Proc. Eighth IEEE Symp. Comput. Arithmetic, pp. 95-101, 1987.) The selection rules according to the prior art can be expressed as in the following equations, in which PR represents the most significant four bits of the actual partial remainder, and in which the binary point appears between the third and fourth most significant digits. The partial remainder is in two's complement form, so that the first bit is the sign bit.
$$q_{i+1} = 1, \quad \text{if } 0 \le 2PR \le 3/2, \qquad (3A)$$
$$q_{i+1} = 0, \quad \text{if } 2PR = -1/2, \qquad (3B)$$
$$q_{i+1} = -1, \quad \text{if } -5/2 \le 2PR \le -1. \qquad (3C)$$
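Equation 1 and selection rules 3A through 3C can be combined into a small software model, sketched here in Python with exact fractions. This is an illustration under stated assumptions rather than a description of the hardware of FIG. 1: the partial remainder is kept as an exact non-redundant value instead of in carry-save form, the four examined bits are modeled by truncating the shifted partial remainder down to a multiple of 1/2, and the names `select_digit` and `srt_divide` are hypothetical. Because the initial shifted partial remainder is the dividend, the first quotient digit carries weight 1 in this model.

```python
from fractions import Fraction
import math


def select_digit(shifted_pr):
    """Apply rules 3A-3C to 2*PR truncated to four bits (format xxx.x)."""
    est = Fraction(math.floor(shifted_pr * 2), 2)
    if not Fraction(-5, 2) <= est <= Fraction(3, 2):
        raise ValueError("partial remainder left the prescribed bounds")
    if est >= 0:
        return 1   # rule 3A
    if est == Fraction(-1, 2):
        return 0   # rule 3B
    return -1      # rule 3C


def srt_divide(dividend, divisor, n):
    """Return the n-digit quotient of dividend/divisor, truncated toward zero."""
    dividend, divisor = Fraction(dividend), Fraction(divisor)
    half = Fraction(1, 2)
    if n < 1 or not (half <= dividend < 1 and half <= divisor < 1):
        raise ValueError("need n >= 1 and operands normalized to [1/2, 1)")
    shifted_pr = dividend  # 2*PR_0 is the dividend
    quotient = Fraction(0)
    for i in range(n):
        q = select_digit(shifted_pr)
        pr = shifted_pr - q * divisor          # Equation 1
        quotient += q * Fraction(1, 2 ** i)    # first digit has weight 1
        shifted_pr = 2 * pr
    if pr < 0:
        # Last digit was overestimated: select Q-1 instead of Q.
        quotient -= Fraction(1, 2 ** (n - 1))
    return quotient


print(srt_divide(Fraction(3, 4), Fraction(1, 2), 4))  # 3/2
```

In this model every non-negative truncated value selects +1, so quotient digits are overestimated or correct but never underestimated, and the only correction ever needed at the end is the choice between Q and Q-1.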
Because the partial remainder is stored in register 100 in carry-save form, the actual most significant four bits are not available without performing a full carry propagate addition of the carry and sum portions of the partial remainder. Because it is desirable to avoid having to perform a full carry propagate addition during each iteration in order to compute the most significant four bits of the partial remainder, quotient digit selection rules can be developed using an estimated partial remainder.
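The difference between the actual and the estimated most significant bits can be made concrete with the following Python sketch. The 8-bit register width, the function names, and the exhaustive comparison are assumptions chosen for illustration only. The estimate adds just the top four bits of the sum and carry words, so it ignores any carry that would ripple up from the discarded low-order bits; as a result it is either equal to the actual top four bits or one unit in the last examined place below them.

```python
WIDTH = 8  # hypothetical partial remainder width
TOP = 4    # number of most significant bits examined
SHIFT = WIDTH - TOP


def top_bits_exact(sum_bits, carry_bits):
    """Top bits after a full carry-propagate addition of sum and carry."""
    full = (sum_bits + carry_bits) & ((1 << WIDTH) - 1)
    return full >> SHIFT


def top_bits_estimate(sum_bits, carry_bits):
    """Top bits obtained by adding only the top bits of sum and carry."""
    return ((sum_bits >> SHIFT) + (carry_bits >> SHIFT)) & ((1 << TOP) - 1)


# Collect every difference (modulo 2**TOP) seen over all 8-bit pairs.
differences = {
    (top_bits_exact(s, c) - top_bits_estimate(s, c)) % (1 << TOP)
    for s in range(1 << WIDTH)
    for c in range(1 << WIDTH)
}
print(sorted(differences))  # [0, 1]
```

Selection rules built on the estimate therefore have to tolerate this one-unit uncertainty, which is why they differ from rules written for the actual partial remainder.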
Typically, the most significant four partial remainder bits are used to select the quotient digit, as shown in FIG. 1, where the quotient selection logic 102 takes carry and sum portions of the partial remainder to select the quotient digit. For square root calculations the divisor logic 103 substitutes $2Q_i + q_{i+1} 2^{-(i+1)}$ for D. The divisor logic simultaneously produces a divisor D which is used if $q_{i+1} = -1$ and /D which is used if $q_{i+1} = +1$. D is a function of the previous quotient Q-1[i] while /D is a function of the inverted previous quotient Q[i]. The three-to-one multiplexor 104 supplies the three-to-two carry save adder 105 with either /D when $q_{i+1} = +1$, 0 when $q_{i+1} = 0$, or D when $q_{i+1} = -1$. Negative D is the two's complement of D, which is /D+1; therefore, when $q_{i+1} = +1$, negative D is added to the shifted partial remainder by asserting the carry input 106 of the carry save adder 105. The iterative division and square root hardware shown in FIG. 1 accumulates the quotient Q and the quotient minus one Q-1 in an accumulator 107. When the final partial remainder is negative, Q-1 is the proper quotient; when the final partial remainder is zero or positive, Q is the correct quotient. Because the iterative division and square root algorithms generate output bits beginning with the most significant bits and continuing to produce output bits with decreasing significance each iteration, the absolute value of the partial remainder output by each iteration is either equal to or smaller than the partial remainder stored in register 100, and in either case, the most significant two bits of the resulting partial remainder are equal. The multiplication by two required for the subsequent iteration by Equations 1 and 2 is accomplished by left shifting the redundant carry save partial remainder by one bit position before clocking into register 100.
The most significant carry and sum bits of the output partial remainder are discarded, but because the most significant two bits of partial remainder were equal, the sign of the shifted partial remainder is the same as the output partial remainder. This left shifting is performed by merely wiring the output 106 of the carry save adder 105 to the input of the partial remainder register 100 in a shifted manner.
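The subtraction of D by adding /D with the carry input asserted can be modeled in a few lines of Python. The 8-bit width and the function names are assumptions made for the sketch, and the model is a generic bitwise three-to-two carry save adder rather than a description of the specific circuit 105. Because the carry word is shifted left by one position, its least significant bit is always free, and that is where the carry input is injected.

```python
WIDTH = 8  # hypothetical datapath width
MASK = (1 << WIDTH) - 1


def carry_save_add(a, b, c, carry_in=0):
    """Three-to-two carry save addition: no carry ripples between bits."""
    sum_bits = (a ^ b ^ c) & MASK
    majority = (a & b) | (a & c) | (b & c)
    carry_bits = ((majority << 1) | carry_in) & MASK
    return sum_bits, carry_bits


def subtract_divisor(pr_sum, pr_carry, divisor):
    """Subtract divisor from a carry-save value by adding /D plus carry-in."""
    inverted = ~divisor & MASK
    return carry_save_add(pr_sum, pr_carry, inverted, carry_in=1)


s, c = subtract_divisor(0b00110000, 0b00001100, 0b00101000)
print((s + c) & MASK == (0b00110000 + 0b00001100 - 0b00101000) & MASK)
```

The sum and carry words are only resolved with a carry-propagate addition here to check the result; in the iteration itself they would stay in redundant form.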
If the iterative division and square root hardware is implemented in a processor having other functional units, the cycle time is predetermined and is a function of the slowest functional unit on the processor. The critical path that limits the cycle time of the iterative division and square root hardware shown in FIG. 1 is likely to be through the quotient selection logic 102, the multiplexor 104, and the carry save adder 105. If the propagation delay through a quotient selection logic circuit 102 (QSLC) is $t_{QSLC}$, the propagation delay through the multiplexor 104 is $t_{mux}$, and the propagation delay through the carry save adder 105 is $t_{csa}$, then the critical path of the iterative division/square root unit shown in FIG. 1 is as follows.
$$t_{crit} = t_{QSLC} + t_{mux} + t_{csa} \qquad (4)$$
If $t_{crit}$ is less than the predetermined cycle time of the processor, the best performance gain is then achieved by maximizing the number of iterations performed per cycle. Therefore, instead of producing only one quotient digit per cycle as in FIG. 1, it is desirable to produce multiple quotient digits per cycle. In order to produce multiple quotient digits per cycle, it will be necessary to minimize the latency of quotient digit computation.
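As a rough illustration of the timing argument, the following Python sketch estimates how many iterations of the FIG. 1 critical path would fit in one cycle if the iterations were simply cascaded one after another. The cascading model, which ignores register setup time and any overlap between stages, and the delay values in picoseconds are assumptions invented for the example.

```python
def iterations_per_cycle(t_cycle, t_qslc, t_mux, t_csa):
    """Whole iterations that fit in one cycle under simple cascading."""
    t_crit = t_qslc + t_mux + t_csa  # Equation 4
    return t_cycle // t_crit


# Hypothetical delays in picoseconds.
print(iterations_per_cycle(t_cycle=2000, t_qslc=300, t_mux=150, t_csa=200))  # 3
```

Under this model, shortening the per-digit latency is the only way to raise the number of quotient digits produced per cycle.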