Digital computers are designed to perform a variety of arithmetic instructions on binary numerical data. Within the central processing unit (CPU) of these digital computers resides a major subdivision called the arithmetic logic unit (ALU). The ALU is used to perform a variety of data processing and arithmetic operations under the control of the CPU. Of particular significance to the present invention is the architecture, design, and method used for performing floating point division.
As is well known in the field, floating point division is performed by separating the sign, exponent, and mantissa bits of the dividend and divisor, and performing the necessary operations on each corresponding set of bits. The sign bit operation entails taking the exclusive-OR of the dividend sign bit and the divisor sign bit to obtain the quotient sign bit. The exponent operation involves the subtraction of the divisor exponent from the dividend exponent to obtain the quotient exponent. Obtaining the quotient mantissa requires a more complex algorithm and architecture, and is the factor that most limits the speed at which a divide can be performed. The divisor mantissa must be repeatedly subtracted from the dividend or partial remainder until the desired precision is reached, which then terminates the operation. Depending on the precision desired and the algorithm used, the mantissa division operation can take up a significant amount of processing time.
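By way of illustration only, the field-wise decomposition described above may be sketched in Python as follows. The function name `fp_divide_fields`, the use of unbiased integer exponents, the exact `Fraction` mantissas, and the final normalization step are assumptions made for this sketch and are not taken from any particular floating point format.

```python
from fractions import Fraction

def fp_divide_fields(sign_a, exp_a, man_a, sign_b, exp_b, man_b):
    """Divide two operands given as (sign, exponent, mantissa) fields.

    Assumed encoding (hypothetical): value = (-1)**sign * mantissa * 2**exponent,
    with each mantissa in the range [1, 2).
    """
    sign_q = sign_a ^ sign_b                   # exclusive-OR of the sign bits
    exp_q = exp_a - exp_b                      # divisor exponent subtracted from dividend exponent
    man_q = Fraction(man_a) / Fraction(man_b)  # mantissa division, the slow step in hardware
    # Assumed normalization: the mantissa quotient lies in (1/2, 2), so at most
    # one left shift is needed to bring it back into [1, 2).
    if man_q < 1:
        man_q *= 2
        exp_q -= 1
    return sign_q, exp_q, man_q
```

For example, dividing 1.5 x 2^3 by -1.25 x 2^1 in this sketch yields a sign of 1, an exponent of 2, and a mantissa of 6/5.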
Divide operations are slow because division typically involves the trial and error determination of quotient digits. The simplest binary implementation, restoring division, must methodically determine quotient values one bit at a time. This is done by first positioning the divisor with respect to the dividend and performing a subtraction to calculate a partial remainder. If the partial remainder is zero or positive, a quotient bit of 1 is determined, the next bit of the dividend is appended onto the partial remainder as the least significant bit, the divisor is shifted, and another subtraction is performed. However, if the partial remainder is negative, a quotient bit of 0 is determined, and the previous partial remainder must be "restored" by adding the divisor back to the partial remainder. This "restoring" results in a significant reduction in performance.
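The restoring procedure described above may be sketched, for unsigned integers, as follows. This is a minimal software illustration rather than a hardware description; the function name `restoring_divide` and the parameter `n_bits` (the assumed width of the dividend) are hypothetical.

```python
def restoring_divide(dividend, divisor, n_bits):
    """Unsigned restoring division producing one quotient bit per step.

    Assumes 0 <= dividend < 2**n_bits and divisor > 0.
    Returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = 0
    quotient = 0
    for i in range(n_bits - 1, -1, -1):
        # Append the next dividend bit as the least significant bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor                  # trial subtraction
        if remainder >= 0:
            quotient = (quotient << 1) | 1    # successful: quotient bit of 1
        else:
            remainder += divisor              # unsuccessful: restore
            quotient = quotient << 1          # quotient bit of 0
    return quotient, remainder
```

Each unsuccessful step in this sketch costs both a subtraction and a restoring addition, which is the performance penalty noted above.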
Another drawback to the restoring division method is that a full subtraction must be performed at each step. In high performance 64-bit machines, such as those made by Cray Research, Inc., the assignee of the present invention, performing a full subtraction and a possible correcting addition at each step is far too time consuming in terms of processing speed to be effective.
Nonrestoring division improves on this algorithm by eliminating the need to restore the partial remainder after an unsuccessful subtraction (i.e., a negative partial remainder). However, most nonrestoring division techniques still require performing either a full subtraction or a full addition at each step, thus limiting their desirability as an effective division tool.
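For comparison, a nonrestoring variant may be sketched as follows, again for unsigned integers and with hypothetical names. After a negative partial remainder, the next step adds the divisor instead of restoring, so each step performs exactly one full addition or subtraction; a single correction is applied at the end.

```python
def nonrestoring_divide(dividend, divisor, n_bits):
    """Unsigned nonrestoring division.

    Assumes 0 <= dividend < 2**n_bits and divisor > 0.
    Returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = 0
    quotient = 0
    for i in range(n_bits - 1, -1, -1):
        bit = (dividend >> i) & 1
        shifted = 2 * remainder + bit         # shift in the next dividend bit
        if remainder >= 0:
            remainder = shifted - divisor     # previous step succeeded: subtract
        else:
            remainder = shifted + divisor     # previous step failed: add, no restore
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor                  # one final correction of the remainder
    return quotient, remainder
```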
SRT Division. The well-known SRT division algorithm is presently a widely accepted algorithm known to reduce mantissa division time.
The SRT divide algorithm is as follows:
Let:
$p_0$ = dividend
$p_i$ = partial remainder after the $i$th iteration
$d$ = divisor
$q_i$ = $i$th digit (2 bits in the present invention) of the quotient
The basic iteration is the following:
$$p_{i+1} = 4\,(p_i - q_{i+1}\,d)$$
That is, the $(i+1)$st quotient digit times the divisor is subtracted from the $i$th partial remainder. Then the partial remainder is shifted by 2 bits (i.e., multiplied by 4) to form $p_{i+1}$.
Initially, $1 \le p_0 < 2$ and $1 \le d < 2$. Then, at each step, $p_{i+1}$ is maintained in the following range:
$$-\frac{2d}{3} \le \frac{p_{i+1}}{4} \le \frac{2d}{3}$$
That is, before shifting, the partial remainder lies within plus or minus $2d/3$. After shifting,
$$-\frac{8d}{3} \le p_{i+1} \le \frac{8d}{3}.$$
The next quotient digit must then be chosen to force the partial remainder back into the plus or minus $2d/3$ range.
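As a brief worked check of these bounds (an illustrative step added here, using only the definitions above), multiplying the pre-shift range by 4 gives the post-shift range:

$$-\frac{2d}{3} \le \frac{p_{i+1}}{4} \le \frac{2d}{3} \;\Longrightarrow\; -\frac{8d}{3} \le p_{i+1} \le \frac{8d}{3}$$

At the extreme $p_{i+1} = 8d/3$, subtracting twice the divisor returns the remainder to the edge of the pre-shift range:

$$\frac{8d}{3} - 2d = \frac{2d}{3}$$

By symmetry the same holds at $-8d/3$ with a digit of $-2$, which suggests that quotient digits of magnitude no greater than 2 suffice to keep the iteration bounded.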
SRT division is often used with a higher radix redundant digit set, such that only a few of the highest order bits of the divisor and the partial remainder need be examined to select a proper quotient digit.
Radix Four Digit Set. One such set is the radix four redundant digit set. This is the set:
$$\{-2, -1, 0, 1, 2\}$$
This is a redundant digit set, which means there is more than one representation for any given number (except 0). For example, any number can be represented by a binary equivalent using 1's and 0's. The binary representation may be expanded by using 1, 0, and -1 (-1 will be represented herein using an underline, e.g., $\underline{1}$). Now, any number, except 0, can be represented by a plurality of binary equivalents. The number seven (7), for example, can be represented by $0111$ or $100\underline{1}$, and the number two (2) can be represented by $0010$, $01\underline{1}0$, or $1\underline{1}\,\underline{1}0$, as follows:
$$
\begin{aligned}
0111 &\Rightarrow (0 \times 8) + (1 \times 4) + (1 \times 2) + (1 \times 1) = 7 \\
100\underline{1} &\Rightarrow (1 \times 8) + (0 \times 4) + (0 \times 2) + (-1 \times 1) = 7 \\
0010 &\Rightarrow (0 \times 8) + (0 \times 4) + (1 \times 2) + (0 \times 1) = 2 \\
01\underline{1}0 &\Rightarrow (0 \times 8) + (1 \times 4) + (-1 \times 2) + (0 \times 1) = 2 \\
1\underline{1}\,\underline{1}0 &\Rightarrow (1 \times 8) + (-1 \times 4) + (-1 \times 2) + (0 \times 1) = 2
\end{aligned}
$$
Further radix four examples include:
$$
\begin{aligned}
002 &\Rightarrow (0 \times 16) + (0 \times 4) + (2 \times 1) = 2 \\
01\underline{2} &\Rightarrow (0 \times 16) + (1 \times 4) + (-2 \times 1) = 2 \\
02\underline{1} &\Rightarrow (0 \times 16) + (2 \times 4) + (-1 \times 1) = 7 \\
1\underline{2}\,\underline{1} &\Rightarrow (1 \times 16) + (-2 \times 4) + (-1 \times 1) = 7
\end{aligned}
$$
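The redundancy illustrated by these examples may be checked with a short sketch such as the following. The function name `signed_digit_value` is hypothetical; digits are listed most significant first, and a negative entry stands for an underlined digit.

```python
def signed_digit_value(digits, radix):
    """Evaluate a signed-digit string given most significant digit first.

    A negative entry (for example -1 or -2) plays the role of an underlined digit.
    """
    value = 0
    for digit in digits:
        value = value * radix + digit   # Horner evaluation in the given radix
    return value

# Several distinct digit strings evaluate to the same number.
binary_sevens = [[0, 1, 1, 1], [1, 0, 0, -1]]
binary_twos = [[0, 0, 1, 0], [0, 1, -1, 0], [1, -1, -1, 0]]
radix4_sevens = [[0, 2, -1], [1, -2, -1]]
```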
A well-known means of implementing the SRT divide algorithm is to combine it with the Radix-4 redundant digit set. In this implementation, the following inequalities have been determined to keep the partial remainders in the proper range:
If $4d/3 \le p_i \le 8d/3$, then select $q_{i+1} = 2$ to force the condition $-2d/3 \le p_{i+1}/4 \le 2d/3$;
If $d/3 \le p_i \le 5d/3$, then select $q_{i+1} = 1$;
If $-2d/3 \le p_i \le 2d/3$, then select $q_{i+1} = 0$;
If $-5d/3 \le p_i \le -d/3$, then select $q_{i+1} = -1$;
If $-8d/3 \le p_i \le -4d/3$, then select $q_{i+1} = -2$.
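A minimal software sketch of the radix four SRT iteration under these selection ranges is given below. It uses exact rational arithmetic rather than the truncated comparisons a hardware divider would use. Because the ranges overlap, any threshold inside an overlap is valid; the thresholds of $\pm d/2$ and $\pm 3d/2$ used here are an assumption of this sketch, as are the function names `select_digit` and `srt_radix4`.

```python
from fractions import Fraction

def select_digit(p, d):
    """Pick a quotient digit in {-2, ..., 2} for partial remainder p and divisor d.

    Thresholds sit inside the overlaps of the selection ranges (assumed choice).
    """
    if p >= Fraction(3, 2) * d:
        return 2
    if p >= Fraction(1, 2) * d:
        return 1
    if p > Fraction(-1, 2) * d:
        return 0
    if p > Fraction(-3, 2) * d:
        return -1
    return -2

def srt_radix4(dividend, divisor, n_digits):
    """Run n_digits iterations of p <- 4 * (p - q * d), with 1 <= dividend, divisor < 2.

    Returns (digits, quotient, final_partial_remainder).
    """
    p = Fraction(dividend)
    d = Fraction(divisor)
    digits = []
    for _ in range(n_digits):
        q = select_digit(p, d)
        p = 4 * (p - q * d)                   # subtract q*d, then shift by 2 bits
        assert abs(p) <= Fraction(8, 3) * d   # shifted remainder stays in range
        digits.append(q)
    # The first digit has weight 1; each later digit has weight 1/4 of the previous.
    quotient = sum(Fraction(q, 4 ** i) for i, q in enumerate(digits))
    return digits, quotient, p
```

In this sketch the identity dividend = quotient x divisor + (final partial remainder) / 4^n holds exactly after n iterations, so the error in the quotient shrinks by a factor of four with each digit.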
For example, one radix four SRT divider has been described in J. Fandrianto, "Algorithm for High Speed Shared Radix Four Division and Radix Four Square Root", Proceedings of the 8th Symposium on Computer Arithmetic, May 1987, pp. 73-79, which is incorporated herein by reference. However, this and other SRT dividers proposed to date still require a long logic delay path, which limits the speed at which a division is accomplished.
One way to measure computation unit performance is by the number of gate delays that must be tolerated per clock period. The number of gate delays through the unit per iteration in a given system limits the speed at which the final result can be obtained. A large number of gate delays also limits the clock speed at which a synchronous unit can operate, because intermediate results cannot be latched, without risking the latching of invalid data, until the appropriate signal has propagated through all of the gate delays in the path.
For example, the SRT divider discussed in the Fandrianto reference contains a total of 12 gate delays per iteration in the critical path. This divider is shown in block diagram form in FIG. 1. In operation, CSA 16 forms and latches the sum/carry representation of the partial remainder, and CLA 22 adds the upper bits of the sum and carry portions of the partial remainder to arrive at an approximation of the partial remainder. Complementer 26 forms the absolute value of the approximated partial remainder, which is then fed into quotient select logic 30 along with the four uppermost bits of the divisor. Quotient select logic 30 contains a look-up table which gives the absolute value of the quotient digit, whose sign is the same as the sign of the partial remainder. Finally, the quotient value is fed into multiplexor logic 12, which selects the correct multiple of the divisor to be subtracted from the partial remainder.
Upon quotient digit selection, the partial quotient digits are placed in one of two shift registers, depending on their sign. Each register receives a value for each iteration. If the quotient digit selected is positive, the plus quotient register (+Q 34) receives the appropriate pair of bits and the minus quotient register (-Q 38) receives a pair of `0` bits. If the quotient digit selected is negative, -Q register 38 receives the appropriate pair of bits and +Q register 34 receives a pair of `0` bits. If the quotient digit selected is `0`, then both registers receive a pair of `0` bits. In each iteration, the appropriate digits are loaded into the least significant bit positions and then shifted up two bits in preparation to receive the next quotient digit. When the division is completed, -Q register 38 is subtracted from +Q register 34 to give the final quotient. For example:
quotient digits selected = 0 +1 -2 +1 -1 0 +2; then,
+Q register = 00 01 00 01 00 00 10
-Q register = 00 00 10 00 01 00 00
(+Q) - (-Q) = 00 00 10 00 11 00 10 (final quotient).
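The two-register accumulation in this example may be reproduced with a short sketch such as the following. The function name `accumulate_quotient` is hypothetical, and ordinary Python integers stand in for the shift registers. Shifting before loading, as done here, produces the same final register contents as loading and then shifting in preparation for the next digit.

```python
def accumulate_quotient(digits):
    """Accumulate radix four signed digits into plus and minus registers.

    Each digit contributes a pair of bits to exactly one register (or to
    neither, for a digit of 0). Returns (plus_q, minus_q, plus_q - minus_q).
    """
    plus_q = 0
    minus_q = 0
    for q in digits:
        plus_q = (plus_q << 2) | (q if q > 0 else 0)      # magnitude of a positive digit
        minus_q = (minus_q << 2) | (-q if q < 0 else 0)   # magnitude of a negative digit
    return plus_q, minus_q, plus_q - minus_q

plus_q, minus_q, final_q = accumulate_quotient([0, 1, -2, 1, -1, 0, 2])
```

Formatting the three results as 14-bit binary strings gives the register contents and final quotient listed in the example above.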
The critical loop for timing purposes in the divider of FIG. 1 is the one from Carry Save Adder (CSA) 16, through the CSA sum 18a and carry 18b latches, Carry Lookahead Adder (CLA) 22, complementer 26, and quotient select logic 30, and back to multiplexor logic 12. In the critical loop of the divider shown in FIG. 1 there are two gate delays for CSA 16, two for complementer 26, four (at least) for CLA 22, two (at least) for quotient select logic 30, and two for multiplexor logic 12, for a total of at least 12 gate delays per iteration. For certain high performance computing machines, such as those manufactured by Cray Research, Inc., the assignee of the present invention, 12 gate delays per iteration do not meet the performance requirements of the machine. There is therefore a need in the art for a very high speed divider, suitable for high performance applications, which can arrive at a quotient in the least number of gate delays.