The central processing unit of a computer is used to perform many numeric functions. In particular, a floating point unit within the central processing unit of a computer typically performs the operations of division and square root, among others. Various techniques may be used to perform a division or square root operation to produce a result, although tradeoffs in speed and complexity exist. For example, certain subtractive division and square root algorithms converge on a result at a linear rate as a function of the precision of the operand in bits. On the other hand, multiplicative algorithms use iterative approximation and converge quadratically, and hence, are preferable because they are faster.
Goldschmidt's algorithm is the most attractive of the multiplicative algorithms because the numerator and denominator multiplications it uses are independent operations. The independent nature of these multiplications allows a more efficient use of pipelined multiplier hardware. An example of performing a floating point division using Goldschmidt's algorithm is illustrated in FIG. 1. FIG. 1 shows a numerator X to be divided by a denominator Y in order to produce a quotient. In Goldschmidt's algorithm, both the numerator and the denominator are multiplied by a series of numbers labeled r(0) through r(n). The numbers r(i) are chosen such that the denominator converges to the value "1", which necessarily implies that when the numerator is multiplied by these same numbers r(i), it will converge to the quotient, X/Y.
In other words, computing the quotient X/Y with Goldschmidt's algorithm involves multiplying both the numerator and the denominator by successive values r(i), so that X(i+1)=X(i)*r(i) and Y(i+1)=Y(i)*r(i). Successive values r(i)=2-Y(i) are chosen so that Y(i) converges to the number "1" and X(i) therefore approaches the quotient of X divided by Y. To ensure rapid convergence, the initial value of r(i) is set to a seed value close to 1/Y. Seed values are stored in a look up table and are indexed with the leading digits of Y. Square roots are computed similarly by letting X=Y=A and iterating over X(i+1)=X(i)*r(i) and Y(i+1)=Y(i)*r(i)*r(i). Successive r(i)'s are chosen so that Y(i) converges to the number "1" and X(i) converges to the square root of A. Here, r(i)=(3-Y(i))/2, and a pre-scaled value close to the square root of Y is read from a look up table and used as the initial value of r(i). Thus, to perform a division or a square root operation in the floating point unit of a microprocessor, Goldschmidt's algorithm uses iterative multiplications in order to reach a result.
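By way of illustration only, the two recurrences described above may be sketched in software as follows. The function names, the use of ordinary floating point arithmetic, and the iteration counts are assumptions made for this sketch and do not describe any particular hardware. The seed arguments stand in for the look up table. For the square root, the sketch assumes a seed close to the reciprocal of the square root of A, since that is the value that drives Y(i) toward "1" under the recurrence Y(i+1)=Y(i)*r(i)*r(i).

```python
def goldschmidt_divide(x, y, seed, iterations=5):
    """Approximate x / y; `seed` is a rough estimate of 1/y (assumed table output)."""
    r = seed
    for _ in range(iterations):
        x = x * r      # X(i+1) = X(i) * r(i)
        y = y * r      # Y(i+1) = Y(i) * r(i), independent of the line above
        r = 2 - y      # next multiplier, chosen so that Y converges to 1
    return x


def goldschmidt_sqrt(a, seed, iterations=6):
    """Approximate sqrt(a); `seed` is assumed to be a rough estimate of 1/sqrt(a)."""
    x = y = a
    r = seed
    for _ in range(iterations):
        x = x * r          # X(i+1) = X(i) * r(i)
        y = y * r * r      # Y(i+1) = Y(i) * r(i) * r(i)
        r = (3 - y) / 2    # next multiplier, chosen so that Y converges to 1
    return x


quotient = goldschmidt_divide(1.5, 1.25, seed=0.75)   # 1/1.25 = 0.8, seed is coarse
root = goldschmidt_sqrt(2.0, seed=0.7)                # 1/sqrt(2) is about 0.707
```

In the division loop the two multiplications do not depend on each other, which is the property that suits a pipelined multiplier.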
Accordingly, floating point units incorporating an expensive multiplier may make iterative use of the multiplier for implementing fast algorithms for division, square root, and/or transcendental function computation by extended polynomial approximation. The process of multiplication can be thought of as having two main steps, the first step being the generation of partial products, and the second step being the reduction of these partial products to the final product. Carry-save adders are usually used to reduce the partial products to two rows, and one full adder is then needed to reduce these two rows to a single row of the result.
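The reduction step may be illustrated with the following sketch, which models a carry-save adder as a 3:2 compressor operating on Python integers. The function names are hypothetical, and the rows are reduced in a simple chain rather than in a tree, so the sketch shows only the arithmetic and not the timing of real hardware.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c                               # bitwise sum, no carry propagation
    carry_row = ((a & b) | (a & c) | (b & c)) << 1    # carries, moved up one position
    return sum_row, carry_row


def reduce_to_two_rows(rows):
    """Apply carry-save adders repeatedly until only two rows remain."""
    rows = list(rows)
    while len(rows) < 2:
        rows.append(0)                                # pad so two rows always come out
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(a, b, c))
    return rows


rows = [13, 7, 22, 9, 31]
sum_row, carry_row = reduce_to_two_rows(rows)
final_product = sum_row + carry_row                   # the single full addition
```

Only the last line propagates carries across the full width; every earlier step works on each bit position independently.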
The simplest form of multiplication of a multiplicand (traditionally the numerator) by a multiplicator (the denominator) operates bit by bit on the multiplicator and produces one partial product corresponding to the multiplicand for each bit in the multiplicator. Thus, for an N-bit multiplicand and an M-bit multiplicator, an N by M bit multiplier along these lines generates M partial products to be added, each N bits long. To build this multiplier, a stack of M adders is used to sum up the product. Each partial product is either the value of the multiplicand or the value zero, determined by the individual bits of the multiplicator. Such a multiplier is simple but costly in size, and M additions are necessary to compute the product. However, it is possible to reduce the number of partial products to be added by using Booth encoding.
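A minimal sketch of this bit-by-bit scheme follows, assuming unsigned operands held as Python integers. The function name and the example values are chosen only for illustration.

```python
def partial_products(multiplicand, multiplicator, m_bits):
    """One partial product per multiplicator bit: the shifted multiplicand or zero."""
    return [
        (multiplicand << i) if (multiplicator >> i) & 1 else 0
        for i in range(m_bits)
    ]


rows = partial_products(13, 11, m_bits=4)   # 11 is 1011 in binary
product = sum(rows)                         # stands in for the stack of M adders
```

For a 4-bit multiplicator the list holds four rows, one of which is zero because the corresponding multiplicator bit is 0.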
Booth encoding allows the multiplication process to skip through contiguous strings of 1's and contiguous strings of 0's in the multiplicator. For an N-bit multiplicator, no more than N/2 partial products are created. Modified Booth encoding processes overlapping groups of bits at a time, the number of new bits B per group being described by a radix of 2^B. Radix 2^B Booth encoding generates only (N/B)+1 partial products, and lends itself to parallel hardware implementation. However, when the partial products from one multiplier operation are first added, the result is initially in a redundant format, such as carry-sum. That is, the result takes the form of two rows of binary information, a carry row and a sum row. In an iterative multiplication, this result needs to be fed back into the Booth encoder before it is combined with the next number to generate additional partial products. But before this result may undergo Booth encoding again, it must first be processed by an adder to put it into non-redundant binary format. In other words, the carry row and the sum row must first be added together, because the conventional Booth encoder only accepts one row of information. Performing such an addition after each multiplication in a series of multiplications consumes precious time.
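For the common case of B=2 (radix 4), the recoding may be sketched as below. This is one textbook formulation for an unsigned multiplicator, given as an assumption for illustration rather than as the encoder of any particular design; the function name is hypothetical. Each digit is formed from three overlapping bits and lies in the range -2 to +2, so each partial product is zero, the multiplicand, or twice the multiplicand, possibly negated.

```python
def booth_radix4_digits(value, n_bits):
    """Recode an unsigned n_bits-wide value into (n_bits // 2) + 1 radix-4 digits."""
    assert 0 <= value < (1 << n_bits)
    padded = value << 1                        # implicit 0 below the lowest bit
    digits = []
    for j in range(n_bits // 2 + 1):
        group = (padded >> (2 * j)) & 0b111    # bits 2j+1, 2j and 2j-1 of value
        b_hi, b_mid, b_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def booth_multiply(multiplicand, multiplicator, n_bits):
    """Sum the Booth partial products; each one is digit * multiplicand, shifted."""
    digits = booth_radix4_digits(multiplicator, n_bits)
    return sum((d * multiplicand) << (2 * j) for j, d in enumerate(digits))


digits = booth_radix4_digits(0b101101, 6)      # 45 recoded into four digits
```

Note that the encoder reads a single binary row: the three bits of each group must already be the true bits of the value, which is why a carry row and a sum row cannot be applied to it directly.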
FIG. 2 shows a prior art multiplicative divider 200 used in quotient generation that uses iterative multiplication. Such a multiplicative divider may be used to perform the division of X by Y using iterative multiplication of a series of numbers r(0) through r(n) as shown in FIG. 1. Shown is an initial input 201, a conventional Booth encoder 204, a partial product generator 206, a Wallace tree compression unit 208 and a 60-bit dedicated full adder 202. The Booth encoder 204 may be any typical Booth encoder as is known in the art. The partial product generator 206 takes as one operand the output of the Booth encoder and takes as a second operand one of the series of numbers r(i). Partial product generators are known in the art and may be implemented using any suitable technique. The Wallace tree compression unit 208 is also known in the art. Wallace tree compression takes the partial products from the partial product generator and using rows of carry-save adders compresses the partial products into two rows, a sum row and a carry row. The 60-bit adder is a conventional adder that adds the sum row and the carry row and feeds the result into the Booth encoder 204. The initial input 201 is used to feed an initial input such as the value X into the Booth encoder 204. The operation of this multiplicative divider will now be described with reference to FIG. 3.
FIG. 3 is a flow chart showing a prior art technique for multiplying a quantity X by a series of numbers r(0) through r(n). For illustrative purposes, the quantity X and each of the numbers r(i) are 60-bit numbers. In a first step 302 a variable Intermediate Result is set equal to the quantity X. This variable Intermediate Result holds the value of X as it is being modified by successive multiplications by the series of numbers r(i). Also in this step, the index i for each of the r(i) is set equal to 0. Next, in step 304, the Intermediate Result is fed into the Booth encoder 204. The first time through the loop the Intermediate Result will simply be the initial value of X. On later iterations through this loop, the variable Intermediate Result will represent a changing value of X as it is multiplied by successive values for the r(i). For a given iteration through the loop of the flowchart, the Booth encoder is used to encode the current Intermediate Result using the conventional Booth algorithm. In step 306 the partial product generator takes as one input the Booth encoded Intermediate Result and takes as a second input the next number r(i) in order to form the partial products for that multiplication. The partial product generator 206 then outputs 31 partial products (or rows) of 60 bits each. In step 308 the Wallace tree compression unit 208 uses carry save hardware to perform compression in order to reduce the 31 rows of partial products into one sum row and one carry row of 60 bits each.
Because the conventional Booth encoder will only accept one row of data, this sum row and carry row must be added together before being fed into the Booth encoder for the next iteration of the algorithm. Thus, in step 310, an additional dedicated full adder must be used to add the sum row and the carry row to produce the new Intermediate Result. This adding step is performed every time through the loop of the algorithm and therefore lengthens the time needed to perform a division.
Continuing with FIG. 3, step 312 then checks whether the multiplication is done. If so, the operation is over and the final result X/Y may be output from the 60-bit full adder 202. However, if the multiplication is not yet finished, then in step 314 the index i is incremented by 1 to prepare for another iteration through the loop. The loop then continues as the 60-bit Intermediate Result is fed from the 60-bit dedicated full adder 202 back into the Booth encoder 204. The algorithm continues in this iterative fashion until the multiplication is done. It should be noted that this conventional Booth encoder will only accept one row of data, and that adding step 310 is performed in every iteration, thus adding extra time per cycle and greatly increasing the time needed to perform such a multiplicative division. In addition, this technique requires the use of the additional 60-bit dedicated full adder 202, thus increasing the hardware complexity.
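The loop of FIG. 3 may be modeled in software roughly as follows. This is a simplified sketch under several assumptions: the operands are small unsigned integers rather than 60-bit fixed point values, truncation of the product back to 60 bits is omitted, the encoder is modeled as radix 4, and all function names are hypothetical. The step numbers in the comments refer to FIG. 3.

```python
def booth_digits(value):
    """Radix-4 Booth digits of a non-negative integer (assumed encoder model)."""
    padded = value << 1
    return [
        -2 * ((padded >> (2 * j + 2)) & 1)
        + ((padded >> (2 * j + 1)) & 1)
        + ((padded >> (2 * j)) & 1)
        for j in range(value.bit_length() // 2 + 1)
    ]


def compress(rows):
    """Carry-save reduction of the partial products to a sum row and a carry row."""
    rows = list(rows) + [0, 0]
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.append(a ^ b ^ c)
        rows.append(((a & b) | (a & c) | (b & c)) << 1)
    return rows[0], rows[1]


def iterate_products(x, r_values):
    """Multiply x by each r(i) in turn, counting the full additions performed."""
    intermediate = x                                   # step 302
    full_adds = 0
    for r in r_values:                                 # steps 312 and 314
        digits = booth_digits(intermediate)            # step 304
        partials = [(d * r) << (2 * j) for j, d in enumerate(digits)]  # step 306
        sum_row, carry_row = compress(partials)        # step 308
        intermediate = sum_row + carry_row             # step 310: carry-propagating add
        full_adds += 1
    return intermediate, full_adds


result, full_adds = iterate_products(7, [3, 5, 11])
```

The count returned alongside the result makes the cost visible: one carry-propagating addition is performed for every multiplication in the series, because `booth_digits` accepts only a single row.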
Various techniques have been used in order to avoid this additional adder and the extra cycle time it introduces. Typically, an N-bit full adder is slow because, as the two N-bit numbers are added, a carry generated in the lower order bits may ripple all the way through to the higher order bits. These high order bits must wait to be added until the carry has rippled all the way through the N bits. Carry look-ahead and carry select full adders may be used but are still relatively slow. It is for this reason that prior art techniques have attempted to create a design that avoids this carry ripple through all N bits. In particular, to avoid the N-bit full adder in iterative multiplicative algorithms, others have tried to design logic derived from an adder in order to generate the signals without a carry ripple, but the signals still need to be routed to a conventional Booth encoder. This technique is slow in terms of timing and delay. One such approach is disclosed in U.S. Pat. No. 5,280,439 issued on Jan. 18, 1994 to Quek, et al.
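The carry ripple described above can be made concrete with a small bit-level model of a ripple-carry adder. The function name is hypothetical, and the length of the longest run of consecutive carries is used here only as a rough proxy for delay; it is an assumption of this sketch and not a timing model of any real adder.

```python
def ripple_add(a, b, n_bits):
    """Add two unsigned n_bits-wide values bit by bit.

    Returns the sum and the longest run of consecutive bit positions
    that each produced a carry into the next position.
    """
    carry = 0
    total = 0
    chain = longest = 0
    for i in range(n_bits):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        total |= (bit_a ^ bit_b ^ carry) << i
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        chain = chain + 1 if carry else 0
        longest = max(longest, chain)
    return total | (carry << n_bits), longest


worst_sum, worst_chain = ripple_add((1 << 60) - 1, 1, 60)   # carry crosses all 60 bits
easy_sum, easy_chain = ripple_add(5, 2, 60)                 # no carry is generated
```

In the first call a carry generated at bit 0 travels through every higher position, which is the worst case that a 60-bit full adder must be designed to accommodate.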
Another technique that has been used to avoid this extra cycle time is described in "Redundant Binary Booth Encoding" by Chung Nan Lyu and David W. Matula of Southern Methodist University, 1995 IEEE. This paper describes using a sign digit redundant coding scheme in order to avoid the N-bit full adder. But using this technique requires a sign digit multiplier array that is not as efficient and fast as a normal binary multiplier array. Also, there is overhead involved in the conversion of sign digit coding from and to binary coding.
In summary, a conventional Booth encoder is typically used in a multiplicative divider to reduce the number of partial products in the multiplier array. This conventional Booth encoder only takes one row of data. However, the outputs of an N-bit multiplier array in microprocessor applications are usually in the form of a carry row and a sum row, and in iterative multiplicative algorithms the outputs of this N-bit multiplier array need to be fed back into the multiplier array. Thus, an N-bit full adder is needed to add these carry and sum rows together into one row of data so that the result may be fed back into the conventional Booth encoder. The addition performed by this N-bit full adder causes a one-cycle delay in each iteration, which severely degrades the performance of the multiplication.
Therefore, it would be desirable to have a multiplying apparatus using a modified Booth encoder that dispenses with the need for this additional full adder. It would further be desirable for this multiplying apparatus to be fast and simple in design.