Many data processing applications require that two operands be multiplied together. In particular, signal processing and data encryption applications depend on high speed multiplication operations, often with large word length operands.
The product of two operands is typically obtained through successive additions of shifted strings of bits, with each string representing an intermediate or partial product of one operand with a term from the other operand. The intermediate product terms are summed to obtain the final result. The product (P) of two operands (X and Y) can be represented as:

$$P = XY = X \times \sum_{i=0}^{n-1} y_i r^i = \sum_{i=0}^{n-1} X \times y_i r^i \qquad (1)$$
where $y_i$ is the value of the ith bit of the Y operand, r is the radix for the number system representation used, and the summation runs from i=0 to n-1, with n being the number of bits in the Y operand.
Equation (1) indicates that the multiplication operation is equivalent to the summing of n terms of the partial product ($X \times y_i r^i$). For a binary number representation system, the radix equals 2 and $y_i$ equals either 0 or 1. The ith term in the sum is then obtained by a left shift of operand X for i bit positions and multiplication by the digit $y_i$. The n terms are then summed.
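The summation in equation (1) can be illustrated for the binary case with a minimal Python sketch. The function name `shift_add_multiply` and its parameters are illustrative assumptions, not part of the described circuit, and Y is assumed to be an unsigned value that fits in n bits.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Sum the n partial products X * y_i * 2**i of equation (1) for radix 2.

    Assumes y is unsigned and fits in n bits (illustrative sketch only).
    """
    product = 0
    for i in range(n):
        y_i = (y >> i) & 1          # value of the i-th bit of Y (0 or 1)
        product += (x << i) * y_i   # X shifted left by i positions, times y_i
    return product


print(shift_add_multiply(13, 11, 4))
```

Each loop iteration corresponds to one term of the sum, so an n-bit multiplier always costs n shift-and-add steps in this basic form.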
Booth Recoding is a well known method for multiplying unsigned or two's complement numbers. The method is based on the observations that a string of zeros in an operand requires no addition of the partial product terms, just a shifting of the previous partial product, and that a string of ones in the multiplier extending from bit $2^p$ to $2^q$ ($q > p$) can instead be treated as the value $2^{q+1} - 2^p$. These observations have led to the development of a faster method for performing multiplication operations.
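The second observation can be checked numerically. The short Python sketch below compares the value of a run of ones occupying bit positions p through q with the difference of two powers of two; the helper names are hypothetical and chosen only for this illustration.

```python
def run_of_ones(p: int, q: int) -> int:
    """Value of a binary string with ones in bit positions p..q inclusive."""
    return sum(1 << k for k in range(p, q + 1))


def run_as_difference(p: int, q: int) -> int:
    """The same value expressed as 2**(q+1) - 2**p."""
    return (1 << (q + 1)) - (1 << p)


# Ones in positions 1..3 give binary 1110, which equals 16 - 2.
print(run_of_ones(1, 3), run_as_difference(1, 3))
```

Replacing the q - p + 1 additions for the run with one subtraction and one addition is what allows the method to skip over strings of ones.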
Booth's method is carried out by the following steps. Let $x_i$ be the ith bit of an n-bit multiplier X. Bit $x_{n-1}$ is the most significant bit and $x_0$ is the least significant bit. A bit $x_{-1} = 0$ is assumed in order to provide closure of the method. The multiplicand is Y. Starting with i=0, bits $x_i$ and $x_{i-1}$ of the multiplier are compared. Based on the comparison, the indicated action is performed:
______________________________________
$x_i$   $x_{i-1}$   Action
______________________________________
0       0           Shift Y left with respect to partial product
0       1           Add Y to partial product, then shift Y
1       0           Subtract Y from partial product, then shift Y
1       1           Shift Y
______________________________________
This process is repeated until n comparisons are completed. The result is the product of the two operands.
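The comparison table above can be sketched in Python as follows. This is a software illustration under stated assumptions, not the circuit itself: the name `booth_multiply` is hypothetical, the multiplier X is assumed to be an n-bit two's complement value, and Python's unbounded integers stand in for the partial product register.

```python
def booth_multiply(x: int, y: int, n: int) -> int:
    """Radix-2 Booth multiplication of an n-bit two's complement multiplier x
    by a multiplicand y, following the (x_i, x_{i-1}) action table."""
    x_bits = x & ((1 << n) - 1)   # n-bit two's complement pattern of x
    product = 0
    prev = 0                      # assumed bit x_{-1} = 0
    shifted_y = y                 # Y aligned with the current bit position
    for i in range(n):
        cur = (x_bits >> i) & 1
        if (cur, prev) == (0, 1):
            product += shifted_y  # end of a run of ones: add Y
        elif (cur, prev) == (1, 0):
            product -= shifted_y  # start of a run of ones: subtract Y
        # (0, 0) and (1, 1): no addition, only the shift below
        shifted_y <<= 1           # shift Y left relative to the partial product
        prev = cur
    return product


print(booth_multiply(-3, 7, 4))
```

With exactly n comparisons the most significant bit of X carries negative weight, so an unsigned multiplier needs a leading zero bit (a larger n) in this sketch.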
The above description of Booth's method is based on comparing two bits of one of the operands at a time. If a higher radix value is used, the method can be extended to comparisons of three or more bits. This further increases the speed with which the multiplication operation is implemented. For example, given two operands expressed as base 4 (modulo 4) numbers, if three bits of the multiplier X are examined during each comparison, the multiplicand terms to be added or subtracted are 0, Y, -Y, 2Y, and -2Y. The table below shows the appropriate factor to add based on a comparison between bits i+1, i, and i-1 of the multiplier operand X:
______________________________________
Current Pair      Previous Bit
i + 1     i       i - 1            Factor
______________________________________
0         0       0                0
0         0       1                +Y
0         1       0                +Y
0         1       1                +2Y
1         0       0                -2Y
1         0       1                -Y
1         1       0                -Y
1         1       1                0
______________________________________
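The three-bit table can be exercised with a minimal Python sketch. The names `FACTOR` and `booth_radix4_multiply` are illustrative assumptions; the multiplier X is assumed to be an n-bit two's complement value with n even, and the bit below the least significant bit is taken as 0, as in the two-bit method.

```python
# Factor of Y selected by bits (i+1, i, i-1) of the multiplier X.
FACTOR = {
    (0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_multiply(x: int, y: int, n: int) -> int:
    """Radix-4 Booth multiplication; x is n-bit two's complement, n even."""
    if n % 2:
        raise ValueError("n must be even for radix-4 recoding")
    x_bits = x & ((1 << n) - 1)   # n-bit two's complement pattern of x
    product = 0
    prev = 0                      # bit below the least significant bit
    for i in range(0, n, 2):      # one recoding step per pair of bits
        low = (x_bits >> i) & 1
        high = (x_bits >> (i + 1)) & 1
        product += (FACTOR[(high, low, prev)] * y) << i
        prev = high
    return product


print(booth_radix4_multiply(-3, 7, 4))
```

Each step consumes two multiplier bits, so the number of partial products is halved relative to the two-bit comparison.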
FIG. 1 is a block diagram of a prior art circuit for a multiplier 10 which uses Booth's recoding method to multiply two operands. The multiplication operation executed by multiplier 10 can be described in terms of three processing stages. During the first stage, data representing operands A and B is loaded. During the second stage, operand B is shifted in groups of bits (where each group contains 4 different bits in the case of a modulo 4 recoder) into a Booth recoder, the operand is recoded, and the resultant partial product terms are formed and accumulated. The accumulation phase produces partial sum and carry save data for the sums of the partial products. This stage produces 4 bits of the final product per clock cycle by using a 4 bit carry look ahead adder to combine the least significant bits of the partial products. The final product data is stored in a 512 bit accumulator. The stage continues until all of operand B has been recoded (256 bits in the case of this example), with the 256 bits of final product data generated forming the 256 least significant bits of the final result. In the final stage, the final partial sum and carry save data is added together to produce the 256 most significant bits of the final result. The circuit elements used to implement each of the three stages will now be described.
The data representing operands A and B is input by means of 32 bit data bus 12. The multiplicand operand A data is retrieved from bus 12 and loaded into 256 bit shift register 14, in 32 bit groups, one group with each clock cycle, where clock signal (CLKS) 15 controls the loading of the 32 bit data groups. As operand A is 256 bits in size in this example, 8 clock cycles are required to complete loading it into register 14.
Operand A multiplexer 13 is used to control the loading of data into register 14, and in particular, to maintain the register in an idle state after the operand A data has been loaded and the other operations of the multiplier are being executed. Multiplexer 13 has two inputs: a first input signal which instructs the multiplexer to load operand A data, shifting 32 bit wide groups of operand A data into register 14; and a second input signal which instructs the register not to shift the data being loaded. The no-shift control signal is used during the clock cycles after operand A has been fully loaded in order to maintain the entire operand A data in the register. This capability is needed because clock signal 15 is continuously provided to register 14, which causes the contents of the register to be shifted out with each clock cycle. Therefore, multiplexer 13 is used to provide an idle state so that the data flow into register 14 is properly coordinated with the multiplication stages. In this case, multiplexer 13 and a feedback loop are used to maintain the full 256 bit operand A data in the register for use with the Booth recoding process while clock signal 15 is clocking the register.
Operand A multiplexer 13 decodes the load operand A data, the shift 32 bit wide data groups, and no shift input signals so that the 32 bit shifted data groups of the 256 bit input or the non-shifted 256 bit input to the multiplexer are connected to the multiplexer output. The data shifting function is obtained in a known manner by means of the connections between the multiplexer and register 14. The control signals for selecting which function is implemented by multiplexer 13 are provided by an external sequencer or state machine (not shown) in accordance with the phase of the multiplication operation being executed.
After all of the operand A data has been loaded, multiplier operand B is then loaded in 32 bit groups into 256 bit shift register 16, where register 16 is controlled by clock signal CLKS 15. Operand B multiplexer 17 is used to control the functioning of continuously clocked register 16 in accordance with the stage of the multiplication operation being carried out. Multiplexer 17 has three inputs: a first input signal which instructs the multiplexer to load the operand B data, shifting 32 bit wide groups of the data into register 16; a second input signal which instructs the register not to shift the data and which is used to produce an idle state; and a third input signal which instructs the register to shift the operand B data out of the register in groups of 4 bits. As in the case of multiplexer 13, the control signals for selecting which function is implemented by multiplexer 17 are provided by an external sequencer or state machine in accordance with the phase of the multiplication operation being executed. As operand B is 256 bits in size in this example, 8 clock cycles are required to complete loading it into register 16. Thus, in this example, a total of 16 clock cycles are required to load operands A and B into their respective registers. Furthermore, because of the design of this multiplier, the operands must be fully loaded before the Booth recoding process can begin.
The operand B data is shifted out of register 16 in 4 bit groups because application of Booth's method using a two stage modulo 4 recoder (as in the present example) requires 4 bits of operand B for each recoding operation. The 4 bit groups of operand B data are transferred to Booth Recode Decoder module 18 by means of data bus 19. Booth Recode module 18 evaluates multiplier operand B in successive bit fields to determine what factor of multiplicand operand A to use in forming the partial product terms which are added together to obtain the final product. Since Booth module 18 is a two stage recoder, 2 successive bit fields are recoded during each clock cycle. Each bit field recoding produces two least significant bits of an uncorrected result for the final product and a modulo 4 carry term. Booth module 18 thus produces 4 least significant bits of uncorrected final product data and 2 modulo 4 carry bits per clock cycle. As operand B is 256 bits long in this example, it takes approximately 64 clock cycles (256 bits/4 recoded bits per cycle) to recode the entire operand.
The result of the recoding operation is a control signal which instructs Booth module 18 to select the appropriate factor of operand A (0, A, -A, 2A, or -2A) to use in forming the partial product terms. Since two recode stages are used in Booth recoder 18 of this example, Booth recoder 18 outputs two factors of operand A each clock cycle.
One factor of operand A serves as an input to Partial Sum/Carry Save (PS/CS) Adder Array 0 20, while the second factor of operand A serves as an input to Partial Sum/Carry Save (PS/CS) Adder Array 1 22. Thus, as each group of 4 different bits of operand B is recoded during a clock cycle, two factors of operand A are selected and transferred to adders 20 and 22.
Each of the two PS/CS adders 20 and 22 produces a 260 bit partial sum and a 260 bit partial carry term. As each of the 260 bit wide partial product terms (the factors of operand A) are provided to adders 20 and 22, they are added to the results of the previous addition operation performed by the adders. This results in a new partial sum term and a new carry save term. The two least significant bits of the partial sum term and the least significant bit of the carry save for each addition operation are provided to 4 bit full look ahead carry adder 24. As both adders 20 and 22 are generating partial sum and carry save terms during each clock cycle, two sets of least significant partial sum and carry save bits are provided to adder 24, for a total of 4 least significant bits of partial sum data and two bits of carry save data. This data is combined in adder 24 with the modulo 4 carry bit generated by each recoder stage of Booth recoder 18.
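The partial sum and carry save representation can be illustrated with a simplified Python sketch. It models only the arithmetic idea (three inputs reduced to a sum word and a carry word with no carry propagation), not the 260 bit width, the two-array pipeline, or the extraction of least significant bits for adder 24. The function names are hypothetical and the inputs are assumed to be non-negative.

```python
def carry_save_add(ps: int, cs: int, term: int) -> tuple[int, int]:
    """Combine the previous partial sum, previous carry save word, and a new
    term into a new (partial sum, carry save) pair without propagating carries."""
    new_ps = ps ^ cs ^ term                                 # bitwise sum
    new_cs = ((ps & cs) | (ps & term) | (cs & term)) << 1   # bitwise carries
    return new_ps, new_cs


def accumulate(terms: list[int]) -> int:
    """Accumulate terms in carry-save form, then resolve with one final add."""
    ps, cs = 0, 0
    for term in terms:
        ps, cs = carry_save_add(ps, cs, term)
    return ps + cs   # single carry-propagating addition at the end


print(accumulate([5, 9, 14, 3]))
```

The invariant is that the partial sum plus the carry save word always equals the running total, which is why a single carry-propagating addition is deferred to the end.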
Adder 24 adds the 4 least significant bits of the partial sums produced by adders 20 and 22 during a clock cycle to the 2 carry save bits and the 2 bits of modulo 4 carry data provided by Booth recoder 18. This produces 4 bits of the final product term. Each 4 bit group of final product data produced by adder 24 is shifted into multiplexer 26 which loads 512 bit accumulator 28.
Multiplexer 26 has four different control signals as inputs: a signal which instructs accumulator 28 to shift the data input by 4 bits; a signal which instructs accumulator 28 to shift the data input by 32 bits; a signal which instructs accumulator 28 not to shift the data; and a signal which instructs accumulator 28 to shift the data by 1 bit. As adder 24 produces 4 bit groups of the final product, multiplexer 26 controls the loading of accumulator 28 with the data by shifting the data by 4 bit increments. When operand B is completely recoded and the partial products accumulated, the lower 256 bits of 512 bit accumulator 28 will be filled. The shift data by 32 bits function is used to dump the accumulator data to data bus 40. As discussed previously, the no shift function is used to implement an idle state in which the data is continually clocked back into accumulator 28. This function is needed because the accumulator registers are continuously clocked and the accumulator function is not utilized during all stages of the multiplication operation. The shift data by 1 bit function is used to provide a term of the form 2*(A*B) for use in computing the terms in the square of the sum of two operands.
After all of operand B has been recoded, the appropriate factors of operand A have been added in adders 20 and 22, and the partial sum and carry save data for each cycle has been transferred to adder 24, registers 30 and 32 contain the most significant bits of the carry save operations performed on the operand A factors. CS register 30 is 260 bits in size and is clocked by clock signal 15. PS register 32 is 260 bits in size and is similarly clocked by clock signal 15. The contents of CS register 30 and PS register 32 are used to implement the final addition operation which produces the upper 256 bits of the final product. CS shift register 30 and PS shift register 32 are loaded under the control of multiplexers 34 and 36, respectively.
The final addition stage is performed using the same adders as were used to produce the lower 256 bits of the final product. The contents of registers 30 and 32 are fed back into adder 20 by means of data busses 33 and 35, with adder 20 transferring data to adder 22 by means of data bus 37. As operand B has been completely recoded, operand B register 16 contains all zeros. Thus, the adders are performing an operation equivalent to (A*0 + CS + PS). After adders 20 and 22 are loaded with the contents of registers 30 and 32, the multiplier unit is cycled through the 64 cycles normally required to accumulate the partial products. However, because in this situation operand B is zero, the effect of the cycling is to add the contents of registers 30 and 32.
The result is that during each cycle, the 2 least significant bits from each of adders 20 and 22 are added together in 4 bit adder 24 to produce a 4 bit group of the most significant bits of the final product. Each 4 bit group of the most significant bits of the final product is loaded into 512 bit accumulator 28 using the 4 bit shift instruction of multiplexer 26. After accumulator 28 is loaded with the 256 most significant bits of the final product term, the multiplication operation is complete. The data is clocked out of accumulator 28 in 32 bit groups and placed on data bus 40.
In the multiplier of FIG. 1, operands A and B must be completely loaded into registers 14 and 16 before the Booth recoding operations are commenced. Given a data bus of width d which can transfer d bits per clock cycle, if the operands are m bits long, then this design requires 2m/d clock cycles to transfer the operands into the registers. This means that 16 clock cycles are required to load two 256 bit operands into their respective registers, assuming the operands are loaded 32 bits at a time. This delays the start of the operand processing until the completion of the 16 clock cycles.
The multiplier design of FIG. 1 is typical in that it uses carry-save addition and registering to minimize circuitry and increase the multiplication rate. High speed multiplication and exponentiation operations require large Booth adder arrays having large partial sum and partial carry registers. Multiplying two m bit operands using a radix 4 Booth recoding multiplier requires approximately m/(2n) clock cycles to generate the least significant half of the final product, where n is the number of Booth recoder adder stages. The number of Booth recoder adder stages is equal to the number of bit groups which are recoded during a single clock cycle. After these m/(2n) cycles, the most significant upper half of the product is obtained by summing the contents of the partial sum and partial carry registers. As noted, this final addition is typically executed using the same Booth adders as were used to accumulate the partial products and carry terms in the previous stages of the multiplication operation.
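The two cycle-count expressions given for the FIG. 1 design, 2m/d for operand loading and approximately m/(2n) for generating the least significant half of the product, can be evaluated with a small Python sketch. The function names are illustrative, and rounding up to a whole cycle is an assumption made here for operand sizes that do not divide evenly.

```python
def load_cycles(m: int, d: int) -> int:
    """Clock cycles to load two m-bit operands over a d-bit bus (2m/d)."""
    return -(-2 * m // d)          # ceiling division


def recode_cycles(m: int, n: int) -> int:
    """Approximate clock cycles for a radix-4 Booth multiplier with n recoder
    adder stages to produce the least significant half of the product."""
    return -(-m // (2 * n))        # ceiling division


# Values from the example: 256 bit operands, 32 bit bus, two recoder stages.
print(load_cycles(256, 32), recode_cycles(256, 2))
```

For the example in the text this reproduces the 16 load cycles and the 64 recoding cycles.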
An important aspect of the multiplier design of FIG. 1 relates to the manner in which it performs exponentiation operations which are often used in encryption applications. It is well known that exponentiation operations can be accelerated by performing squaring operations. Thus, in some cases it is desirable to efficiently calculate the terms in the expression for the square of the sum of two operands. The multiplier of FIG. 1 typically performs a squaring operation of the sum of operands A and B (where $(A+B)^2 = A^2 + 2AB + B^2$) by adding the product term A*B twice to the accumulator. Thus, this type of multiplier calculates the intermediate term in the form (A*B)+(A*B). This approach uses an extra addition operation to replace the second multiplication operation that would otherwise be required, and provides an increase in the speed with which the calculation can be performed. Another method of calculating the 2AB term is to form the A*B product term and then shift the term by one bit in accumulator 28 to form the 2*(A*B) term. This is even faster than performing the extra addition. However, this method has the disadvantage that the circuitry used for performing the shift must be capable of handling a 512 bit shift, and hence consumes a large amount of die area and is expensive to implement.
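The two ways of forming the 2AB term described above can be compared in a short Python sketch. It shows only the arithmetic equivalence, not the cost in cycles or die area; the function name `square_of_sum_terms` is a hypothetical helper, and the operands are assumed to be non-negative integers.

```python
def square_of_sum_terms(a: int, b: int) -> tuple[int, int, int]:
    """Return the terms A*A, 2*A*B, and B*B of (A + B)**2."""
    ab = a * b
    two_ab_by_add = ab + ab        # product term added to itself
    two_ab_by_shift = ab << 1      # product term shifted left by one bit
    assert two_ab_by_add == two_ab_by_shift
    return a * a, two_ab_by_shift, b * b


terms = square_of_sum_terms(3, 5)
print(terms, sum(terms))
```

Both forms need only one A*B multiplication; they differ in whether the doubling is done by an extra addition or by a one bit shift of the full-width product.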
Another feature of the multiplier of FIG. 1 is that a single clock signal is used to control the shifting of data into shift registers 14, 16, 30, and 32, and accumulator 28. Thus, all data loading and processing functions for the multiplication operation are continuously clocked by a common clock signal, with multiplexers used to produce an idle state so as to maintain the status of the registers after the data has been loaded. As this design uses synchronously clocked circuitry, power consumption is dependent upon the clock frequency. Since a high clock frequency is desirable for fast processing operations, this feature results in a high level of power consumption.
What is desired is a multiplier capable of calculating the square of the sum of two operands using the Booth recoding method which is implemented in a faster and more efficient architecture than currently used multipliers. These and other advantages of the present invention will be apparent to those skilled in the art upon a reading of the Detailed Description of the Invention together with the drawings.