Embodiments of the present invention are directed generally to methods and apparatuses for providing digital multiplication, and more specifically to methods and apparatuses that use Wallace multipliers. Embodiments of the invention relate to a multiplier partitioned into two sections wherein one section generates partial products that represent low order bits and a second section that generates partial products that represent high order bits.
Many techniques for providing fast multiplication in various computer systems have been proposed. In general, these techniques are used for integer multiplication or for generating the mantissa in floating point multiplication. The following is a description of the prior art techniques that affect processing speeds in digital multipliers.
In the most basic case, the product of two N-bit binary operands (one a multiplicand, and the other a multiplier) is obtained by using a sum of partial products wherein the partial products, representing an interim result, are obtained by separately multiplying each bit of the multiplier by all the bits in the multiplicand and repeating this step with the next most significant bit of the multiplier until each bit in the multiplier has been used. As each partial product is obtained, it is offset by one bit to the left from the preceding partial product, and then the partial products as a whole are summed. The resulting summation of partial products generates the final product. This summation of partial products technique is typically referred to as long-hand multiplication. The technique is slow and requires the use of several internal registers in a computer system to perform the arithmetic.
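The long-hand procedure described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `partial_products` and `longhand_multiply` and the `n_bits` parameter are assumptions introduced here, and the operands are treated as unsigned integers.

```python
def partial_products(multiplicand: int, multiplier: int, n_bits: int) -> list[int]:
    """Return one partial product per multiplier bit, least significant bit first."""
    rows = []
    for i in range(n_bits):
        bit = (multiplier >> i) & 1
        # Each row is the multiplicand (or zero) offset i positions to the left.
        rows.append((multiplicand << i) if bit else 0)
    return rows


def longhand_multiply(multiplicand: int, multiplier: int, n_bits: int) -> int:
    """Sum all partial products to obtain the final product."""
    return sum(partial_products(multiplicand, multiplier, n_bits))
```

For N-bit operands the sketch forms N partial products and sums them all, which is the cost that the faster techniques discussed below try to reduce.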
As a result, most digital multipliers in a computer system typically add only two partial products at a time, and each partial sum is then added to the sum of the previous partial products. Such a technique reduces process time and memory allocation. However, in creating fast digital multipliers, more efficient techniques like Booth's algorithm or Wallace trees may be employed. The use of Booth's algorithm allows a fast digital multiplier to reduce the number of partial products by using a method of recoding or encoding one of the operands (e.g., the multiplier). Essentially, by using Booth recoding, one is able to accelerate the addition and the formation of partial products.
Booth's algorithm (also referred to as modified Booth recoding) essentially recodes a multiplier in order to effectively reduce the number of partial products. By reducing the number of required partial products and the associated additions, the speed of performing a multiplication operation may be increased by a factor of about two. The techniques of Booth's algorithm and/or modified Booth recoding are described in “A Signed Binary Multiplication Technique” by Andrew Booth, Q. J. Mech. Appl. Math. 4:236-240 (1951); and in “Booth Encoded Multiplier Generator Using Optimized Wallace Trees”, by Fadavi-Ardekani, J., IEEE Trans. on VLSI Systems 1(2), p120-125 (1993), which are incorporated herein by reference.
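As a hedged illustration of how recoding roughly halves the number of partial products, the following Python sketch uses the common radix-4 form of modified Booth recoding. The choice of the radix-4 variant is an assumption, since the text does not specify one, and the names `booth_radix4_digits` and `booth_multiply` are hypothetical. The multiplier is interpreted as an `n_bits` two's-complement value with `n_bits` even.

```python
def booth_radix4_digits(multiplier: int, n_bits: int) -> list[int]:
    """Recode an n_bits two's-complement multiplier into radix-4 digits in -2..2."""
    if n_bits % 2:
        raise ValueError("n_bits must be even for this sketch")
    bits = multiplier & ((1 << n_bits) - 1)
    digits = []
    prev = 0  # implicit zero bit to the right of the least significant bit
    for i in range(0, n_bits, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        # Each overlapping group of three bits selects one signed digit.
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, n_bits: int) -> int:
    """Sum one partial product per radix-4 digit (n_bits / 2 rows in total)."""
    digits = booth_radix4_digits(multiplier, n_bits)
    return sum(d * multiplicand * 4**i for i, d in enumerate(digits))
```

An 8-bit multiplier yields four digits, so four partial products are summed instead of eight, consistent with the factor of about two mentioned above.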
Fast multipliers that use Wallace trees to speed up the summation of partial products are well known. In these fast multipliers, a Wallace tree is used to perform the summation of partial products in a multiplication process. FIG. 1 shows a Wallace tree 43 that includes seven levels of full adders or carry save adders (CSA's numbering one through eighteen), plus a carry propagate adder 41. In a Wallace tree, a full adder or CSA takes three one-bit binary numbers, and generates one two-bit binary number. The two-bit binary number may either be used as an input to another CSA or used to represent the sum of previous inputs. In order to generate the first layer of the tree, the partial products are separated into columns, and bits with the same weight are grouped together (i.e., 1's, 2's, 4's, etc.). Next, the bits in each column are divided into groups of three. Each group of three bits is fed to a full adder, which creates a two-bit output. The output from the first layer of the tree represents the same sum as the partial products; however, there are only two-thirds as many bits.
The remaining layers of the Wallace tree are generated by repeating the same process described for building the first layer: separating the bits into columns of equal weight, separating the columns into groups of three bits, and summing them with full adders. This results in an even smaller group of bits. Layers are added until the number of bits is reduced to two per column, at which point the two rows of bits are added with the carry propagate adder 41. By using a Wallace tree method, computer processing time may be reduced from being directly proportional to N (the number of bits) to being proportional to the base-3/2 logarithm of N. This process is discussed by C. S. Wallace in an article entitled “A Suggestion for a Fast Multiplier” in IEEE Transactions on Electronic Computers, February (1964), which is incorporated herein by reference.
Referring to FIG. 1, an example of a Wallace multiplier 43 with twenty inputs labeled W1 to W39 is shown. The Wallace multiplier 43 includes eighteen three-input adders forming seven levels of adders. The various inputs labeled W1 to W39 represent data inputs to the adders. Note that each of the adders reduces three inputs to two outputs (3:2). The outputs are then provided as inputs to a succeeding adder. As an example, the inputs W3, W5, and W7 (collectively the “three summands”) will be traced through the system. The three summands are inputs to a carry save adder 18 on level seven of the Wallace tree. The three summands are compressed into two outputs representing a sum (s) bit and a carry (c) bit. The sum (s) bit and carry (c) bit become inputs to a carry save adder 12 on level six of the Wallace tree 43. The carry save adder 12 receives a third input W9. Therefore, the carry save adder 12 has three inputs and compresses those inputs into two outputs, namely, another carry (c) and sum (s) bit.
The outputs from the carry save adder 12 become inputs to another carry save adder 8 on level five of the Wallace tree 43. In addition, the carry bit from a carry save adder 13 on level six of the Wallace tree 43 completes the third input for the carry save adder 8 on level five of the Wallace tree 43. The three inputs for the carry save adder 5 on level four of the Wallace tree 43 include the carry bit and sum bit from the carry save adder 8 from level five of the Wallace tree 43, and the input W1. The three inputs of the carry save adder 5 on level four are compressed into two outputs, a sum bit and a carry bit. The two outputs from the carry save adder 5 become inputs to another carry save adder 3 located on level three of the Wallace tree. A third input for the carry save adder 3 is obtained from the carry bit of another carry save adder 6 on level four of the Wallace tree 43. The three inputs to the carry save adder 3 are compressed into two outputs, a carry bit and sum bit.
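The logarithmic growth in the number of layers can be checked with a short Python sketch. The function name `wallace_levels` is an assumption introduced here; the sketch only counts layers and assumes every complete group of three rows is compressed to two on each layer while leftover rows pass through unchanged.

```python
def wallace_levels(rows: int) -> int:
    """Count the 3:2 compression layers needed to reduce `rows` operands to two."""
    levels = 0
    while rows > 2:
        groups, leftover = divmod(rows, 3)
        # Every full group of three rows becomes two; leftovers are carried forward.
        rows = 2 * groups + leftover
        levels += 1
    return levels
```

Under these assumptions, `wallace_levels(20)` returns 7, which agrees with the seven levels of the twenty-input tree of FIG. 1, and `wallace_levels(9)` returns 4.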
The sum bit and the carry bit from the carry save adder 3 are provided as inputs to different carry save adders. The sum bit from the carry save adder 3 is provided as an input to another carry save adder 2 located on level two of the Wallace tree. The remaining two inputs to the carry save adder 2 are obtained from the carry and sum bits of a carry save adder 4 located on level three of the Wallace tree. The carry bit of the carry save adder 3 located on level three of the Wallace tree is provided as an input to a carry save adder 1 located on level one of the Wallace tree. The remaining two inputs to the carry save adder 1 are the carry and sum bits obtained from a carry save adder 2 located on level two of the Wallace tree. Finally, the three inputs to the carry save adder 1 located on level one of the Wallace tree are compressed into a sum and carry bit. The sum (s) and carry (c) bits of the carry save adder 1 are inputs to a carry propagating adder 41 that produces a final sum.
The above algorithm is only a representative example of the types of combinations that might occur in a Wallace tree reduction. Note that this algorithm is repeated until the original twenty inputs are reduced to two final outputs that reach a carry propagate adder 41. The carry propagate adder 41, or full binary adder (FBA), provides the final sum in a time that is proportional to the logarithm of the number of inputs.
The Wallace multiplier 43 is able to substantially reduce summation time. In general, the Wallace multiplier 43 is composed of carry save adders (CSA's), also known as 3:2 compressors. That is, each CSA receives three inputs and produces two outputs, and all inputs have the same weight. The sum output has a weight of one and the carry output has a weight of two. For example, Table 1.0 shows the results of a 3:2 CSA, which has three inputs (I1, I2, I3) and two outputs, the sum (S) and a carry bit (C), with the appropriate weights assigned.
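The behavior of a single 3:2 CSA can be enumerated with a small Python sketch. The function name `csa_3_2` is an assumption; the input and output names follow the I1, I2, I3, S and C labels used above. The sketch checks that the weighted outputs always equal the number of ones at the inputs.

```python
from itertools import product


def csa_3_2(i1: int, i2: int, i3: int) -> tuple[int, int]:
    """Compress three equal-weight bits into a sum bit (weight 1) and a carry bit (weight 2)."""
    s = i1 ^ i2 ^ i3
    c = (i1 & i2) | (i1 & i3) | (i2 & i3)
    return s, c


print("I1 I2 I3 | C S")
for i1, i2, i3 in product((0, 1), repeat=3):
    s, c = csa_3_2(i1, i2, i3)
    # The weighted outputs preserve the arithmetic sum of the inputs.
    assert i1 + i2 + i3 == s + 2 * c
    print(f" {i1}  {i2}  {i3} | {c} {s}")
```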
The following example describes a binary operation using 3:2 CSA's which are organized hierarchically like a Wallace tree. Referring to FIG. 2A, the multiplicand is 110101011₂ (427₁₀), and the multiplier is 110110011₂ (435₁₀). In this instance, the 9-bit multiplicand is aligned vertically forming nine separate rows, and the multiplier is aligned horizontally. The multiplication procedure begins by multiplying the first bit of the multiplicand by the multiplier and entering the results starting in the leftmost bit position. The second bit of the multiplicand is then used to multiply each bit of the multiplier. The results are then entered onto the second row; however, the first bit is shifted to the right by one bit position. The succeeding rows are each similarly multiplied with the result being shifted to the right by one bit position. Alternatively, the multiplication may be performed by (1) rewriting the multiplier for each row that contains a bit equal to one in the multiplicand; (2) entering zeroes for each row that contains a bit equal to zero in the multiplicand; and (3) shifting each successive row, excluding the first, to the right by one bit position.
Again referring to FIG. 2A, the above procedure produces a parallelogram that is a 9-bit × 17-bit array of partial products. Because 3:2 CSA's are being used, the inputs are rows of partial products that are combined into groupings of three. In this case, three input groups, each group consisting of three rows of partial products, are labeled 11 (one-one), 12 (one-two), and 13 (one-three). Because the carry bit output of the CSA's has a weight of two, the output is shifted one bit to the left. Referring to FIG. 2B, when group one-one is added, note that the summation of the three rows of partial products (including one row of zeroes not shown) produces both a sum and a carry. As stated, the sum and carry bits are stored in separate registers, and the results of sum 11 (s11) and carry 11 (c11) are shown. Referring to FIG. 2C and FIG. 2D, the summation of groups one-two and one-three are also shown. Group one-two produces sum12 (S12) and carry 12 (c12), and group one-three produces sum13 (S13) and carry 13 (c13). In general, one level of CSA operations has reduced nine rows of partial products into six rows.
Referring to FIGS. 2B-2D, the next step is to reduce the six rows of partial products into four rows. Therefore, another three inputs are provided to a CSA for compression. Referring to FIG. 2E, the compression is performed on sum11 (s11), carry11 (c11), and sum12 (S12). These three inputs are provided as an input to one CSA, and carry12 (c12), sum13 (s13), and carry13 (C13) are provided as the three inputs to another CSA. Referring to FIG. 2F, two CSA's are then used to produce four rows of partial products. The four rows of partial products are shown in FIG. 2F and FIG. 2G, as groups 21 (two-one) and 22 (two-two). Referring to FIG. 2F, group two-one contains sum21 and carry21, and as shown in FIG. 2G, group two-two is composed of sum22 (s22) and carry22 (c22). The four rows are then compressed. Specifically, three more inputs sum21 (S21), carry21 (c21), and sum22 (s22) are forwarded to one more CSA and compressed into two rows of partial products. This compression produces sum31 (S31) and carry31 (c31) as shown in FIG. 2H. This output, along with carry22 (C22), is then directed to another CSA. The three values sum31, carry31 and carry22 are compressed into two outputs sum41 (S41) and carry41 (c41). Referring to FIG. 2I, the two outputs, sum41 (s41) and carry41 (c41), are added together using a carry propagate adder 41 to produce the final product 101101010110010001₂ (185745₁₀). Note that eight CSA operations were performed; however, because some operations were performed simultaneously, a time delay of only four CSA's was incurred.
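The row-wise reduction of this example can be reproduced with a minimal Python sketch. The names `csa_rows` and `wallace_multiply` are assumptions introduced here, whole rows are modeled as unbounded Python integers rather than fixed-width registers, and the final built-in addition stands in for the carry propagate adder 41. Rows are grouped in order, three at a time, which happens to reproduce the grouping described above (s11, c11, s12 and c12, s13, c13 on the second level).

```python
def csa_rows(a: int, b: int, c: int) -> tuple[int, int]:
    """Apply a 3:2 CSA to every bit column of three rows at once."""
    sum_row = a ^ b ^ c
    # Carry bits have a weight of two, so the carry row is shifted left by one.
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def wallace_multiply(multiplicand: int, multiplier: int, n_bits: int) -> tuple[int, int]:
    """Return (product, number of CSA levels) for unsigned n_bits operands."""
    rows = [(multiplicand << i) if (multiplier >> i) & 1 else 0 for i in range(n_bits)]
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        next_rows = []
        for k in range(0, full, 3):
            next_rows.extend(csa_rows(*rows[k:k + 3]))
        next_rows.extend(rows[full:])  # leftover rows pass to the next level
        rows = next_rows
        levels += 1
    # The remaining two rows are summed by an ordinary carry-propagating add.
    return sum(rows), levels
```

Calling `wallace_multiply(427, 435, 9)` returns `(185745, 4)`, matching the final product and the four-CSA delay of the example.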
Various hardware implementation schemes for multipliers may be used to perform the above described multiplication procedure of FIG. 2. FIG. 3 shows a schematic diagram of a classical multiplier 40 disclosed in the prior art that performs such a function. The multiplier 40 includes two 32-bit registers. A first register 42 stores a 32-bit multiplicand M, and a second register 44 holds a 32-bit multiplier Q. The second register 44 is a shift register for shifting the multiplier to the right as indicated by an arrow above the second register 44. The first register 42 and the second register 44 are electrically connected through an AND gate 46. The output from the AND gate 46 provides an input to a 32-bit adder 48. The 32-bit adder 48 provides an input to a 65-bit shift register 50 that holds the final product A. A feedback loop 52 is used to send carry bits back to the 32-bit adder 48.
The multiplier 40 operates in the following manner: a multiplicand M is loaded into the first register 42 and a multiplier Q is loaded into the second register 44. The multiplicand M is multiplied by the first significant bit of the multiplier Q. In doing so, a partial product A1 is generated. The partial product A1 is forwarded to the 32-bit adder 48 and added to any previous partial products existing in the register 50. The result is stored in the 65-bit register 50. The multiplier Q is shifted 1-bit to the right in register 44 and the partial result A1 in register 50 is shifted 1-bit to the left. Using the next most significant bit of multiplier Q in register 44, partial products A1 through AN are obtained, where the multiplier Q is N-bits wide. The process is repeated until each bit of the multiplicand M in register 42 is multiplied by each bit of the multiplier Q in register 44. The final product A, resulting from the N partial sums, is stored in the 65-bit register 50. This classical approach to binary multiplication is neither fast nor efficient and the technique is typically not appropriate for a large array of numbers.
Referring to FIG. 4, another type of multiplier, a classical array multiplier 74, is shown. The array multiplier 74 is of the EV5/EV6 style (part numbers 21164 and 21264 respectively), which represents two generations of central processing unit (CPU) cores that implement a reduced instruction set computer architecture provided by Digital Equipment Corporation, now Compaq Computer Corporation.
In the classical array multiplier 74 of FIG. 4, the multiplicand bits 78 (labeled M'pcand) are stationary and extend across the top of the array. Note that from left to right, the bits are entered in descending bit order, that is, from the most significant bit to the least significant bit. Further, the multiplier bits 79 (labeled M'plier) are entered vertically downwards into the array. Note that from top to bottom, the multiplier bits are entered in ascending bit order, that is, from the least significant bit to the most significant bit. In addition, at each intersection between the multiplier bits 79 and the multiplicand bits 78, there exists an AND gate (not shown) having the multiplier bit 79 and the multiplicand bit 78 as inputs.
The array multiplier 74 is comprised of registers able to store both sum (s) bits and carry (c) bits. In the classical array multiplier 74, the carry bits (c) propagate vertically downwards while the summation bits propagate along the diagonal and to the right. This has the effect of making like product terms always appear on the diagonal. The low order bits appear incrementally from the side of the array and the high order bits appear simultaneously at the bottom of the array. Thus, because the products lie on a diagonal, it is very difficult to perform any column compression as occurs in a Wallace type compression. Therefore, another type of array multiplier able to implement a Wallace type compression is desired.
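Why like product terms fall on a diagonal can be seen from a small Python model of the AND-gate grid. This is only an illustrative sketch: the names `and_array`, `product_from_array` and `diagonal_terms` are assumptions, and the model ignores the sum and carry registers of the actual array.

```python
def and_array(multiplicand: int, multiplier: int, n_bits: int) -> list[list[int]]:
    """grid[i][j] is the AND of multiplier bit i and multiplicand bit j."""
    return [
        [((multiplier >> i) & 1) & ((multiplicand >> j) & 1) for j in range(n_bits)]
        for i in range(n_bits)
    ]


def product_from_array(grid: list[list[int]]) -> int:
    """Sum every AND term with its weight 2**(i + j)."""
    return sum(bit << (i + j) for i, row in enumerate(grid) for j, bit in enumerate(row))


def diagonal_terms(grid: list[list[int]], weight: int) -> list[tuple[int, int]]:
    """List the (row, column) cells whose product term has the given weight."""
    n = len(grid)
    return [(i, weight - i) for i in range(n) if 0 <= weight - i < n]
```

In this model every term of weight 2**w sits at a cell where row plus column equals w, so terms of equal weight occupy a diagonal rather than a vertical column, which is what makes column-wise compression awkward in this arrangement.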
In multiplying large arrays of numbers, typical digital multipliers are slow and cumbersome, and are generally associated with excessive processing times. However, embodiments of the present invention overcome drawbacks of typical digital multipliers by creating a virtual multiplier by shifting the multiplicand as it traverses the array while the product terms remain stationary. In one embodiment of the present invention, low and high order bit results are obtained separately, and Wallace compression is used in an array multiplier to reduce the number of carry save adders in the delay path, thereby reducing computer processing time. Embodiments of the present invention take advantage of a system that separately requires the low and high order bits during a particular calculation. Specifically, the higher order bits of a product are calculated last or at a slower rate than the low order bits, and a processor may obtain the necessary low order bits first and await delivery of the higher order bits last.
Further, in embodiments of the invention both the low order results and high order results generated during an array multiplication procedure leave the array through the bottom of the array without routing the product terms through a control section. This procedure facilitates obtaining and providing, upon command, either the low order bits, or the high order bits for a particular calculation resulting in less area and wiring complexity. In addition, the product terms appear in the same bit position which allows for a Wallace type compression that further reduces computer processing time and the number of adders required.
In one aspect, the invention features a processor that generates at least one N by M array of partial products. The processor includes a first section that performs a first operation that generates an N by M array of partial products representing low order bits, and a second section that performs a second operation that generates an N by M array of partial products representing high order bits. The processor also includes a compressor that compresses the N by M array of partial products that represent the low order bits after the first operation, and generates carry bits that are utilized in the second operation.
The processor can include a memory for saving a plurality of carry bits generated by the compressor, and can also include a feedback mechanism that recirculates the plurality of carry bits obtained from the memory and utilizes the plurality of carry bits for calculations performed in the second operation. The processor can also include an adder that separately sums the N by M array of partial products that represents low order bits and the N by M array of partial products that represent high order bits.
In another aspect, the invention features a method for electronically multiplying an N bit multiplicand by an M bit multiplier using a P bit Wallace multiplier where N, M and P are integers. The method includes the steps of: providing an N bit multiplicand and an M bit multiplier, multiplying the N bit multiplicand and the M bit multiplier; and generating at least one N by M array of partial products using the P bit Wallace multiplier. The step of multiplying includes a step of performing a first pass using the P bit Wallace multiplier to generate an array representing low order bits. The step of performing the first pass comprises multiplying the N bit multiplicand and the M bit multiplier and generating an N by M array of partial products; performing a second pass using the P bit Wallace multiplier to generate an array representing high order bits wherein performing the second pass comprises multiplying the N bit multiplicand and the M bit multiplier, generating an N by M array of partial products and shifting the N bit multiplicand N bits to the left; and obtaining a first sum for the first set of partial products representing low order bits.
In one embodiment, the present invention relates to an integer digital multiplier used in a PCA-57 RISC microprocessor part number 21164PC. In this embodiment, the multiplicand is shifted as it traverses the array. The product terms remain stationary and because the carry bits are recycled, the present invention emulates having a 128-bit datapath width. Thus, the present invention allows a Wallace type multiplier to be used that has an N-bit datapath width where N represents the bit width of the operands. Thus, a computer programming instruction like Unsigned Quadword Multiply High (UMULH) for 64-bit wide operands is performed by cycling the partial products twice through a multiplier array.
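The idea of producing the low order bits in a first pass and reusing the saved carries in a second pass can be illustrated arithmetically. The following Python sketch is not a model of the disclosed circuit: it uses plain integer accumulation instead of Wallace compression, and the names `MASK64`, `first_pass_low`, `second_pass_high` and `umulh_two_pass` are assumptions introduced here. It only shows that a 64-bit-wide accumulation, cycled twice with the carries recirculated, yields the same high quadword as a full 128-bit product.

```python
MASK64 = (1 << 64) - 1


def first_pass_low(a: int, b: int) -> tuple[int, int]:
    """Pass one: accumulate the low 64 bits of each partial product; return (low, carries)."""
    acc = 0
    for i in range(64):
        if (b >> i) & 1:
            acc += (a << i) & MASK64
    # Bits above position 63 are the carries saved for the second pass.
    return acc & MASK64, acc >> 64


def second_pass_high(a: int, b: int, carries: int) -> int:
    """Pass two: accumulate the high 64 bits of each partial product plus saved carries."""
    acc = carries
    for i in range(64):
        if (b >> i) & 1:
            acc += (a << i) >> 64
    return acc & MASK64


def umulh_two_pass(a: int, b: int) -> int:
    """High 64 bits of the unsigned 128-bit product of two 64-bit operands."""
    _, carries = first_pass_low(a, b)
    return second_pass_high(a, b, carries)
```

For any 64-bit unsigned operands, `umulh_two_pass(a, b)` equals `(a * b) >> 64`, and the low quadword from the first pass equals `(a * b) & MASK64`.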
The foregoing and other objects, aspects, features, and advantages of the invention will become more apparent from the following description and from the claims.