1. Technical Field
The present invention generally relates to a multiplication apparatus and method for increasing the efficiency of multipliers, and more particularly, to an apparatus and method for increasing the performance of a low radix multiplier such that high radix performance can be achieved without a significant increase in wiring circuitry.
2. Description Of Related Art
In present processor implementations, the speed of many arithmetic operations is increased by utilizing a floating-point processor. A floating-point processor usually includes carry save adders to increase the performance of multiplication operations.
Generally, there are two popular types of radix multiplication for microprocessors. High radix multiplication (radix 8 or greater) has the advantage of requiring fewer partial products to be generated and summed than low radix multiplication (radix 4 or less). However, high radix multiplication requires complex multiples of the X operand to be generated. An example of this is illustrated in FIG. 1B with regard to the 3X and −3X operand multiples required for radix 8 multiplication.
FIG. 1A depicts the radix 4 multiplication table, the 3 multiplier bits, and X operand multiples. As can be seen for radix 4 multiplication, only the simple multiples of zero, 1X, and 2X (and their negatives) are required for the operand. As is known in the art, the zero, one, and two multiples of a number can be easily generated. A zero multiple requires only that the value be reset, zeroed out, or cleared out. A −1X multiple requires that the complement of the operand be obtained. A 2X multiple of a number is easily generated by performing a left shift by one position on the number. A −2X multiple of a number is obtained by acquiring the complement of the 2X multiple.
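The selection of operand multiples described above can be sketched in Python as follows. This is only an illustration and not the circuitry of FIG. 1A: the function names are hypothetical, the multiplier is assumed to be an n-bit two's complement value with n even, and negative multiples are formed arithmetically rather than by complement-and-increment hardware.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement multiplier into radix 4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2} and is chosen from three multiplier
    bits: the current bit pair plus the top bit of the previous pair.
    Digits are returned least significant first.
    """
    assert n % 2 == 0, "sketch assumes an even number of multiplier bits"
    y &= (1 << n) - 1
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def radix4_multiple(x, digit):
    """Return digit * x using only clear, shift, and negate."""
    if digit == 0:
        magnitude = 0          # zero multiple: clear the value
    elif abs(digit) == 1:
        magnitude = x          # 1X: the operand itself
    else:
        magnitude = x << 1     # 2X: left shift by one position
    return -magnitude if digit < 0 else magnitude


def booth_radix4_multiply(x, y, n):
    """Sum the selected multiples, each weighted by a power of 4."""
    digits = booth_radix4_digits(y, n)
    return sum(radix4_multiple(x, d) << (2 * k) for k, d in enumerate(digits))
```

For example, `booth_radix4_digits(6, 4)` yields the digits `[-2, 2]`, since −2 + 2·4 = 6.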
FIG. 1B depicts the radix 8 multiplication table, the 4 multiplier bits, and X operand multiples. As can be seen by referring to FIG. 1B, radix 8 multiplication requires the zero, 1X, 2X, 3X, and 4X multiples. As noted above, the zero, 1X, and 2X multiples are fairly straightforward and easy to compute. However, the 3X and −3X multiples required for radix 8 multiplication are quite complex and require special circuitry, such as carry look ahead adders, to compute. The 3X and −3X operand multiples are computed by using a carry propagation adder that adds the 1X and 2X multiples to generate the 3X multiple and by acquiring the complement of the 3X multiple. The 4X and −4X multiples are fairly straightforward and easy to compute. The 4X and −4X multiples are computed by performing a left shift by two positions on the binary number and by acquiring the complement of the 4X multiple.
As stated above, a major problem with radix 8 multiplication is the generation of the 3X and −3X operand multiples.
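A comparable sketch for radix 8 shows why the 3X multiple stands apart: it is the only entry that needs a true addition. As before, the function names are hypothetical, the multiplier is assumed to be an n-bit two's complement value with n a multiple of 3, and the code models arithmetic behavior only, not the circuitry of FIG. 1B.

```python
def booth_radix8_digits(y, n):
    """Recode an n-bit two's complement multiplier into radix 8 Booth digits.

    Each digit is in {-4, ..., 4} and is chosen from four multiplier bits:
    the current bit triple plus the top bit of the previous triple.
    """
    assert n % 3 == 0, "sketch assumes the bit count is a multiple of 3"
    y &= (1 << n) - 1
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for i in range(0, n, 3):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        b2 = (y >> (i + 2)) & 1
        digits.append(prev + b0 + 2 * b1 - 4 * b2)
        prev = b2
    return digits


def radix8_multiple(x, digit):
    """Return digit * x; only the 3X entry requires an addition."""
    multiples = {
        0: 0,
        1: x,
        2: x << 1,        # shift by one position
        3: x + (x << 1),  # 1X + 2X: the carry propagating addition
        4: x << 2,        # shift by two positions
    }
    magnitude = multiples[abs(digit)]
    return -magnitude if digit < 0 else magnitude


def booth_radix8_multiply(x, y, n):
    """Sum the selected multiples, each weighted by a power of 8."""
    digits = booth_radix8_digits(y, n)
    return sum(radix8_multiple(x, d) << (3 * k) for k, d in enumerate(digits))
```

For example, `booth_radix8_digits(11, 6)` yields `[3, 1]`, so multiplying by 11 selects the 3X multiple for the low digit.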
While the simplicity of radix 4 multiplication is often preferred to radix 8 multiplication, there are some advantages of radix 8 multiplication. First, radix 8 multiplication generates fewer partial products that must be dealt with.
In this regard, radix 4 multiplication often requires many more carry save adders as compared to radix 8. For example, for a 64-bit array, radix 4 multiplication usually requires 33 rows of carry save adders to compute the product, whereas radix 8 multiplication requires only 22 rows. This is computed utilizing the formula “(number of bits manipulated + number of multiplier bits retired per partial product)/number of multiplier bits retired per partial product”, rounded down to a whole number. For radix 4 multiplication the formula gives (64+2)/2 = 33 rows, and for radix 8 multiplication it gives (64+3)/3, or 22 rows after rounding down. This reduction of 11 rows for computing the multiplication product reduces the delay of the multiplier by at least one gate delay per row.
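The row count formula above can be checked with a short calculation. In this sketch the function name `rows_required` is hypothetical, and the rounding down is an assumption inferred from the 22-row figure given for radix 8.

```python
def rows_required(operand_bits, bits_per_row):
    """Rows of partial products for a given operand width.

    bits_per_row is 2 for radix 4 and 3 for radix 8. The result is
    rounded down, which is an assumption that matches the quoted figures.
    """
    return (operand_bits + bits_per_row) // bits_per_row


radix4_rows = rows_required(64, 2)  # (64 + 2) / 2
radix8_rows = rows_required(64, 3)  # (64 + 3) / 3, rounded down
rows_saved = radix4_rows - radix8_rows
```

The same formula gives nine rows for radix 4 and six rows for radix 8 with 16-bit operands.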
Illustrated in FIG. 1C is a table for illustrating the levels of carry save adders required for K operands using the optimal Wallace tree architecture in the prior art. This table was obtained empirically by drawing tree structures for various word sizes. The Wallace tree summation network utilizes the fewest number of carry save adder delays.
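A table of this kind can also be approximated with the standard recurrence for 3:2 carry save adder trees, in which each added level raises the largest reducible operand count from k to the integer part of 3k/2. The sketch below uses that textbook recurrence rather than the empirical method described above, so its output is not guaranteed to match FIG. 1C entry for entry; the function names are hypothetical.

```python
def max_operands_by_level(max_levels):
    """Largest operand count reducible to two words in each number of levels."""
    table = []
    k = 2  # two words (sum and carry) need no carry save adder level
    for level in range(1, max_levels + 1):
        k = (3 * k) // 2
        table.append((level, k))
    return table


def levels_needed(operands, max_levels=32):
    """Smallest number of carry save adder levels for the given operand count."""
    for level, limit in max_operands_by_level(max_levels):
        if operands <= limit:
            return level
    raise ValueError("operand count exceeds the generated table")
```

Under this recurrence, 18 operands need six levels and 9 operands need four.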
FIG. 1D depicts a diagram of an example of a Booth-2 (radix 4) multiply with partial products for multiplying two 16 bit numbers. As can be seen by referring to FIG. 1D, there are nine rows of partial products to be added together to compute a final product for the two operands. To this end, the partial products form columns of partial product bits, and as known in the art, each of the bits in one column should be added together to produce one of the bits of the product. The least significant bit of the sum of all of the bits in the column represents the product bit for the bit position corresponding to the column. The other bits of the sum are shifted to the adjacent column for inclusion into the summation of the adjacent column. By summing each of the columns in this way, the product can be determined. Note that the additional 1's (“+1”) on the right side of the partial product depicted by FIG. 1D are needed to complete the 2's complement for cases when a negative Booth multiple is selected.
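The column-by-column summation described above can be sketched as follows. To keep the example short, it assumes plain unsigned shift-and-AND partial product rows rather than the Booth-2 rows of FIG. 1D, so no sign handling or “+1” correction appears; the function name `column_sum_product` is hypothetical.

```python
def column_sum_product(x, y, n):
    """Multiply two unsigned n-bit numbers by summing partial product columns."""
    # Row i holds x shifted left by i positions when multiplier bit i is set.
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    product = 0
    carry = 0
    for col in range(2 * n):
        # Add every bit in this column plus whatever the previous column passed on.
        total = carry + sum((row >> col) & 1 for row in rows)
        product |= (total & 1) << col  # least significant bit is the product bit
        carry = total >> 1             # remaining bits move to the adjacent column
    return product
```

The hardware structures discussed in the following paragraphs differ only in how the additions inside each column are arranged.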
FIG. 1E depicts a diagram of an example of a Booth-3 (radix 8) multiply with partial products. As can be seen in this example, there are only six rows of partial products to be added together to compute a final product for the two 16 bit operands. This is accomplished because the radix 8 multiplier generates fewer partial products by generating 3X and 4X multiples. As can be seen in FIG. 1E, the partial products generated by the radix 8 multiplier contain an offset of three extra bits per partial product as compared to the partial products generated by the radix 4 multiplier (FIG. 1D), thereby requiring a larger shift per partial product row. This larger shift per partial product row leads to increased wiring complexity.
FIG. 1F depicts a block diagram of a prior example of a linear summation array multiplier 7 for partial products. As can be seen, each of the carry save adders (CSA) receives a partial product term (P). Each of the carry save adders also receives a sum (S) and carry (C) term from two previous carry save adders. This is a simple architecture to implement and has a regular structure. The linear summation array multiplier 7 may be utilized to compute a final product for the two operands of FIG. 1D. The nine rows of partial products (FIG. 1D) are added together one bit at a time. Although this structure is one of the simplest and most regular of all known summation structures, it also exhibits one of the highest delays making it impractical for adding a large number of partial products.
FIG. 1G depicts a block diagram of a prior example of an odd/even summation array 8 for partial products in a multiplier. As can be seen by referring to FIG. 1G, each of the carry save adders (CSA) receives a partial product term. Each of the carry save adders also receives a sum and carry term from two previous carry save adders. However, in this odd/even summation implementation, the sum and carry terms from previous carry save adders skip every other row. While this architecture is more complex to implement, it has the advantage of having approximately one-half the number of adder delays as the linear summation array multiplier 7 (FIG. 1F).
FIGS. 2A and 2B illustrate an example of conventional linear summation circuitry 30 that may be utilized to add the partial product bits in a column of partial products to produce a bit of the product of two operands. In this regard, the circuitry 30 depicted in FIGS. 2A and 2B may be utilized to add a column of partial product bits for up to eighteen rows of partial products. Since the circuitry 30 adds a bit from each row of the radix 4 partial products, the circuitry 30 depicted by FIGS. 2A and 2B may add up to 18 bits of information.
The linear summation circuitry 30 utilizes a linear array of 16 carry save adders 31-38 and 41-48. A linear array of adders is an array of adders in which each of the adders transmits each of its outputs to the same next adder. As shown by FIGS. 2A and 2B, the configuration of the linear summation circuitry 30 has a regular and efficient layout. This configuration provides for light loading and simple interconnections for the carry save adder cells. However, the linear summation has a problem in that it produces a large number of carry save adder delays (in this case 16 delays) in the critical path.
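The delay count of a linear array can be illustrated with a word-level model. This sketch is an assumption-laden simplification: it applies a 3:2 carry save adder to whole integers instead of the single column bits handled by circuitry 30, and the names `csa` and `linear_sum` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry save adder on whole words, so that a + b + c == s + cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def linear_sum(rows):
    """Reduce the rows with one chain of adders; return (sum, carry, delays)."""
    assert len(rows) >= 3
    s, cy = csa(rows[0], rows[1], rows[2])
    delays = 1
    for row in rows[3:]:
        # Each adder feeds both of its outputs to the same next adder.
        s, cy = csa(s, cy, row)
        delays += 1
    return s, cy, delays
```

With 18 rows the chain contains 16 adders, all of which lie on the critical path.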
FIG. 3 depicts a block diagram of conventional odd/even summation circuitry 50 that may be utilized to add the partial product bits in a column of partial products to produce a bit of the product of two operands. In this regard, the circuitry 50 depicted in FIG. 3 may be utilized to add a column of partial product bits for up to eighteen rows of partial products. Since the circuitry 50 adds a bit from each row of the partial products, the circuitry 50 depicted by FIG. 3 may add up to 18 bits of information.
As can be seen by referring to FIG. 3, the odd/even summation circuitry 50 utilizes a first linear array of carry save adders 51, 53, 55, 57, 61, 63, and 65 and a second linear array of carry save adders 52, 54, 56, 58, 62, 64, and 66. Approximately half of the partial product bits (bits 1, 2, 3, 7, 9, 11, 13, 15, and 17) are added by the first linear array of carry save adders, and the remaining bits (bits 4, 5, 6, 8, 10, 12, 14, 16, and 18) are added by the second linear array of interconnected carry save adders. The results produced by the first and second linear arrays of carry save adders are then added via carry save adders 67 and 68 to produce the appropriate results for the circuitry 50.
In this example of eighteen partial product bits depicted by FIG. 3, the configuration of the odd/even summation circuitry 50 causes only nine carry save adder delays. When the odd/even summation circuitry 50 is utilized to the fullest extent, the layout produces seven fewer carry save adder delays, as compared with the linear summation circuitry 30 (FIGS. 2A and 2B). Utilizing the odd/even summation circuitry 50 usually requires more complex wiring than a simple linear array and increases the loading on the carry save adders, but the odd/even summation circuitry 50 is substantially faster than the linear summation circuitry 30 because its two linear arrays perform their summations in parallel.
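The nine-delay figure can be reproduced with a word-level model of two parallel chains followed by two merging adders. The sketch makes several assumptions: whole integers stand in for single column bits, the rows are split so that each chain receives nine of the eighteen inputs, and the names `csa`, `chain`, and `odd_even_sum` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry save adder on whole words, so that a + b + c == s + cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def chain(rows):
    """Linear chain of adders; returns (sum, carry, delays)."""
    s, cy = csa(rows[0], rows[1], rows[2])
    delays = 1
    for row in rows[3:]:
        s, cy = csa(s, cy, row)
        delays += 1
    return s, cy, delays


def odd_even_sum(rows):
    """Two chains working in parallel, then two adders to merge the results."""
    assert len(rows) >= 8
    first = rows[0:3] + rows[6::2]   # assumed split: rows 1, 2, 3, 7, 9, ...
    second = rows[3:6] + rows[7::2]  # assumed split: rows 4, 5, 6, 8, 10, ...
    s1, c1, d1 = chain(first)
    s2, c2, d2 = chain(second)
    s, cy = csa(s1, c1, s2)          # first merging adder
    s, cy = csa(s, cy, c2)           # second merging adder
    return s, cy, max(d1, d2) + 2
```

For 18 rows each chain is seven adders deep, and the two merging adders bring the critical path to nine.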
FIG. 4 depicts a block diagram of a prior example of circuitry 70 utilizing a full Wallace tree summation configuration that may be utilized to add the partial product bits in a column of up to 18 rows of partial products. The full Wallace tree summation circuitry 70 shown by FIG. 4 utilizes 16 carry save adders 71-78 and 81-88. In this example for 18 bits, the configuration of the full Wallace tree summation circuitry 70 causes only six carry save adder delays, as shown by FIG. 4. When a full Wallace tree summation configuration is utilized to the fullest extent, the layout produces ten fewer carry save adder delays as compared with the linear summation circuitry 30 (FIGS. 2A and 2B) and three fewer carry save adder delays as compared with the odd/even summation circuitry 50 (FIG. 3). However, utilizing a full Wallace tree summation configuration requires the most irregular and complex wiring of the three summation configuration types.
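A level-by-level reduction of the Wallace type can be modeled in the same word-level style. This is a generic sketch and not the specific wiring of FIG. 4: at each level it groups the current words in threes, passes any leftover words through unchanged, and counts levels and adders. The names `csa` and `wallace_sum` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry save adder on whole words, so that a + b + c == s + cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def wallace_sum(rows):
    """Tree reduction to two words; returns (sum, carry, delays, adders)."""
    assert len(rows) >= 3
    level = list(rows)
    delays = 0
    adders = 0
    while len(level) > 2:
        grouped = len(level) - len(level) % 3
        next_level = []
        for i in range(0, grouped, 3):
            s, cy = csa(level[i], level[i + 1], level[i + 2])
            next_level += [s, cy]
            adders += 1
        next_level += level[grouped:]  # leftover words wait for the next level
        level = next_level
        delays += 1
    return level[0], level[1], delays, adders
```

For 18 rows this model uses 16 adders arranged in six levels, matching the counts given above.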
For illustrative purposes, assume that the circuitry 30 and 50 is implemented in a radix 4 multiplier. If a radix 8 multiplier is used instead, then the number of carry save adders can be reduced since the number of partial products is reduced. In this regard, the circuitry 50 of FIG. 3 can be reduced to the circuitry 110 of FIG. 5 when a radix 8 multiplier is used instead of a radix 4 multiplier. Therefore, the odd/even summation circuitry 110 may be implemented with only 10 carry save adders (111-118, 121 and 122), resulting in only six carry save adder delays, as shown by FIG. 5. When the odd/even summation circuitry 110 is utilized to the fullest extent, the circuitry 110 produces the same number of carry save adder delays as the full Wallace tree summation circuitry 70 (FIG. 4), which relies on very irregular and complex wiring between its carry save adders. However, utilizing radix 8 partial products requires the much more complicated implementation of radix 8 partial product generators and the generation of the 3X operand multiple.
Thus, a heretofore unaddressed need exists in the industry for summation circuitry having a simple and regular pattern that is capable of achieving near the performance of Wallace tree summation circuitry without requiring the use of high radix multipliers.
The present invention provides an apparatus and method for utilizing regular summation circuitry to optimally sum the partial products produced by a multiplier in terms of speed and wiring complexity.
Briefly described, in architecture, the apparatus includes a first array of odd/even summation circuitry, a second array of odd/even summation circuitry, and a linear array of adders. The apparatus is configured to add a row of partial product bits produced by a multiplier in multiplying a first operand with a second operand. The first array of odd/even summation circuitry produces a first summation of a portion of the partial product bits. The second array of odd/even circuitry produces a second summation of the other partial product bits. The linear array of adders then adds the first summation and the second summation to produce a carry save representation of a product bit (i.e., a bit of the product produced by multiplying the first operand by the second operand).
In accordance with another feature of the present invention, the adders within an array of odd/even summation circuitry can be configured to receive inputs from adders outside of the array to further increase performance. For example, the first array of odd/even summation circuitry described above may include a first linear array of adders and a second linear array of adders. The first linear array of adders may include a first adder and a second adder. The first adder receives output from the second adder and from another adder outside of the first linear array of adders. This other adder receives and adds a plurality of partial product bits from the portion of partial product bits set forth hereinabove. As a result, a larger number of partial products can be added via the first linear array of adders.
The present invention can also be viewed as providing a method for summing partial product bits in a multiplier, the multiplier for producing a plurality of partial products based on a first operand and a second operand multiplied together by the multiplier to produce a product. In this regard, the method can be broadly summarized by the following steps: adding, via an array of odd/even summation circuitry, a first plurality of partial product bits, each of the first plurality of partial product bits from a different one of a portion of the partial products to produce a first plurality of sum and carry bits; adding, via another array of odd/even summation circuitry, a second plurality of partial product bits, each of the second plurality of partial product bits from a different one of the remaining partial products to produce a second plurality of sum and carry bits; adding the first and second pluralities of sum and carry bits; and producing a third plurality of sum and carry bits via the adding the first and second pluralities of sum and carry bits step, the third plurality of bits forming a carry save representation of a product bit of the product.
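The steps summarized above can be loosely modeled at the word level. The sketch below is an illustration under stated assumptions and not the claimed circuitry: whole integers stand in for column bits, the partial products are split into two halves, each half is split into alternating rows for its two chains, and two adders merge each pair of results. All function names are hypothetical, and no delay figures are implied.

```python
def csa(a, b, c):
    """3:2 carry save adder on whole words, so that a + b + c == s + cy."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def chain(rows):
    """Linear chain of adders over at least three rows; returns (sum, carry)."""
    s, cy = csa(rows[0], rows[1], rows[2])
    for row in rows[3:]:
        s, cy = csa(s, cy, row)
    return s, cy


def merge(p, q):
    """Combine two (sum, carry) pairs with a two-adder linear array."""
    s, cy = csa(p[0], p[1], q[0])
    return csa(s, cy, q[1])


def odd_even(rows):
    """Odd/even summation: two chains on alternating rows, then a merge."""
    return merge(chain(rows[0::2]), chain(rows[1::2]))


def hybrid_sum(rows):
    """Two odd/even arrays followed by a linear merge; returns (sum, carry)."""
    assert len(rows) >= 12, "each chain in this sketch needs at least three rows"
    half = len(rows) // 2
    first = odd_even(rows[:half])    # first plurality of sum and carry bits
    second = odd_even(rows[half:])   # second plurality of sum and carry bits
    return merge(first, second)      # carry save representation of the total
```

The returned pair is a carry save representation: adding the two words gives the total of all rows.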
Other features and advantages of the present invention will become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional features and advantages be included herein within the scope of the present invention.