1. Field of the Invention
The present invention relates to the field of computer arithmetic, and in particular to a method and apparatus for efficient binary multiplication.
2. Art Background
A number of techniques and algorithms exist for performing multiplication in software and computer hardware. With the growth of applications requiring fast multiplication, computer designers have found it necessary to turn to hardware solutions to implement multiplication. However, fast multiplication is a very hardware intensive operation, requiring a large number of devices that occupy a sizable amount of real estate in an integrated circuit. Thus, it is a goal of the computer designer to achieve fast multiplication using the smallest number of devices necessary.
Many components of a computer system would benefit from smaller fast multipliers. FIG. 1 is a standard block diagram of a computer system including a CPU 101 and an arithmetic coprocessor 102. A main system bus 103 links processors 101 and 102 to each other and to a main memory 104 and an I/O processor 105. The I/O processor 105 links the processing units 101 and 102 and the main memory 104 through an I/O expansion bus 106 to various I/O devices, including a secondary memory 107 through a memory controller 108, a printer 109 and a keyboard 110 through I/O controller 111, and to a monitor 112 through a graphics controller 113.
FIG. 2 illustrates CPU 101 in greater detail. An execution unit 201 executes arithmetic operations according to instructions fetched by an instruction sequencer 202 from an instruction cache 203. In a multiplication operation in the CPU, the operands, multiplicand A and multiplier B, are provided to the execution unit 201 by register file 204 over operand buses 205. A result bus 206 carries the result of the arithmetic operation back to register file 204 and to address/data bus 207 for further processing. As is well known in the art, the coprocessor 102 also includes an execution unit.
FIG. 3 illustrates the internal structure of the execution unit 201. The execution unit 201 includes an arithmetic logic unit (ALU) 301, a barrel shifter 302, a dedicated integer multiply/divide unit 303 and a dedicated floating point unit 304. The ALU 301 performs basic numerical and logical operations. The barrel shifter 302 is used for the high speed shifting of a binary number to the left or right by a specified number of bits. In particular, the barrel shifter 302 is used to align the mantissas of floating point numbers and to normalize the exponent of their sum during addition and multiplication operations. The integer multiply/divide unit 303, as the name implies, performs integer multiplication and division, and the floating point unit 304 executes floating point operations. All of the execution unit components are linked to the operand buses 205 and the result bus 206.
In the execution unit 201 of the CPU 101, multiplication takes place in the ALU 301, in the integer multiply/divide unit 303, and in the floating point unit 304, where it generates the mantissa portion of the floating point product of two numbers. Multipliers are also found in the multiplier unit of arithmetic coprocessor 102, and in the graphics controller 113 for pixel address generation.
In all of the above applications, the multiplication process generates a product whose total number of bits equals the sum of the number of bits of the multiplicand A and the multiplier B. For the other arithmetic and logic operations, however, the number of bits in the result often equals the number of bits in each operand. Thus, in the design of the execution unit 201, each of the operand buses 205 is usually the same width as the result bus 206. To save space on the silicon of the CPU, designers do not include wider result buses to support multiplication operations. Consequently, assuming that the input operands are each m bits wide, the multiplication result is 2m bits wide, and two steps are required to transfer the result sequentially onto the m-bit result bus 206.
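The two-step transfer described above can be illustrated with a minimal Python sketch. It assumes unsigned operands, and the function name split_product is hypothetical; it only shows how a 2m-bit product divides into two m-bit words for an m-bit bus, not how any particular hardware sequences the transfer.

```python
def split_product(a, b, m):
    """Multiply two unsigned m-bit operands; return (high_word, low_word).

    Assumption for illustration: operands are unsigned and each fits in m bits.
    """
    assert 0 <= a < (1 << m) and 0 <= b < (1 << m)
    mask = (1 << m) - 1
    product = a * b             # up to 2m bits wide
    low_word = product & mask   # one m-bit bus transfer
    high_word = product >> m    # the other m-bit bus transfer
    return high_word, low_word


if __name__ == "__main__":
    high, low = split_product(0b1010, 0b1101, 4)
    print(format(high, "04b"), format(low, "04b"))
```

For the operands 1010 and 1101 with m=4, the two words are 1000 and 0010, which together form the 8-bit product 10000010.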
Several multiplier configurations are currently in use to perform fast multiplication. One of the simplest implementations multiplies the multiplicand A by the multiplier B one bit at a time and adds the resulting terms. For example, for two m=4-bit binary numbers A=1010 and B=1101,
      1010
    × 1101
    ------
      1010
     0000
    1010
   1010
  --------
  10000010
This simple algorithm is inefficient in that the four 1-bit products must be stored until the final addition step is completed. In machine implementations, it is thus desirable to add each 1-bit product, as it is generated, to the sum of the preceding terms to form a number called the partial product. Using this simple algorithm, m=4 partial products are generated when multiplying two m=4-bit numbers. As is well known in the art, the addition of the partial products is normally carried out using a Wallace tree composed of carry-save adders to minimize the delay involved in obtaining the final product.
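The accumulation of partial products described above can be sketched as follows. This is a minimal software illustration assuming unsigned operands; the function name shift_add_multiply is hypothetical, and the sequential loop stands in for hardware that would normally sum the terms in an adder tree.

```python
def shift_add_multiply(a, b, m):
    """Unsigned m-bit x m-bit multiply, consuming one multiplier bit per step.

    Assumption for illustration: operands are unsigned and each fits in m bits.
    Returns the product and the list of running partial products.
    """
    assert 0 <= a < (1 << m) and 0 <= b < (1 << m)
    partial = 0
    history = []
    for i in range(m):
        bit = (b >> i) & 1
        term = (a << i) if bit else 0  # multiplicand times one multiplier bit, shifted into place
        partial += term                # add immediately instead of storing every term
        history.append(partial)
    return partial, history


if __name__ == "__main__":
    product, history = shift_add_multiply(0b1010, 0b1101, 4)
    print(format(product, "08b"))
    print([format(p, "08b") for p in history])
```

Each loop iteration needs only the running sum and the current term, which is the storage saving the paragraph above refers to.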
To reduce the hardware complexity of the multiplication operation, designers typically use the radix-4 modified Booth recoding algorithm. This algorithm produces m/2 partial products, and thus requires half as many components for m-bit × m-bit multiplication as the simple algorithm described above. Multiplication using modified Booth recoding is very well known in the art and is described in J. P. Hayes, Computer Architecture and Organization, McGraw-Hill Publishing Co., 241-250 (1988), and J. L. Hennessy & D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., §A.2 (1990).
FIG. 4 illustrates an m-bit × m-bit multiplier scheme using modified Booth recoding. In this scheme, a modified Booth recoder 401 recodes an m-bit multiplier B into m/2 groups of three bits each, which are presented to a multiplicand selector 402. The multiplicand selector also receives as an input an m-bit multiplicand A, and produces m/2 partial products. As is well known in the art, the multiplication producing each partial product is accomplished by selecting 0, ±A, or ±2A according to the value of the corresponding 3-bit recoded multiplier group, that is, by shifting and, where required, negating the multiplicand bits. The partial products are then added using one of several possible conventional adder tree implementations to obtain the final 2m-bit product in one step. Usually, the adder is implemented as a Wallace tree 403 of carry-save adders. The Wallace tree produces the product in carry-save form as a carry vector C and a sum vector S, which are then added in the final stage of the multiplier using a fast parallel adder 404. Because the result bus 206 is only m bits wide, m bits of the 2m-bit final product are temporarily stored in a latch 405 while the other m bits are dumped on the result bus 206 through a bus driver 406. Note that the logical control of the components of the multiplication apparatus is accomplished by a state machine 407, which issues control signals received by the components, each component having control inputs (not shown) as is obvious to those skilled in the art.
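The recoding and selection steps described above can be illustrated with a short Python sketch. It assumes two's complement signed operands and an even m, and the function names booth_radix4_digits and booth_multiply are hypothetical. The adder tree is modeled by a plain sum, so the sketch shows only how m/2 partial products arise, not the carry-save structure.

```python
def booth_radix4_digits(b, m):
    """Recode a signed m-bit multiplier into m/2 digits, each in {-2, -1, 0, 1, 2}.

    Assumptions for illustration: b is a two's complement value that fits in
    m bits, and m is even. Each digit is derived from an overlapping 3-bit
    group (b[i+1], b[i], b[i-1]), with b[-1] taken as 0.
    """
    assert m % 2 == 0
    assert -(1 << (m - 1)) <= b < (1 << (m - 1))
    bits = b & ((1 << m) - 1)   # two's complement bit pattern of b
    digits = []
    prev = 0                    # implicit bit to the right of bit 0
    for i in range(0, m, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(a, b, m):
    """Form one partial product per recoded digit and sum them."""
    assert -(1 << (m - 1)) <= a < (1 << (m - 1))
    # Digit k has weight 4**k, so its partial product is shifted left by 2k bits.
    partials = [(d * a) << (2 * k) for k, d in enumerate(booth_radix4_digits(b, m))]
    return sum(partials), partials


if __name__ == "__main__":
    print(booth_radix4_digits(6, 4))
    print(booth_multiply(5, 6, 4))
```

For the multiplier 0110 (decimal 6) with m=4, the recoded digits are -2 and 2, since -2 + 2·4 = 6; only two partial products are formed instead of four.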
After the first m bits have been dumped onto the result bus 206, the latch releases the remaining m bits and transfers them to the result bus 206. Thus, this multiplier requires only two states to produce the 2m-bit result. However, the high speed of this multiplier comes at the expense of the need for more hardware than other implementations.
An m-bit × m-bit multiplier may be implemented using fewer components by sacrificing processing speed, as shown in FIGS. 5 and 6. This scheme uses an m-bit × m/2-bit multiplier in an iterative fashion. The Booth recoder 501 is configured to recode only m/4 groups of 3 bits each at a time, in two steps. The m/4 recoded groups applied to m-bit multiplicand selector 502 produce m/4 partial products, half the number of partial products produced by the m-bit × m-bit multiplier of FIG. 4. Thus, the number of devices in the carry-save adder tree 503 and in other components is approximately half that used in the implementation of FIG. 4.
Referring to FIG. 6, during a first pass through the multiplier the multiplicand A is multiplied by the least significant m/2 bits of the multiplier, B_L, producing the first 3m/2-bit partial product, B_L × A. (Recall that in this configuration only m/2 bits of the multiplier are applied to the m-bit multiplicand, resulting in m/2 + m = 3m/2 bits for the interim product.) In a second pass, the multiplicand A is multiplied by the most significant m/2 bits of the multiplier, B_H, producing the second 3m/2-bit partial product, B_H × A, which is shifted left from the first 3m/2-bit partial product by m/2 bits. These two partial products, when added together, produce the final 2m-bit product.
The least significant m/2 bits of the final product are the same as those of the first partial product, and are thus valid bits produced during the first pass. However, the most significant m bits of the first partial product must be added to the appropriate bits of the second partial product to obtain the remaining 3m/2 valid bits of the final product. Thus, in the implementation of FIG. 5, the most significant m bits of the first partial product B_L × A are fed back through a feedback latch 505 into the adder tree 503 during the second state, where they are added to the second partial product, B_H × A. Note that before adding the most significant m bits of the first partial product to the second partial product, the first partial product must be sign extended by m/2 bits to obtain the correct result. The sign extension depends on the sign (most significant bit) of B_L × A, the first partial product. If B_L × A is positive (MSB=0), then the sign extension consists of m/2 bits of value "0". Conversely, if B_L × A is negative (MSB=1), then the first partial product is extended by m/2 "1" bits. Sign extension takes place in the feedback latch 505.
The resulting carry and sum words of the tree 503 are added in parallel adder 504 to produce 3m/2 valid bits in the second state. However, because the bus width is limited to m bits, the implementation requires an additional state to transfer all 2m bits of the final product to the result bus 206 through bus driver 506. Consequently, although this implementation requires less hardware than the full m-bit × m-bit multiplier of FIG. 4, this technique requires three states to perform the multiplication.
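The two-pass arithmetic described above can be checked with a small Python model. This is a sketch under stated assumptions, not the circuit of FIG. 5: operands are two's complement, m is a multiple of 4, the first pass uses the Booth digits of the low multiplier half and the second pass those of the high half, and the function names are hypothetical. Python's arithmetic right shift on a negative integer stands in for the sign extension performed in the feedback latch.

```python
def booth_radix4_digits(b, m):
    """Recode a signed m-bit multiplier into m/2 digits in {-2, -1, 0, 1, 2}."""
    assert m % 2 == 0 and -(1 << (m - 1)) <= b < (1 << (m - 1))
    bits = b & ((1 << m) - 1)
    digits, prev = [], 0
    for i in range(0, m, 2):
        b0, b1 = (bits >> i) & 1, (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def two_pass_multiply(a, b, m):
    """Model an m x m multiply done as two passes of an m x m/2 multiplier."""
    assert m % 4 == 0 and -(1 << (m - 1)) <= a < (1 << (m - 1))
    half, quarter = m // 2, m // 4
    digits = booth_radix4_digits(b, m)

    def pass_product(ds):
        # m/4 partial products per pass, summed as the adder tree would.
        return sum((d * a) << (2 * k) for k, d in enumerate(ds))

    first = pass_product(digits[:quarter])   # low multiplier half times A (may be negative)
    second = pass_product(digits[quarter:])  # high multiplier half times A

    low_bits = first & ((1 << half) - 1)     # m/2 bits valid after the first pass
    feedback = first >> half                 # upper bits, sign extended by the arithmetic shift
    upper_bits = feedback + second           # remaining 3m/2 bits, valid after the second pass
    return (upper_bits << half) + low_bits


if __name__ == "__main__":
    print(two_pass_multiply(-7, 93, 8), -7 * 93)
```

Because the low half is recoded with Booth digits, its contribution can be negative even when the multiplier as a whole is positive, which is why the fed-back bits need sign extension rather than zero fill.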
A third existing technique accomplishes m-bit × m-bit multiplication by iteratively using an m/2-bit × m/2-bit multiplier, as shown in FIG. 7. As in the configuration of FIG. 5, a modified Booth recoder 701 recodes half the multiplier bits at a time to produce m/4 recoded groups each time. In addition, in this implementation, the m/4 recoded groups multiply only half the multiplicand bits in multiplicand selector 702 during each pass through the multiplier, producing m/4 partial products that each have half as many bits as the partial products in the apparatus of FIG. 5. Thus the hardware required by this configuration is approximately half that required by the multiplier of FIG. 5.
During each iteration m/2 valid bits are generated. These m/2-bit groups can be collected in an accumulator latch 705 to form two m-bit values to be dumped on the m-bit result bus 206 in two states, or each m/2-bit group can be dumped on the bus individually in four states. In either case, a total of four states is required to generate the final product on the result bus from the time the input operands are received. Thus, although this implementation uses less hardware than the other existing techniques, it operates at half the speed of the multiplier of FIG. 4.
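The idea of building an m-bit × m-bit product from four passes of an m/2-bit × m/2-bit multiplier can be sketched as below. This is only an arithmetic illustration under simplifying assumptions: operands are unsigned, Booth recoding is omitted, the ordering of the four sub-products is chosen for readability, and the sketch does not model when each m/2-bit group of the result becomes valid. The function name four_pass_multiply is hypothetical.

```python
def four_pass_multiply(a, b, m):
    """Unsigned m x m multiply built from four m/2 x m/2 sub-products.

    Assumptions for illustration: unsigned operands that fit in m bits, m even.
    """
    assert m % 2 == 0
    assert 0 <= a < (1 << m) and 0 <= b < (1 << m)
    half = m // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # One entry per pass: (multiplicand half, multiplier half, weight as a shift).
    passes = [
        (a_lo, b_lo, 0),
        (a_hi, b_lo, half),
        (a_lo, b_hi, half),
        (a_hi, b_hi, 2 * half),
    ]
    accumulator = 0
    for x, y, shift in passes:
        accumulator += (x * y) << shift  # each sub-product is at most m bits wide
    return accumulator


if __name__ == "__main__":
    print(format(four_pass_multiply(0b1010, 0b1101, 4), "08b"))
```

Each pass needs only an m/2-bit × m/2-bit multiplier, at the cost of four passes instead of one.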
As can be seen from the conventional multiplier implementations, increases in system speed require substantial increases in the number of hardware components, resulting in the use of more area on silicon. Conversely, any attempt to decrease hardware complexity results in an accompanying increase in the number of states necessary to implement multiplication.