1. Field of the Invention
The invention relates to digital multipliers, and more particularly, to Boolean multipliers.
2. Background of the Invention
During the twentieth century, mankind has become dependent on computers for calculation and storage of scientific and business information. The processing of this information by computer requires the iterative application of basic mathematical functions. Integer binary addition and subtraction operations are easily performed in timing comparable to processor speeds. Multiplication (and inversely division), however, has traditionally been a time-consuming process. Improving the processing of multiplication and division instructions can greatly enhance the power of modern computers, especially in large simulation problems that require millions of multiplication or division operations to complete.
In order to design processors with increasing speed and improved efficiency, it is necessary to minimize the delay associated with mathematical operations. The ideal is to utilize a design that produces results in one clock cycle. Addition and subtraction are easily accomplished under this constraint due primarily to carry-look-ahead adders (CLA). C. H. Chen (ed.), Computer Engineering Handbook, New York, McGraw-Hill, pp 4.5-4.7, 1992; O. L. MacSorley, "High-Speed Arithmetic in Binary Computers," Proceedings of the IRE, 49, pp 67-91, 1961. Carry-look-ahead adders produce all sum terms of an addition practically simultaneously.
Multiplication, and its inverse, division, require numerous gate delays per bit of output and are rarely capable of producing a result in a single processor clock cycle. To allow for this delay, modern RISC processors utilize a pipeline architecture that breaks the multiplication process into roughly equal tasks that can be completed in a clock cycle. A new multiplication can be launched into the pipeline each clock cycle, and one result retrieved each cycle. When programs require large numbers of multiplications to be performed sequentially, the arithmetic pipeline output approaches one operation per cycle if there are no data dependencies.
Problems that require numerous sequential multiplications, however, are rare. Programs usually contain a multiplication nested among other operations and possibly among conditional branches. In these cases the processor must wait for the multiplication to be completed by the arithmetic pipeline. This delay (latency) can cause pipeline stalls and reduce the effectiveness of the processor. As processors achieve greater clock speed and strive for higher throughputs, single-cycle multipliers become the goal of designers.
Other examples of systems that are limited by multiplier speed are neural networks and systolic arrays. H. T. Kung, "Why Systolic Architectures?" Computer Magazine, Volume 15, January 1982, pp 37-46. These two parallel forms are constructed from numerous elementary processors, or cells, that perform simple operations on the data they receive and then pass the modified data on to other cells that similarly modify and pass the data. In digital neural networks (analog neural networks also exist), the weighting of input data must be accomplished by some type of binary multiplier. This multiplication delay is the dominant factor in determining the length of each processor cycle and therefore directly influences the array throughput rate. The need for faster multipliers is critical in this specialized programming field.
Many methods exist for performing binary multiplication in computer systems. In processors currently in production, multipliers are derivatives of the parallel multiplier (discussed in greater detail below). These designs, while combinational in practice, are essentially based on multiplication algorithms rather than Boolean logic. The Boolean expression for the multiplication function has historically been overlooked, possibly because of the high gate count and complex Boolean expressions required to realize a Boolean multiplier.
High gate count is also encountered in traditional multipliers, with the number of gates increasing exponentially with the product size. Because of the exponential relationship between multiplier scale (number of bits in operands or product) and multiplier complexity (transistor count), estimates of actual multiplier size can be produced from baseline designs. FIG. 1 gives an idea of the size of 8-bit multipliers. With VLSI (Very Large Scale Integration) systems becoming commonplace and multi-million transistor circuits no longer rare, the problem has been reduced to that of mastering the Boolean expressions.
The history of using machines to compute mathematical functions is ancient and diverse. The abacus has been in use for more than 5,000 years. Encyclopedia Britannica, 15th Edition, Vol. 16, p 640, Chicago, 1989. In 1642, Blaise Pascal built the first adding machine and, in 1673, the first machine to include the multiplication function, Gottfried Wilhelm Leibniz's Stepped Reckoner, was built.
For centuries, mathematical functions were realized mechanically allowing operations like multiplication to be accomplished in a few seconds. In the twentieth century, computing equipment evolved into electromechanical systems capable of performing multiple multiplications per second and finally, by the 1950s, into totally electronic machines executing thousands of arithmetic operations per second.
Modern digital computers are often rated by how many mathematical operations they can perform in a given time (e.g. MFLOPS, or Millions of Floating Point operations per second). Of these millions of operations, multiplication and division are the most time consuming. In counting normalized floating point operations per second, multiplication is given a weight of 4 times that of addition. In his 1964 paper that introduced the Wallace tree adder, C. S. Wallace claimed that the arithmetic unit of a computer "used for scientific computations, will spend nearly half of its time multiplying or dividing". C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, Volume 13, pp. 14-17, February 1964. The processes of multiplication and division have remained key topics for research throughout the history of the computer.
The first multiplication methods used with computers and calculating machines were algorithmic in nature. The iterative addition method simply adds the multiplicand to itself as many times as the multiplier dictates. This method requires minimal code to implement and only one carry bit per iteration. Mechanical accounting machines utilized a hardware realization of this method for multiplication. Obviously this process can be extremely slow for large numbers.
The second method is based on the system used for multiplying decimal numbers. Each digit of the multiplier operates on the entire multiplicand to produce intermediate terms (partial products). The partial products from each operation are shifted to the proper location with respect to the exponent value of the multiplier digit, then all the intermediate terms are added to produce the final product. C. H. Chen (ed.), Computer Engineering Handbook, New York, McGraw-Hill, pp 4.5-4.7, 1992. FIG. 2 demonstrates this technique in identical multiplications, one in decimal, the other in binary.
This "left shift" multiplication technique requires a more complex algorithm for implementation but offers a great increase in speed over the iterative method since the number of operations is always equal to the number of bits in the operands. Left shift arrays can be easily constructed in two parts: a partial product generator, which is simply an array of logical AND gates, and the addition array, which combines all partial products to obtain the final result. Notice, however, that the sign bit is not actually generated by this technique, but must be calculated separately. This organizational requirement led Booth, in 1951, to develop a method that returned the correct two's complement value. A. D. Booth, "A Signed Binary Multiplication Technique," Quarterly Journal of Mechanics and Applied Mathematics, Volume 4, number 2, pp. 236-240, 1951.
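The two methods described above can be sketched as follows. This is only an illustration for unsigned operands; the function names and the `n_bits` parameter are hypothetical and do not come from the cited references.

```python
def multiply_iterative(multiplicand: int, multiplier: int) -> int:
    """Iterative addition: add the multiplicand `multiplier` times."""
    product = 0
    for _ in range(multiplier):
        product += multiplicand
    return product


def multiply_shift_add(multiplicand: int, multiplier: int, n_bits: int) -> int:
    """Left-shift method for unsigned operands of n_bits each."""
    partial_products = []
    for i in range(n_bits):
        multiplier_bit = (multiplier >> i) & 1
        # Partial product generator: AND the multiplicand with one multiplier
        # bit, then shift the row left to the weight of that bit.
        row = (multiplicand if multiplier_bit else 0) << i
        partial_products.append(row)
    # Addition array: combine all partial products into the final product.
    return sum(partial_products)
```

The iterative version loops once per unit of the multiplier value, whereas the left-shift version always loops `n_bits` times, which is the speed advantage noted above.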
Booth observed that performing a 2's complement multiplication on a multiplier designed for unsigned binary numbers produced predictable error. This result error is based on the use of 2's complement numbers. A negative number in 2's complement is defined as:
$$-m = 2^{n} - m \qquad (1.1)$$
for an (n-1) bit binary integer, m. ($-m = 2 - m$, for fractional binary representation.) Expanding the multiplication process around the 2's complement definition yields:
$$m \times r = mr \qquad (2.1)$$
$$(-m) \times r = r(2^{n} - m) = 2^{n} r - mr \qquad (2.2)$$
$$m \times (-r) = m(2^{n} - r) = 2^{n} m - mr \qquad (2.3)$$
$$(-m) \times (-r) = (2^{n} - m)(2^{n} - r) = 2^{2n} - 2^{n} m - 2^{n} r + mr \qquad (2.4)$$
In each of these cases, the final term (±mr) gives the correct result and the remaining terms are error. From this Booth concluded that correcting the output simply required applying the following process:
(1) if m is negative, subtract $2^{n} r$ from the product;
(2) if r is negative, subtract $2^{n} m$ from the product.
Applying both of these corrections gives the correct result in the last case: since m and r are both negative, the subtraction is in effect an addition of the negative terms, and $2^{2n}$ is ignored by the machine ($2^{2n}$ is beyond the MSB of the product). Booth applied the correction terms by utilizing the following examine-and-adjust algorithm during the multiplication:
(1) If $m_{n} = 0$, $m_{n-1} = 0$, multiply the existing sum of partial products by $2^{-1}$, i.e. perform an arithmetic shift right.
(2) If $m_{n} = 0$, $m_{n-1} = 1$, add r to the existing sum of partial products and perform arithmetic shift right.
(3) If $m_{n} = 1$, $m_{n-1} = 0$, subtract r from the existing sum of partial products and perform arithmetic shift right.
(4) If $m_{n} = 1$, $m_{n-1} = 1$, perform arithmetic shift right on sum of partial products.
(5) Do not perform arithmetic shift right at MSB.
Note: It is assumed that $m_{-1} = 0$.
Booth's process amounts to adjusting for the correction factor during the multiplication process. His correction scheme works by treating each pair of bits as if they represent a magnitude and a sign bit. Using only this idea, the algorithm must obey the following conditions: if the more significant of the two bits is a zero, the multiplication is treated as a positive integer multiplication; if the more significant bit is a one, the multiplication is treated as if needing two's complement correction. The algorithm must also account for the case when two's complement correction occurs unnecessarily. If the more significant bit is zero and the less significant bit is one, the algorithm adjusts for incorrect two's complement correction by adding r back to the sum.
Booth's algorithm can now be described by four cases:
(1) If the bits are identical, treat the situation like sign extension (shift right).
(2) If $m_{n} = 1$, $m_{n-1} = 0$, treat as a two's complement (subtract r and shift right).
(3) If $m_{n} = 0$, $m_{n-1} = 1$, adjust incorrect two's complement (add r and shift right).
(4) Case (3) cannot occur at the zeroth bit since $m_{-1} = 0$ is assumed.
There are two inherent problems with the Booth algorithm. First, the algorithm does not efficiently lend itself to combinational logic. Each pair of bits must be compared to determine the proper operation. This comparison can be accomplished using a multiplexor to select the proper operation. Each operation, however, cannot proceed until the previous operation is complete; therefore a sequential delay is produced that is on the order of n, for an n-bit multiplication.
Second, Booth's algorithm, as stated in the original publication, produces a (2n-1)-bit solution. Because of this, there exists a case for which the Booth algorithm does not return a complete result. In two's complement notation, the multiplication of the most negative n-bit number times itself requires a 2n-bit result, as discussed in greater detail in the Detailed Description of the Preferred Embodiment (The N-bit by N-bit Proof). Booth never defined this MSB, so an ambiguity exists. In the cases when the 2n-2 bit is 1, the 2n-1 bit can be either 1 or 0.
In 1961 MacSorley published a modification to Booth's algorithm that examines 3 adjacent bits and can operate in n/2 cycles, or a sequential delay on the order of n/2. O. L. MacSorley, "High-Speed Arithmetic in Binary Computers," Proceedings of the IRE, 49, pp 67-91, 1961. MacSorley's modified Booth's algorithm consists of the following conditions:
(1) If $m_{i+1} = 0$, $m_{i} = 0$, $m_{i-1} = 0$, then $P = P/4$, i.e. multiply the existing sum of partial products by $2^{-2}$ (perform two arithmetic right shifts).
(2) If $m_{i+1} = 0$, $m_{i} = 0$, $m_{i-1} = 1$, then $P = (P + r)/4$, i.e. add r and then perform two arithmetic right shifts.
(3) If $m_{i+1} = 0$, $m_{i} = 1$, $m_{i-1} = 0$, then $P = (P + r)/4$, i.e. add r and then perform two arithmetic right shifts.
(4) If $m_{i+1} = 0$, $m_{i} = 1$, $m_{i-1} = 1$, then $P = (P + 2r)/4$, i.e. add 2r and then perform two arithmetic right shifts.
(5) If $m_{i+1} = 1$, $m_{i} = 0$, $m_{i-1} = 0$, then $P = (P - 2r)/4$, i.e. subtract 2r and then perform two arithmetic right shifts.
(6) If $m_{i+1} = 1$, $m_{i} = 0$, $m_{i-1} = 1$, then $P = (P - r)/4$, i.e. subtract r and then perform two arithmetic right shifts.
(7) If $m_{i+1} = 1$, $m_{i} = 1$, $m_{i-1} = 0$, then $P = (P - r)/4$, i.e. subtract r and then perform two arithmetic right shifts.
(8) If $m_{i+1} = 1$, $m_{i} = 1$, $m_{i-1} = 1$, then $P = P/4$, i.e. multiply the existing sum of partial products by $2^{-2}$ (perform two arithmetic right shifts).
where i = 0 to n-1, in increments of 2, for n-bit binary operands.
MacSorley's modified algorithm requires only half as many iterations as Booth's original algorithm, but suffers from the same constraints. To produce a hardware array, an even larger multiplexor must be used for each stage, and each stage depends on the completion of the previous stage.
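As an illustration of the two examine-and-adjust schemes discussed above, the following sketch models Booth's bit-pair rules and MacSorley's three-bit rules in software. The function names, the choice to weight r by a power of two before each add so that the right shifts stay exact, and the returned iteration count are assumptions made for this sketch; they are not taken from Booth's or MacSorley's publications.

```python
def bit(value: int, i: int) -> int:
    """Bit i of a two's complement integer; bit -1 is assumed to be 0."""
    return 0 if i < 0 else (value >> i) & 1


def booth_multiply(m: int, r: int, n: int) -> tuple[int, int]:
    """Booth's rules on bit pairs (m_i, m_{i-1}) of an n-bit multiplier m."""
    p = 0
    for i in range(n):
        pair = (bit(m, i), bit(m, i - 1))
        if pair == (0, 1):
            p += r << (n - 1)   # add r (pre-weighted so shifts stay exact)
        elif pair == (1, 0):
            p -= r << (n - 1)   # subtract r
        if i < n - 1:           # no arithmetic shift right at the MSB
            p >>= 1             # Python's >> is an arithmetic shift
    return p, n                 # product and number of iterations


# MacSorley's table: (m_{i+1}, m_i, m_{i-1}) -> multiple of r to add.
RADIX4_ACTION = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def modified_booth_multiply(m: int, r: int, n: int) -> tuple[int, int]:
    """MacSorley's three-bit rules; n is assumed to be even."""
    if n % 2:
        raise ValueError("n must be even for this sketch")
    p = 0
    steps = 0
    for i in range(0, n, 2):
        triple = (bit(m, i + 1), bit(m, i), bit(m, i - 1))
        p += (RADIX4_ACTION[triple] * r) << n
        p >>= 2                 # two arithmetic right shifts
        steps += 1
    return p, steps
```

Both loops are inherently sequential, since each pass reads the sum left by the previous pass; the second simply makes half as many passes for the same operand width.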
More recently (1973), Baugh and Wooley published an algorithm for simultaneously producing the intermediate terms from two's complement operands so that no correction was required in the product. C. R. Baugh and B. A. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Transactions on Computers, Volume C-22, pp. 1045-1047, December 1973. Baugh and Wooley found that by inverting terms that were ANDed with the sign bits and adding minimal correction terms, the correct result could be directly obtained. FIG. 3 shows the Baugh-Wooley multiplication algorithm.
The following year, Blankenship published a short note on Baugh and Wooley's algorithm, suggesting equivalent simplified logic for the correction terms. P. E. Blankenship, "Comments on 'A Two's Complement Parallel Array Multiplication Algorithm,'" IEEE Transactions on Computers, C-23, p. 1327, 1974. This modification to the Baugh-Wooley algorithm is shown in FIG. 4.
The algorithm has continued to evolve to the point that NANDs are now used to produce the inverted terms and only two correction factors (both constant 1's) are required for multiplications where the lengths of the two operands are equal (see FIG. 5). It may be noted that all multiplication can be expressed as having operands of equal length with sign extension of the most significant bit in the shorter operand.
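A minimal sketch of the equal-length form described above follows. It assumes the commonly stated arrangement in which every term pairing exactly one sign bit with a non-sign bit is produced by a NAND, and the two constant 1's are added in columns n and 2n-1. Those column positions and the function name are assumptions of this sketch, since FIG. 5 is not reproduced here.

```python
def baugh_wooley_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit two's complement integers (n >= 2)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            term = a_bits[i] & b_bits[j]
            # Exactly one of the two bits is a sign bit: use NAND, not AND.
            if (i == n - 1) != (j == n - 1):
                term ^= 1
            total += term << (i + j)
    total += 1 << n                # first constant 1 (assumed column n)
    total += 1 << (2 * n - 1)      # second constant 1 (assumed column 2n-1)
    total &= (1 << (2 * n)) - 1    # keep only the 2n product bits
    # Reinterpret the 2n-bit pattern as a signed value for comparison.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

All terms are unsigned bits summed in one pass, so no separate sign correction step is applied after the addition.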
While Booth, MacSorley, Baugh, and Wooley were working on the production of the partial products, another line of research centered on the problem of adding the partial products once they had been produced.
This "reduction" process is accomplished in the classical parallel multiplier array by the use of numerous adders and half adders. These adders and half adders may be distributed among the AND and NAND gates that generate the partial products or located in segregated arrays. C. H. Chen (ed.), Computer Engineering Handbook, New York, McGraw-Hill, pp 4.5-4.7, 1992; D. P. Agrawal, "High Speed Arithmetic Arrays," IEEE Transactions on Computers, Volume C-28, March 1979, pp 215-224; D. Somasekhar and V. Visvanathan, "A 230 MHz Half-Bit Level Pipelined Multiplier Using True Single-Phase Clocking," IEEE Transactions on VLSI Systems, Volume 1, number 4, December 1993, pp 415-422. Using numerous adders and half adders allows a designer to create a cellular symmetric design that is easy to implement and optimize for space but exhibits excessive timing delays (on the order of 4n gate-delays for an n-bit input). These symmetric, cellular designs are referred to as parallel multipliers, but this is actually a misnomer. These devices could more correctly be called ripple-carry multipliers.
C. S. Wallace first attacked the problem of the addition delay with two new insights in 1964. The main focus of Wallace's work was to use pseudo adders that can combine three terms (the pseudo adders are actually just full adders). C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, Volume 13, pp. 14-17, February 1964.
Wallace's first insight was to connect these adders to combine three partial product bits and produce a sum and carry bit that would be reduced in the next addition cycle. Prior to Wallace's work, half adders were used to add two partial product bits, producing a carry bit and a sum bit. These sum and carry bits would then be added to the correct magnitude bits from the next partial product using full adders (each full adder receiving one carry and two sum inputs). The resulting sum and carries were combined with the next partial product until all terms had been combined. This re-arrangement of the full adder input bits, by itself, reduces the time to add the partial products by a factor of 1.5.
Wallace's second insight was perceiving that since all the partial products were produced simultaneously, all partial product bits of the same order could be combined simultaneously. In the classic parallel multiplier (and even in some more recent pipelined realizations), the first partial products are added together, then the result is added to the next partial product until all partial products have been combined (as in the above discussion). D. P. Agrawal, "High Speed Arithmetic Arrays," IEEE Transactions on Computers, Volume C-28, March 1979, pp 215-224; D. Somasekhar and V. Visvanathan, "A 230 MHz Half-Bit Level Pipelined Multiplier Using True Single-Phase Clocking," IEEE Transactions on VLSI Systems, Volume 1, number 4, December 1993, pp 415-422. This process is similar to a ripple adder. Wallace created an adder tree, later called the Wallace tree, that simultaneously input all bits of the same order into pseudo adders; the results of these additions were then fed into the next layer of pseudo adders until, finally, two long words were produced that were then added in a CLA. This second insight reduces the addition delay exponentially, so that the delay due to the addition of partial products is on the order of the logarithm of the number of partial products. FIG. 6 shows the reduction of 12-bit partial products using a Wallace tree composed of full and half adders. The brackets at the right denote the terms to be combined each cycle. This example reduces the partial products to two terms for final addition in a CLA in 5 cycles.
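The layered three-to-two reduction can be sketched in software as below. This is a simplification, not Wallace's circuit: it treats each partial product as a whole word and applies a full adder across all bit positions at once, it omits half adders, and the function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Bitwise full adders: three words in, a sum word and a carry word out."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def wallace_reduce(terms: list[int]) -> tuple[list[int], int]:
    """Reduce the terms to at most two words; also return the layer count."""
    terms = list(terms)
    levels = 0
    while len(terms) > 2:
        next_terms = []
        full_groups = len(terms) // 3
        for g in range(full_groups):
            s, c = carry_save_add(*terms[3 * g:3 * g + 3])
            next_terms += [s, c]
        # Terms left over after grouping by three pass through unchanged.
        next_terms += terms[3 * full_groups:]
        terms = next_terms
        levels += 1
    return terms, levels


def wallace_multiply(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned n-bit multiply: AND-gate rows, tree reduction, final add."""
    partial_products = [(a if (b >> i) & 1 else 0) << i for i in range(n)]
    terms, levels = wallace_reduce(partial_products)
    # The final addition of the remaining words stands in for the CLA.
    return sum(terms), levels
```

For twelve partial products this loop runs five layers (12, 8, 6, 4, 3, 2 terms), which is consistent with the five cycles cited for FIG. 6.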
As a point of interest, Wallace also commented in his paper that he could "see no good reason to depart from this general scheme" of multiplying by producing partial products (summands) and then adding these terms to produce the product.
In 1965 L. Dadda extended Wallace's concept by further developing the idea of pseudo adders. L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, Volume 34, 1965, pp. 349-365. Dadda defined logical blocks called parallel (n, m) counters as combinational networks with m outputs and n ($\le 2^{m}$) inputs. Dadda used a different counter for each n-bit sub-column (parts of a column containing n bits of the same order) of partial product bits. Each sub-column is reduced to the proper number of sum and carry bits. The next cycle then adds the sum bits to the carry bits of the same order. This is repeated until a single product emerges. This technique can produce a multiplication result in relatively few cycles. FIG. 7 shows a few of the counters developed by Dadda.
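The behavior of a parallel (n, m) counter can be modeled as a function that reports, in binary, how many of its n inputs are 1. The sketch below is a behavioral model only; the function name and the least-significant-bit-first output order are assumptions, and it says nothing about Dadda's gate-level realizations.

```python
def parallel_counter(bits: list[int], m: int) -> list[int]:
    """Return the count of 1 inputs as m output bits, least significant first."""
    count = sum(bits)
    if count >= 1 << m:
        raise ValueError("count does not fit in m output bits")
    return [(count >> k) & 1 for k in range(m)]
```

With three inputs and two outputs this model behaves like a full adder: the first output is the sum bit and the second is the carry into the next higher order.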
In 1976, Dadda revisited this work to include the use of PLAs (Programmable Logic Arrays) and PROMs (Programmable Read Only Memory) to implement the (n, m) counters and incorporate the use of sub-multipliers (multipliers of smaller order than the operands whose outputs could be combined like partial products). L. Dadda, "On Parallel Digital Multipliers," Alta Frequenza, Volume 45, 1976, pp. 574-580.
In 1977, Stenzel et al. categorized multipliers into two groups: purely iterative arrays and devices that generate a matrix of partial product terms. W. J. Stenzel, W. J. Kubitz, and G. H. Garcia, "A Compact High-Speed Parallel Multiplication Scheme," IEEE Transactions on Computers, C-26, pp. 948-957, 1977. These partial product generators are also multipliers in themselves and are simply ROM arrays. A 4-bit by 4-bit multiplication can be realized with a 256 by 8-bit ROM, where the address is the combination of the two operands, and the result is the data stored at the particular address. Stenzel then combined these partial product generators with Wallace tree adders and higher level Dadda counters to produce efficient high-order multipliers. A 32-bit by 32-bit multiplier was described and a 24-bit by 24-bit prototype was produced.
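The ROM lookup described above, and the way such sub-multipliers can be combined into a larger product, can be sketched as follows. The table layout (first operand in the upper four address bits) and the function names are assumptions of this sketch, and the plain additions stand in for the Wallace tree and counter stages of the Stenzel design.

```python
# 256-entry table modeling a 256 by 8-bit ROM: address = (a << 4) | b.
ROM_4X4 = [(address >> 4) * (address & 0xF) for address in range(256)]


def rom_multiply_4x4(a: int, b: int) -> int:
    """Unsigned 4-bit by 4-bit multiply by table lookup."""
    return ROM_4X4[(a << 4) | b]


def multiply_8x8(a: int, b: int) -> int:
    """Unsigned 8-bit by 8-bit multiply built from four 4x4 lookups."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # Each lookup is a partial product weighted by the position of its nibbles.
    return (rom_multiply_4x4(a_lo, b_lo)
            + (rom_multiply_4x4(a_lo, b_hi) << 4)
            + (rom_multiply_4x4(a_hi, b_lo) << 4)
            + (rom_multiply_4x4(a_hi, b_hi) << 8))
```

The largest table entry is 15 times 15, or 225, so every entry fits in the 8-bit word width stated above.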
Two's complement multiplication is possible with the Stenzel model, but the prototype operated on positive integers only and did not incorporate any type of sign correction. The limitation of this design is the practical size of the ROMs. A pure ROM multiplier of size 8-bit by 8-bit is now practical using ROMs with 64K address space. A 16-bit by 16-bit ROM multiplier, however, would be impossible with current technology due to the unavailability of ROMs with a 4G address space and 32-bit words.
What these methods share is a symmetry or algorithmic approach that lends itself to n-bit descriptions. Multiplier designs currently in use represent applications of the techniques described. The only notable improvement has been the pipelining of operations. The most recent published designs still depend on iterative arrays, Baugh-Wooley sign correction, and pipelined addition matrices which may incorporate Wallace tree reduction. D. Somasekhar and V. Visvanathan, "A 230 MHz Half-Bit Level Pipelined Multiplier Using True Single-Phase Clocking," IEEE Transactions on VLSI Systems, Volume 1, number 4, December 1993, pp 415-422; D. P. Agrawal, "High Speed Arithmetic Arrays," IEEE Transactions on Computers, Volume C-28, March 1979, pp 215-224; S. Lee and W. Hsu, "VLSI Systolic Multiplier and Adder for Digital Signal Processing," Signal Processing, Volume 23 (1991), pp 205-213; P. J. Song and G. De Micheli, "Circuit and Architecture Trade-offs for High-speed Multiplication," IEEE Journal of Solid-State Circuits, Volume 26, pp. 1184-1198, September 1991; J. Mori, M. Nagamatsu, and M. Hirano, "A 10-ns 54×54-bit Parallel Structured Full Array Multiplier with 0.5 micron CMOS Technology," IEEE J. of Solid-State Circuits, Volume 26, pp 600-606, April 1991; A. Takach and N. Jha, "Easily Testable Gate-Level and DCVS Multipliers," IEEE Transactions on Computer-Aided Design, Volume 10, No. 7, pp 932-942, July 1991.
In addition to the published literature, U.S. Pat. No. 3,914,589 to Gaskill et al. describes a 4-bit by 4-bit multiplier. The Gaskill multiplier is a Boolean multiplier. The disadvantage of the device is that the design depends heavily on exclusive-OR (XOR) logical gates and produces T and S terms that must be logically combined to produce the multiplication result. The T and S terms have a maximum 3-gate delay and the combination of the T and S terms also requires a 3-gate delay, producing a maximum 6-gate delay (not counting inverter terms). The equations used by Gaskill will be compared in the Detailed Description of the Preferred Embodiment to the Boolean expressions developed in this work.
The problem of multiplying binary numbers in computing machines has been studied in great detail in the last fifty years and is well understood. The basic binary multiplication technique is similar to the method for multiplying decimal numbers. The technique includes the generation of partial products followed by the addition of the partial products to produce the final product. Algorithms have been developed to handle 2's complement numbers and to speed the process of adding the partial products. Despite the extensive research in this area, all the techniques that have been documented for binary multiplication include the production of partial products and the reduction, or addition, of these intermediate terms to produce the final product. As such, a need continues to exist for a multiplier capable of functioning in the quickest manner possible. The present invention provides such a multiplier.