The digital multiplier has become a standard feature of modern microcomputer systems and is a critical component in array processors and digital signal processors. Designs of these digital multipliers are based upon two basic elements: a) generation of partial products, and b) the sum-of-products reduction of the partial products to arrive at an end result. This sum-of-products reduction process has been approached in many different ways in different multiplier designs and architectures.
The basic element of the first step of multiplication is the partial product generator. A partial product generator is actually itself a multiplier, but one which can only multiply small numbers. Taking advantage of the fact that the inputs which are to be multiplied together may each be expressed as a sum of smaller numbers, we then use the relationship:

$$(A+B+C)(a+b+c) = Aa+Ab+Ac+Ba+Bb+Bc+Ca+Cb+Cc$$
Each of the components of the result (i.e., Aa, Ab, Ac, Ba, Bb, etc.) is called a partial product. One scheme of multiplying two binary numbers together is to view each binary input number as the sum of its component binary weighted bits. For example, the binary number 10110 (decimal 22) may be viewed as $1\times2^{4} + 0\times2^{3} + 1\times2^{2} + 1\times2^{1} + 0\times2^{0}$. Each of these component numbers corresponds to the contribution of one of the binary number's bits to its total value. Other schemes have also been used which break up the binary number into components of more than one bit or which analyze the contributions of several different bit pairs to the same binary weighted result. For example, groups of two or more bits could be analyzed at a time and "encoded" or multiplied into partial products having a larger scope. The advantage of using a larger number of bits is that fewer partial products are generated because of the more intensive up-front reduction, requiring less sum-of-products addition to arrive at the result. The increased size and lower speed of partial product generators encompassing more bits must be weighed against the size and speed of the resulting sum-of-products addition network before an appropriate method of partial products generation is chosen.
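The bit-weighted decomposition described above can be illustrated with a short Python sketch. It is a software illustration only, not a description of any hardware; the function names `bit_components` and `partial_products` are hypothetical and chosen for this example.

```python
def bit_components(n):
    """Return the binary-weighted components of a non-negative integer,
    most significant bit first (hypothetical helper for illustration)."""
    width = max(n.bit_length(), 1)
    return [((n >> i) & 1) << i for i in range(width - 1, -1, -1)]


def partial_products(x, y):
    """Form every pairwise product of the components of x and y,
    mirroring (A+B+C)(a+b+c) = Aa + Ab + ... + Cc."""
    return [a * b for a in bit_components(x) for b in bit_components(y)]


components = bit_components(0b10110)
print(components)        # [16, 0, 4, 2, 0]
print(sum(components))   # 22

# The partial products always sum to the full product.
print(sum(partial_products(22, 13)) == 22 * 13)   # True
```

Many of the partial products in this sketch are zero, because each one is the product of two single-bit components.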
The simplest form of partial product generation is known as "AND" gate masking. The "AND" gate may be viewed as a one bit by one bit multiplier which generates a one bit result. For "AND" gate masking, the input numbers (multiplicands) are broken up into one-bit components. Another form of partial product generation well known in the art is called Booth encoding. This method generates partial products by logically analyzing more than one combination of bits at a time, using a broader scope of operation and resulting in fewer partial products. In the current state of the art, there are numerous schemes for partial product generation, any of which may be applied to the present invention.
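As a hedged illustration of the two approaches just named, the following Python sketch generates partial product rows by "AND" gate masking and by one common variant of Booth encoding (radix-4, which examines overlapping groups of three bits). The text does not specify which Booth variant is meant, so the radix-4 choice, the restriction to unsigned operands, and the function names are all assumptions made for this example.

```python
def and_mask_rows(a, b, width):
    """AND-gate masking: one row per bit of b. Row j is a ANDed with
    bit j of b, weighted by 2**j. Assumes 0 <= a, b < 2**width."""
    all_ones = (1 << width) - 1
    rows = []
    for j in range(width):
        bj = (b >> j) & 1
        mask = all_ones if bj else 0   # bit j of b fans out to every AND gate
        rows.append((a & mask) << j)
    return rows


def booth_radix4_digits(b, width):
    """Radix-4 Booth digits of unsigned b, least significant first.
    Each digit is in {-2, -1, 0, 1, 2} and carries weight 4**k."""
    digits = []
    prev = 0                            # implicit bit to the right of bit 0
    for k in range(0, width + 1, 2):    # runs past the top bit (zero extension)
        b0 = (b >> k) & 1
        b1 = (b >> (k + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_radix4_rows(a, b, width):
    """One partial product row per Booth digit: digit * a * 4**k."""
    return [d * a * (4 ** k) for k, d in enumerate(booth_radix4_digits(b, width))]


a, b, width = 13, 22, 5
print(and_mask_rows(a, b, width))       # [0, 26, 52, 0, 208]
print(booth_radix4_digits(b, width))    # [-2, 2, 1]
print(booth_radix4_rows(a, b, width))   # [-26, 104, 208]
print(sum(and_mask_rows(a, b, width)), sum(booth_radix4_rows(a, b, width)))  # 286 286
```

In this example the masking scheme produces five rows and the radix-4 scheme produces three, at the cost of a generator that must also form negated and doubled copies of the multiplicand.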
Digital multipliers typically fall into one of two categories --"array" multipliers and "tree" multipliers--the topology and characteristics of which are both well known in the current state of the art. The terms "array" and "tree" refer to the organization of that part of the multiplier which reduces (i.e., adds up) the partial products to arrive at an end result (i.e., the numerical product of the two multiplicands). These partial products are the result of the first stage of these multipliers, which actually performs the multiplication in many small pieces.
The complexities encountered in the design of parallel adders suitable for the sum-of-products application in parallel multipliers led to the development of the "Wallace Tree". Simply stated, the Wallace Tree is a logical connection of full adders (also referred to as carry save adders) which can be simply replicated and interconnected to perform the sum-of-products calculation. An example of a Wallace Tree is the type SN54LS275 4-bit-by-4-bit binary multiplier with 3-state outputs, 7-bit slice Wallace Tree with 3-state outputs from Texas Instruments.
"Array" multipliers are formed by the use of arrays of full adders arranged in rows such that each row of adders reduces one further partial product. Because all intermediate partial products are maintained in full parallel carry-save form, carries do not propagate (ripple). As such, the delay of the sum-of products adder array is independent of partial product width, and is dependent only upon the depth of the array. Although each row may be extremely fast, they do not make efficient use of their hardware resources, since each row of adders is used only once in any given multiplication and remains idle while its intermediate result propagates through the rest of the array. It is possible to pipeline the rows so that hardware efficiency may be improved by allowing several multiplications to be overlapped. This also improves the throughput of the resultant multiplier, but increases the overall amount of hardware required and increases the delay (latency) from input to result by the additional delay of the pipeline registers and the number of clocks required to propagate the result to the output. Generally, these constraints also apply to "tree" multipliers.
It is also possible to build iterative structures whereby pipeline registers may be employed to re-use the same row of adders iteratively. In such a structure, the amount of hardware is greatly reduced, but the latency is increased by the delays associated with the pipeline register and throughput is greatly reduced. Generally, array adders are implemented as small, but relatively slow structures. Their depth tends to be linearly proportional to the number of partial products.
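The array and iterative organizations can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not a gate-level design: operands are non-negative integers, a whole row of full adders is modeled with bitwise operators, the final carry-propagate addition is modeled with `+`, and the names `carry_save_add`, `array_reduce`, and `iterative_multiply` are hypothetical.

```python
def carry_save_add(x, y, z):
    """One row of full adders: three operands in, a sum vector and a
    carry vector out. No carry ripples along the row."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def array_reduce(rows):
    """Linear array: each carry-save row absorbs one further partial
    product, so the depth grows linearly with the number of rows."""
    rows = list(rows) + [0] * max(0, 2 - len(rows))
    s, c = rows[0], rows[1]
    depth = 0
    for row in rows[2:]:
        s, c = carry_save_add(s, c, row)
        depth += 1
    return s + c, depth          # final carry-propagate addition


def iterative_multiply(a, b, width):
    """Iterative structure: a single carry-save row is re-used once per
    clock, with s and c standing in for the pipeline registers."""
    s = c = 0
    clocks = 0
    for j in range(width):
        row = (a << j) if (b >> j) & 1 else 0
        s, c = carry_save_add(s, c, row)
        clocks += 1
    return s + c, clocks


rows = [0, 26, 52, 0, 208]              # AND-masked rows for 13 * 22
print(array_reduce(rows))               # (286, 3)
print(iterative_multiply(13, 22, 5))    # (286, 5)
```

In this model the array uses one adder row per additional partial product, while the iterative form uses a single row and one clock per partial product, which reflects the hardware and throughput tradeoff described above.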
Tree adders, such as the Wallace tree, are organized to reduce partial products faster than arrays. Their organization is such that the depth of the tree structure tends to be proportional to the logarithm (base 1.5, since full adders are 3/2 adders) of the number of partial products. As such, the number of adders required is considerably fewer than for a conventional array adder. One major problem, however, has been the layout of tree adders. The tree attempts to add together as many partial products as are available at any given time. Unfortunately, the bits that are ready to be summed are usually not in physical proximity with one another. As a result, the apparent savings due to fewer rows is usually lost in the cumbersome interconnections required to implement the rows. This makes the layout of tree adders extremely cumbersome and highly irregular, requiring very large wiring channels when implemented on integrated circuits. In fact, the large wiring channels and interconnect density make tree adders considerably larger than conventional arrays, even though it might appear on the surface that they would require fewer active components.
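The depth relationship can be checked with a small Python model. It assumes a simple Wallace-style grouping in which, at every level, the available rows are taken three at a time through a carry-save row and any leftover rows pass through unchanged; real designs may group bits differently. The function names are hypothetical, and the array depth used for comparison assumes one carry-save row per partial product beyond the first two.

```python
def carry_save_add(x, y, z):
    """One row of full adders (3-to-2 reduction) on non-negative integers."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def tree_reduce(rows):
    """Reduce rows level by level, three at a time, until two remain.
    Returns the final sum and the number of carry-save levels used."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(*rows[i:i + 3]))
        nxt.extend(rows[full:])          # leftover rows wait for the next level
        rows = nxt
        levels += 1
    return sum(rows), levels             # sum() models the final adder


def wallace_levels(n):
    """Number of 3-to-2 levels needed to reduce n rows to two."""
    levels = 0
    while n > 2:
        n = 2 * (n // 3) + n % 3
        levels += 1
    return levels


for n in (8, 16, 32):
    print(n, wallace_levels(n), n - 2)   # rows, tree levels, array rows
# 8 4 6
# 16 6 14
# 32 8 30

a, b = 201, 173
rows = [(a << j) if (b >> j) & 1 else 0 for j in range(8)]
print(tree_reduce(rows) == (a * b, 4))   # True
```

The model captures only logical depth. It says nothing about wiring, which is where the text above locates the practical cost of the tree.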
Another inherent feature of tree adders is that they do not physically lay out (in a semiconductor device) in a rectangular block. The shape of the typical tree adder is that of a parallelogram. Unfortunately, this wastes a great deal of space and causes poor chip area utilization.
FIGS. 1a and 1b contrast the topology of the conventional array with that of the tree. In FIG. 1a, an array adder 100 adds up four equally weighted inputs 102a, 102b, 102c, and 102d to arrive at an output 104. This requires four identical "rows" (or "stages") 106, 108, 110 and 112 of adders 130a, 130b, 130c and 130d. In FIG. 1b, a tree adder 120 adds up the same four equally weighted inputs 102a, 102b, 102c and 102d to arrive at a result 122, but is organized in only two non-identical "rows" 124 and 126. In like manner to the diagram of FIG. 1a, each row is made up of adders 130e, 130f and 130g. (Both of the examples of FIGS. 1a and 1b are greatly simplified by ignoring the effects of "carries in" and "carries out".)
Only rarely does a designer encounter exactly the same multiplier requirements twice. Different applications impose different requirements on multiplicand width, available space, and speed. As a result, designs must usually be done from scratch, requiring a great deal of manual effort. It would therefore be extremely desirable to have a single multiplier architecture which allows for a very regular layout, and which can be adapted readily for different speed/size tradeoffs.