1. Field of the Invention
The present invention relates generally to a multiplier for integrated circuits, very large scale integrated circuits (VLSI), and ultra large scale integrated circuits (ULSI). More particularly, the present invention relates to a high-speed multiplication technique utilizing a layout architecture requiring very little area on a chip.
2. Related Art
Digital systems use a variety of multiplication algorithms. FIG. 1A illustrates a simple example of multiplication where 5 × 3 = 15. A binary multiplicand is multiplied by a binary multiplier. A series of shifted partial products is formed of zeros and ones. The shifted partial products are then added, resulting in a product. When an X-bit number is multiplied by a Y-bit number in a digital system, the resulting product can be up to X+Y bits in length.
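The shift-and-add procedure described above can be sketched in a few lines of Python. This is only an illustrative software model for non-negative integers, not the circuit of FIG. 1A; the function name `shift_add_multiply` is hypothetical.

```python
def shift_add_multiply(multiplicand: int, multiplier: int) -> int:
    """Multiply two non-negative integers by summing shifted partial products."""
    product = 0
    shift = 0
    # Walk the multiplier from its least significant bit upward.
    while multiplier >> shift:
        if (multiplier >> shift) & 1:
            # A 1 bit contributes the multiplicand shifted into position;
            # a 0 bit contributes an all-zero partial product.
            product += multiplicand << shift
        shift += 1
    return product
```

For the example in the text, `shift_add_multiply(5, 3)` returns 15. The operands are 3 bits and 2 bits wide, and the product fits within 3 + 2 = 5 bits.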
In simple binary multiplication, partial products are formed by multiplying the multiplicand by one bit of the multiplier at a time, starting with the least significant bit of the multiplier. One of the simplest ways to generate partial products is to use AND gates as bit-by-bit multipliers. Using this simple multiplication technique, an X-bit multiplier generates X partial products.
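As a rough illustration of AND gates acting as bit-by-bit multipliers, the sketch below forms one row of partial-product bits per multiplier bit. Bits are stored least significant first, and the helper names (`to_bits`, `partial_products`, `sum_partial_products`) are hypothetical.

```python
def to_bits(value: int, width: int) -> list:
    """Return `width` bits of a non-negative integer, least significant first."""
    return [(value >> i) & 1 for i in range(width)]

def partial_products(multiplicand_bits: list, multiplier_bits: list) -> list:
    """Row j holds every multiplicand bit ANDed with multiplier bit j."""
    return [[a & b for a in multiplicand_bits] for b in multiplier_bits]

def sum_partial_products(rows: list) -> int:
    """Add the rows, shifting row j left by j bit positions."""
    total = 0
    for j, row in enumerate(rows):
        row_value = sum(bit << i for i, bit in enumerate(row))
        total += row_value << j
    return total
```

With a 3-bit multiplicand 5 and a 2-bit multiplier 3, `partial_products(to_bits(5, 3), to_bits(3, 2))` yields two rows, one per multiplier bit, and `sum_partial_products` of those rows gives 15.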
Prior approaches employ encoding techniques to reduce the number of partial products. Reducing the number of partial products reduces the number of additions that need to be performed. Reducing the number of additions in turn reduces the number of clock cycles necessary for binary multiplication.
One prior approach employs a Booth algorithm as an encoding technique to reduce the number of partial products by at least one half. Because multiplication is a time-consuming operation, the Booth algorithm is used to increase its speed. FIG. 1B illustrates Booth coding, which is a powerful algorithm for signed-number multiplication. Booth coding, sometimes called Booth recoding and decoding, essentially reduces the number of partial products needed to generate a product. An example of Booth's algorithm and its implementation in hardware is explained in Digital Computer Arithmetic: Design and Implementation, J. F. Cavanagh, chapter 3, McGraw-Hill Book Company (1984) (including Wallace Trees discussed below) (hereby incorporated by reference).
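The text does not say which Booth variant FIG. 1B depicts. The sketch below assumes the common radix-4 (modified Booth) recoding, in which overlapping groups of three multiplier bits are mapped to signed digits in {-2, -1, 0, 1, 2}, so that a W-bit two's-complement multiplier yields about W/2 partial products. The function names are hypothetical.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list:
    """Recode a `width`-bit two's-complement multiplier into radix-4 digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    Assumes `multiplier` fits in `width` signed bits.
    """
    if width % 2:
        width += 1  # sign-extend to an even number of bits

    def bit(i: int) -> int:
        # Python's >> is arithmetic, so negative values are sign-extended.
        return 0 if i < 0 else (multiplier >> i) & 1

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(width // 2)]

def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Sum one partial product per radix-4 digit (weight 4**i)."""
    digits = booth_radix4_digits(multiplier, width)
    return sum(d * multiplicand * 4 ** i for i, d in enumerate(digits))
```

Under this assumption a 4-bit multiplier produces two digits rather than four one-bit partial products. For example, 3 recodes to the digits -1 and +1 (that is, -1 + 1·4), and `booth_multiply(5, 3, 4)` returns 15.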
Once partial products are formed, they must be added together. Several types of conventional adders exist. One conventional two-input adder is a carry propagate adder (CPA). The CPA adds a sum vector to a carry vector to produce one final sum. A CPA is typically built from full adders. A full adder is a circuit that generates the sum of two operand bits plus a carry-in bit. The carry-in bit represents the carry-out of the previous lower-order bit position. The full adder produces two outputs: a sum bit and a carry-out bit.
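A minimal sketch of the full adder described above, together with one simple way to chain full adders into a carry propagate adder, is shown below. The ripple-carry arrangement is an assumption made for illustration; practical CPAs often use faster carry schemes. Bits are stored least significant first, and the function names are hypothetical.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple:
    """Add two operand bits and a carry-in; return (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return sum_bit, carry_out

def carry_propagate_add(x_bits: list, y_bits: list) -> list:
    """Ripple-carry addition of two equal-length bit lists (LSB first).

    The result has one extra bit to hold the final carry-out.
    """
    carry = 0
    result = []
    for a, b in zip(x_bits, y_bits):
        # Each position consumes the carry-out of the lower-order position.
        s, carry = full_adder(a, b, carry)
        result.append(s)
    result.append(carry)
    return result
```

Adding 5 (`[1, 0, 1]`) and 3 (`[1, 1, 0]`) gives `[0, 0, 0, 1]`, which is 8 with the carry rippling through all three positions.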
Another type of adder is a carry-save adder (CSA). The CSA is simply a number of independent full adders or half adders. Full CSAs have three inputs and half CSAs have two inputs. CSAs are useful when multiple additions are performed, as is required in high-speed multiplication.
A multiplicity of CSAs defers carry propagation until all the additions are completed. Typically, a CPA is used during a final clock cycle to complete carry propagation for all additions performed by the CSAs.
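The division of labor between CSAs and the final CPA can be modeled at word level as follows. Each bit position of a full CSA is an independent full adder, so three operands reduce to a sum vector and a carry vector with no carry rippling between positions. This is an illustrative sketch for non-negative integers; the function names are hypothetical, and the ordinary `+` at the end stands in for the CPA.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple:
    """Reduce three operands to (sum_vector, carry_vector) without propagating carries."""
    sum_vector = a ^ b ^ c
    # Each carry-out belongs to the next higher bit position, hence the shift.
    carry_vector = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vector, carry_vector

def add_many(operands) -> int:
    """Add any number of non-negative integers using CSAs, then one final CPA."""
    pending = list(operands)
    while len(pending) > 2:
        sum_vector, carry_vector = carry_save_add(*pending[:3])
        pending = pending[3:] + [sum_vector, carry_vector]
    # Only this last step performs carry propagation.
    return sum(pending)
```

For example, `carry_save_add(5, 3, 6)` returns the pair `(0, 14)`, whose ordinary sum is 14, the same as 5 + 3 + 6.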
FIG. 2 illustrates a block diagram of an array multiplier 201. As shown in FIG. 2, two 8-bit numbers can be multiplied together using seven CSAs and a CPA. With the hardware of FIG. 2, multiplication can be pipelined, increasing the total throughput. However, it is typically not possible to fit an array large enough to multiply two double-precision numbers on a single chip and have space left over for other arithmetic operations using the design of FIG. 2.
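The CSA count quoted above can be reproduced with a simple word-level model. The sketch assumes a linear chain in which the first partial product seeds the sum vector and each subsequent CSA folds in one more partial product; this is one plausible reading and is not taken from FIG. 2 itself. The function name `array_multiply` is hypothetical.

```python
def array_multiply(multiplicand: int, multiplier: int, width: int = 8) -> tuple:
    """Multiply unsigned `width`-bit numbers with a linear chain of CSAs.

    Returns (product, number_of_csa_stages_used).
    """
    # One shifted partial product per multiplier bit.
    rows = [(multiplicand << j) if (multiplier >> j) & 1 else 0
            for j in range(width)]
    sum_vec, carry_vec = rows[0], 0
    csa_count = 0
    for row in rows[1:]:
        # Each CSA combines the running sum, the running carry, and one new row.
        new_sum = sum_vec ^ carry_vec ^ row
        carry_vec = ((sum_vec & carry_vec) | (sum_vec & row) | (carry_vec & row)) << 1
        sum_vec = new_sum
        csa_count += 1
    # Final carry propagate addition.
    return sum_vec + carry_vec, csa_count
```

With the default 8-bit width, `array_multiply(200, 123)` returns `(24600, 7)`: seven CSA stages in series followed by one carry propagate addition. Because the stages are strictly sequential, the delay grows linearly with operand width.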
Thus, VLSI designers tend to use other types of multipliers such as multipass array multipliers and even/odd multipliers (both not shown). For examples of such designs, see Computer Architecture: A Quantitative Approach, J. Hennessy & D. Patterson, Morgan Kaufmann Publishers, Inc., Appendix A (1990) (including Wallace Trees discussed below) (hereby incorporated by reference). These popular designs, although easier to implement in VLSI, are not the fastest designs available.
One of the fastest designs for increasing the speed of multiplication is a Wallace tree. A conventional Wallace tree multiplier is illustrated in FIG. 3. As shown, the Wallace tree is a combination of CSAs and a CPA. The idea of this design is that two adds proceed in parallel or, to put it another way, most streams of data pass through full adders allowing a multiplicity of pipelined stages. Wallace trees run at a minimum of twice the speed of array multipliers discussed above.
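The parallelism of a Wallace tree can be illustrated with a word-level model in which, at every level, all available groups of three rows are reduced by CSAs at the same time. This is a simplified sketch for unsigned operands and is not a transcription of FIG. 3; real Wallace trees are usually described per bit column, and the function name `wallace_multiply` is hypothetical.

```python
def wallace_multiply(multiplicand: int, multiplier: int, width: int = 8) -> tuple:
    """Multiply unsigned `width`-bit numbers with a tree of CSAs.

    Returns (product, number_of_csa_levels).
    """
    rows = [(multiplicand << j) if (multiplier >> j) & 1 else 0
            for j in range(width)]
    levels = 0
    while len(rows) > 2:
        next_rows = []
        grouped = len(rows) - len(rows) % 3
        # All CSAs within one level are independent and can operate in parallel.
        for i in range(0, grouped, 3):
            a, b, c = rows[i:i + 3]
            next_rows.append(a ^ b ^ c)
            next_rows.append(((a & b) | (a & c) | (b & c)) << 1)
        # Rows that did not fit into a group of three pass through unchanged.
        next_rows.extend(rows[grouped:])
        rows = next_rows
        levels += 1
    # Final carry propagate addition of the remaining rows.
    return sum(rows), levels
```

In this model, eight partial products shrink to 6, 4, 3, and then 2 rows, so `wallace_multiply(200, 123)` returns `(24600, 4)`: four CSA levels, compared with the longer sequential chain of a linear array. Note also that the number of CSAs differs from level to level, which hints at the irregular structure such trees have in layout.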
Wallace trees look great on paper. They require fewer gates than other conventional multipliers. Nevertheless, Wallace trees are no longer the choice of VLSI designers, because they are very difficult to implement in VLSI. Designers have discovered that Wallace trees do not have the simple, regular structure needed for VLSI design. Consequently, conventional Wallace trees require a significant amount of area on a chip, leaving less room for other circuitry. Referring to FIG. 3, notice that stages 302, 304, 306 and 308 have no uniformity. Each stage of the Wallace tree is different from the next. This is part of the reason why Wallace trees are so difficult to implement in silicon. Since Wallace trees have irregular structures, they also require a significant amount of effort to verify for correctness. Additionally, mistakes in layout are very difficult to detect due to the Wallace tree's irregular structure. For these reasons, VLSI designers have, for the most part, chosen other designs such as binary-tree multipliers (not shown), even though they are not as fast as Wallace trees.
In short, Wallace trees require an abundance of chip area and require a significant amount of effort to implement due to their irregular structure. Therefore, what is needed is a device as fast as a Wallace tree, but having a regular structure requiring less area on a chip and less effort to implement.