A multiplication circuit or multiplier consists mainly of three parts: (1) a partial product generator made up of a matrix of AND logic gates, each operating on one bit of a multiplicand and one bit of a multiplier (here, the number, as opposed to the circuit), (2) a multiplier array (also called an adder array) made up of columns of adders which reduce the partial products by summation to two words, usually called the "sum" word and the "carry" word, and (3) a vector merging adder for adding the sum and carry words to result in one output word, the product. When multiplying two binary numbers, an M-bit multiplicand and an N-bit multiplier, M×N partial product terms are usually generated (although there may be some additional terms to handle negative numbers), which could alternatively be thought of as N M-bit partial products, and the resulting product generally has M+N bits. In most multiplication circuits, both multiplicand and multiplier are of the same N-bit size, and the product is therefore 2N bits wide.
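The three parts described above can be illustrated with a minimal behavioral sketch in Python. It is not a model of any particular circuit: it handles unsigned operands only, the function names are hypothetical, and the adder array is a simple linear chain of three-to-two carry-save stages chosen for brevity.

```python
def partial_products(a, b, m, n):
    """Part 1: an M x N matrix of AND terms, grouped into N shifted M-bit rows."""
    rows = []
    for j in range(n):
        bj = (b >> j) & 1              # one bit of the multiplier
        row = 0
        for i in range(m):
            ai = (a >> i) & 1          # one bit of the multiplicand
            row |= (ai & bj) << (i + j)
        rows.append(row)
    return rows


def carry_save_add(x, y, z):
    """One row of full adders: three words in, a sum word and a carry word out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def multiply(a, b, m, n):
    """Unsigned M-bit by N-bit multiply producing an (M+N)-bit product."""
    # Part 2: reduce the partial products to a sum word and a carry word.
    s, c = 0, 0
    for row in partial_products(a, b, m, n):
        s, c = carry_save_add(s, c, row)
    # Part 3: the vector merging adder combines the two words.
    return (s + c) & ((1 << (m + n)) - 1)


print(multiply(13, 11, 4, 4))  # 143
```

Each carry-save stage preserves the arithmetic total of its three inputs, so only the final merging addition has to propagate carries across the full word.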
Multiplication circuits, when used in digital signal processors, are combined with an accumulator, so that digital filtering and other signal processing functions can be readily performed. The basic operation is ACC:=ACC+(A*B), or ACC:=ACC-(A*B). That is, typically the accumulator will add the result of the multiplication to, or subtract it from, the previous accumulated value. The accumulator is typically P bits wide, where P>2N, 2N bits is the width of the multiplier product, and the leftmost (most significant) P-2N bits, called guard bits, are there to prevent overflow. U.S. Pat. No. 4,575,812 to Kloker et al. describes one such multiplier/accumulator circuit. A straightforward implementation of a multiplier/accumulator circuit has the accumulator adder follow the vector merging adder of the multiplier, so that a first addition adds the sum and carry words to form the multiplication product and a second addition then adds that product to the value in the accumulator. Alternatively, the accumulator could be integrated with the multiplier by adding an extra row of adders to the multiplier array and providing the two-word result to the vector merging adder. Since only one final adder has to be provided, this simplifies the design effort, and will also improve speed somewhat.
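The role of the guard bits can be seen in a short behavioral sketch. The function name `mac` and the example widths N=16 and P=40 are assumptions made for illustration; the text above only requires P>2N. The accumulator is modeled as a P-bit two's complement register.

```python
def mac(acc, a, b, n, p, subtract=False):
    """ACC := ACC + (A*B) or ACC := ACC - (A*B) in a P-bit two's complement register."""
    assert p > 2 * n, "the accumulator needs P - 2N guard bits"
    prod = a * b                         # 2N-bit product of two signed N-bit operands
    result = acc - prod if subtract else acc + prod
    result &= (1 << p) - 1               # keep only P bits, as the hardware register would
    if result >> (p - 1):                # reinterpret the top bit as the sign
        result -= 1 << p
    return result


# Assumed example sizes: N = 16 and P = 40, which gives 8 guard bits.
acc = 0
for _ in range(256):
    acc = mac(acc, -32768, -32768, 16, 40)   # largest possible product, 2**30
print(acc == 2**38)  # True
```

With these example widths, 256 worst-case products accumulate to 2**38, which still fits in the 40-bit register; without guard bits a sum of that size could not be represented.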
Regardless of whether a multiplier alone or a combined multiplier/accumulator circuit is being considered, the critical path that determines operating speed consists of delay through the multiplier array and delay through the final adder (plus any delay through a separate accumulator adder). The multiplier is the slowest part of a digital signal processor, so any improvement in the speed of the multiplier will improve the overall speed of the processor. High speed processing is required, for example, for implementing sophisticated speech and channel coding algorithms for digital cellular telephone communication. Another factor is layout area and regularity. A regular floorplan is easy to design and lay out, whereas an irregular floorplan takes considerably more time and effort to lay out. The choice of a multiplier architecture usually involves tradeoffs between area and speed. Tree multiplier architectures have a delay that grows as O(log N), whereas array multiplier architectures have a delay that grows as O(N) (where N is the word length in bits). Thus, tree architectures are faster. However, because tree multipliers require large shifts of data perpendicular to the data path, their implementation is routing intensive, requiring a larger circuit area than array multipliers. Tree architectures also tend to be very irregular in their layout.
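The difference between O(log N) and O(N) growth can be made concrete by counting adder levels. The sketch below is a simplified model with hypothetical function names: it counts only levels of three-to-two (full) adders needed to reduce N partial product rows to two words, and ignores wiring delay and the final adder.

```python
def wallace_levels(rows):
    """Levels of 3:2 adders when as many rows as possible are compressed in parallel."""
    levels = 0
    while rows > 2:
        # Every full group of three rows becomes two rows; leftover rows pass through.
        rows = 2 * (rows // 3) + rows % 3
        levels += 1
    return levels


def array_levels(rows):
    """Levels of 3:2 adders when one new row is absorbed per level."""
    return max(rows - 2, 0)


for n in (8, 16, 32, 64):
    print(n, wallace_levels(n), array_levels(n))
```

For N=16 this model gives 6 levels for the tree against 14 for the array, and for N=32 it gives 8 against 30.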
In U.S. Pat. Nos. 5,343,417 and 5,586,071, Flora describes a Wallace tree multiplier architecture in which the columns of full adders and half adders that are used in the multiplier to reduce the partial products by successive addition to sum and carry words are chosen so that the particular inputs to be added at each adder level comply with prescribed rules that enhance the multiplier's operating speed. U.S. Pat. No. 5,181,185 to Han et al. and U.S. Pat. No. 5,504,915 to Rarick disclose other high speed parallel multipliers employing modified Wallace tree adders for summing the columns of partial products. All of these disclosed multiplication circuits illustrate the basic layout irregularity that is characteristic of tree multiplier architectures. The modified Wallace trees sacrifice some speed to obtain greater layout regularity as compared with pure Wallace tree architectures.
U.S. Pat. No. 4,901,270 to Galbi et al., and an article by G. Goto et al. in IEEE Journal of Solid-State Circuits, vol. 27, no. 9, September 1992, pages 1229-1234, describe use of four-to-two compressor adders in tree multipliers for further improving their speed. In U.S. Pat. No. 5,347,482, Williams discloses that using nine-to-three adders in a Wallace tree simplifies layout and signal routing because of the larger basic building blocks of the tree, yet operates in the same number of adder delays as a three-to-two (full) adder. In U.S. Pat. No. 5,265,043, Naini et al. disclose a Wallace tree multiplier architecture that is provided with its carry-save adders arranged in an L-fold layout or floorplan in order to improve that architecture's layout regularity and reduce the required layout area.
G. J. Hekstra et al., in "A Fast Parallel Multiplier Architecture", Proceedings of IEEE Symposium on Circuits and Systems, pages 2128-2131, 1992, describe a regular array architecture with a delay that grows as O(√N). Thus, it offers an alternative to the compact and regular, but slow, array multiplier architecture and to the fast, but irregular and area-intensive, tree multiplier architectures, like the Wallace tree multiplier. The Hekstra multiplier architecture has an "array of arrays"-based structure consisting of a number of subarrays producing a series of partial sums feeding into a main array adding the partial sums to form the product. The main array stages consist of two rows of full adders in a four-to-two reductor configuration. The subarrays consist of rows of full adders together with the partial product generators. The sizes of the subarrays vary and have been carefully chosen to balance the propagation delays so that addends arrive at a main array stage simultaneously with the previous stage's partial sum. In Hekstra's implementation, this occurs when the sizes of the subarrays, i.e. the number of full adder rows, increase in steps of two from one subarray to the next.
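A rough way to see where the square-root growth comes from is to count how many subarrays are needed when their sizes grow in steps of two. The sketch below is a toy model only: the starting size of one row, the trimming of the last subarray, and the function name are assumptions made for illustration, not details taken from Hekstra's paper.

```python
import math


def subarray_sizes(total_rows, first=1, step=2):
    """Split total_rows full adder rows into subarrays whose sizes grow by `step`.

    `first` and the trimming of the final subarray are illustrative assumptions.
    """
    sizes = []
    size = first
    remaining = total_rows
    while remaining > 0:
        rows = min(size, remaining)    # the last subarray may be smaller
        sizes.append(rows)
        remaining -= rows
        size += step
    return sizes


print(subarray_sizes(16))  # [1, 3, 5, 7]
for n in (16, 32, 64):
    print(n, len(subarray_sizes(n)), math.ceil(math.sqrt(n)))
```

Under these assumptions the first k sizes 1, 3, 5, ... add up to k squared, so covering N rows takes about √N subarrays, and both the largest subarray and the number of main array stages grow in proportion to √N rather than N.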
An article by T. Sakuta et al. in IEEE Symposium on Low Power Electronics: Digest of Technical Papers, pages 36-37, October 1995, highlights the importance of delay balancing in order to minimize spurious transitions and thereby to minimize unnecessary power dissipation. Adders start computing at the same time without waiting for the propagation of sum and carry signals from a previous stage, so that if the addends do not arrive simultaneously at an adder, spurious transitions will result. These spurious transitions also propagate to subsequent stages, resulting in a growing number of transitions from one stage to the next. Conventional array multiplier architectures are inherently unbalanced, and thus tend to consume a lot of power. In contrast, Wallace-tree multipliers are naturally balanced due to their inherent parallel structure, and thus have a lower probability of occurrence of spurious transitions. Delay circuits could be inserted into the signal paths of any product term inputs that skip an adder ladder to synchronize them with the other inputs of corresponding adders, as taught by T. Sakuta et al. As for the aforementioned Hekstra architecture, that multiplier happens to be delay balanced only because of an appropriate selection of subarray sizes.
Although the Hekstra-type multiplier architecture is very regular in comparison with the Wallace and other tree architectures and nearly as compact as a conventional array multiplier, and is also much faster than an array multiplier, it is still somewhat slower than the tree multiplier architectures. Because of their naturally balanced parallel structure, it has been relatively easy to incorporate four-to-two, nine-to-three and other compressor adder structures into the tree multipliers without destroying their balanced signal propagation, in order to increase their operating speed. Moreover, modified tree architectures and hybrid tree-array architectures have allowed designers to improve regularity and reduce circuit area to a certain extent without sacrificing too much speed. Accordingly, where space is not at a premium, tree architectures have become the design of choice. Where small circuit area is essential, circuit designers have been forced to cope with array multipliers, despite their slow speed. The Hekstra-type multiplier is not well known and has been generally ignored. Since the one-sided architecture of adder subarrays feeding into a single main array is not inherently balanced, but rather balanced only by construction with a proper selection of subarray sizes, any modifications would require great care if balance is to be maintained.
It is an object of the present invention to provide a modified Hekstra-type multiplier architecture with improved operating speed, without sacrificing circuit area and regularity or destroying the delay balance.