The present invention relates to computer multiplication in general and more specifically to a new low-power, partitioned multiplier implemented using a modified, symmetrical Wallace tree structure that enables the power to parts of the multiplier to be selectively turned on and off.
Processing in multimedia applications, such as, for example, applications using the International Organization for Standardization (ISO) Moving Picture Experts Group (MPEG-1) system, video and audio standard, ISO CD 11172, published June 1996, and digital signal processing require the performance of numerous multiplication operations. In fact, it has been estimated that almost 9% of all instructions in typical scientific programs are multiplication operations. As a result, high speed multipliers have become critical for the efficient performance of microprocessors.
In high performance microprocessors, the multiplier is the core functional unit that performs these multiplication computations. In general, the multiplier is a high-power consuming portion of the chip and, in recent design projects, a multiplier in a floating point unit (FPU) became the hottest spot. Because multipliers are very dense structures and have a lot of activity in their circuits, that is, a high toggling rate, multipliers generally dissipate a large amount of power within a small area, which can lead to hot spots. To achieve the desired reliability requirements of current microprocessors, designers have to balance the trade-offs between power and performance. Therefore, having a high performance and low power multiplier is increasingly important.
In general, when making trade-offs, designers consider two main aspects: 1) which type of circuits to use, and 2) which architecture to use. On the circuit side, Domino circuits have traditionally been the preferred choice for multiplier implementation, because the Domino circuit design provides excellent performance in terms of speed. However, Domino circuits are a high-power circuit style. As microprocessors move toward designs with operating speeds in the multi-GHz range, power quickly becomes the dominating factor in choosing the design style for the microprocessor. The Domino circuit's high consumption of power is due to the need for a clock signal, which also leads to difficulty in clock skew management. Moreover, the Domino circuit design is very noise sensitive.
On the architecture side, the popular choices for multiplier design are in the form of linear tree, array and Wallace tree structures. In many current microprocessor designs, linear tree structures have been used because of the ease of layout and pipelining. However, studies have shown that the use of the Wallace tree structure leads to shorter pipelines, which results in lower latency and higher performance. Current designers have shown innovative methods of laying out Wallace tree structures in a regular fashion and, in fact, a Wallace tree is used in the Intel® Pentium® brand processors. Intel® Pentium® processors are manufactured by the Intel Corporation of Santa Clara, Calif.
FIG. 1 is a logic block diagram illustrating a known 3:2 Carry Save Adder (CSA), Wallace tree structure implementation of a Radix-4 multiplier. In general, multipliers use an array of CSAs to compute the product of two numbers using a ‘ripple-free’ carry scheme to sum the partial products. These CSAs can be implemented using any logic circuit, such as, for example, a Domino circuit or a static circuit. In each 3:2 CSA, 3 partial products are input and 2 partial products are output, that is, a sum partial product and a carry partial product are output. In general, the logic equations of a 3:2 CSA are:
Carry=a*b+b*c+c*a, where a, b and c are the three partial product inputs,
Sum=a⊕b⊕c, where ⊕ stands for ‘exclusive OR’ (XOR).
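As an illustration only, the 3:2 CSA equations above can be modeled in software on whole words, with each bit position treated independently. The sketch below is not taken from the described circuit: the function name `csa_3to2` is hypothetical, unsigned Python integers stand in for the partial products, and the carry word is shifted left by one bit to give it the weight of the next column, which is what allows the two outputs to be added later by an ordinary adder.

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """Model a 3:2 carry-save adder on non-negative integers.

    Returns (sum_word, carry_word) such that
    sum_word + carry_word == a + b + c.
    No carry ripples between bit positions inside this function.
    """
    # Sum = a XOR b XOR c, computed independently per bit.
    sum_word = a ^ b ^ c
    # Carry = a*b + b*c + c*a (majority of the three bits), per bit.
    # Assumption for this model: the carry is shifted left by one
    # position so that it carries the weight of the next column.
    carry_word = ((a & b) | (b & c) | (c & a)) << 1
    return sum_word, carry_word


# Three operands are reduced to two without any carry propagation.
s, cy = csa_3to2(5, 3, 6)
print(s, cy, s + cy)
```

Only the final addition of the two outputs requires a carry-propagating adder; every CSA before that point works column by column.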
In FIG. 1, the multiplier includes 32 CSAs, CSA 0-CSA 31, which are organized in an eight-level, asymmetrical Wallace tree structure architecture. The first level, which includes CSA 0 through CSA 10, receives 33 64-bit partial product inputs to be added. The output partial products from each of the CSAs include the carry value and the sum value obtained from adding the 3 input partial products. An adder output 10 is shown being input to CSA 28, where the adder output 110 is used as a value, such as, for example, z, to enable the performance of operations, such as, x*y+z, where x and y represent the initial input values being multiplied together and which were used to generate the partial product inputs to the multiplier. Unfortunately, the multiplier design in FIG. 1 has a high operational power requirement, especially when implemented using Domino circuits, and is configured such that all of the CSAs must operate, even when the input partial products fill the leftmost CSAs only with “0's” and, thus, the output partial products from these CSAs do not contribute any results that will affect the final output.
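The CSA and level counts quoted above can be checked with a small software model. The sketch below is an assumption-laden illustration, not a reproduction of the wiring in FIG. 1: the names `csa_3to2` and `reduce_tree` are hypothetical, operands are grouped greedily in threes at each level, and the addend z is simply treated as a 34th operand at the first level rather than entering at CSA 28. Under those assumptions, 33 partial products reduce to two in eight levels using 31 CSAs, and adding one more operand brings the count to 32 CSAs.

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder model: sum_word + carry_word == a + b + c."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (c & a)) << 1
    return sum_word, carry_word


def reduce_tree(operands):
    """Greedily reduce operands to at most two rows using 3:2 CSAs.

    Returns (rows, levels, csa_count). This greedy grouping is an
    assumption of the sketch; a real layout may group rows differently.
    """
    rows = list(operands)
    levels = 0
    csa_count = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3  # rows that form complete triples
        next_rows = []
        for i in range(0, full, 3):
            s, cy = csa_3to2(rows[i], rows[i + 1], rows[i + 2])
            next_rows += [s, cy]
            csa_count += 1
        next_rows += rows[full:]  # leftover rows pass through to the next level
        rows = next_rows
        levels += 1
    return rows, levels, csa_count


# 33 partial products, as in the first level described above.
rows, levels, csas = reduce_tree(range(1, 34))
print(levels, csas, sum(rows) == sum(range(1, 34)))

# 33 partial products plus one addend z (34 operands in total).
_, levels_z, csas_z = reduce_tree(range(1, 35))
print(levels_z, csas_z)
```

Because each CSA reduces the number of rows by exactly one, the CSA count depends only on the number of operands, whereas the level count depends on how the rows are grouped. The model also makes the stated drawback visible: a CSA whose three inputs are all zero still evaluates, yet produces only zeros.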
Therefore, what is needed is a new circuit implementation and architecture that provides a low-power yet high-performance multiplier that is simple to implement for use in high performance processors, such as, for example, digital signal processors, microprocessor central processing units (CPUs), graphics processors, and co-processors.