The present invention relates generally to the field of hardware used for performing computer arithmetic operations. More specifically, the present invention pertains to circuits capable of performing arithmetic operations on operands of signed and non-signed binary values.
Hardware circuits for performing arithmetic operations are indispensable in every computer system, cellular phone and most digital audio/video equipment. In real-time applications (e.g., simulations, speech recognition, video teleconferencing, computer games, streaming audio/video etc.), the overall system performance is heavily dependent on the speed of the multipliers. For instance, processing digital images at 30 frames/second requires nearly 2.2 million multiplication operations per second. These systems require multiplication operations be performed at a very high speed.
A technique that is commonly used in high-speed multiplier is carry-save addition. In carry-save addition, a carry-save adder (CSA) accepts three n-bit operands and generates two n-bit results which include an n-bit partial sum, and an n-bit carry. A second CSA accepts these two n-bit results and another n-bit input operand, and generates a new partial sum and carry. A CSA is therefore, capable of reducing the number of operands to be added from 3 to 2 without any carry propagation. When the number of operands is eventually reduced to 2, the operands can be added together with a carry propagate adder (CPA). Carry save addition has the advantage of minimizing delay caused by carry propagation.
FIG. 1 illustrates a parallel 5-bit-by-5-bit multiplier 100, which is comprised of CSAs 110, 120 and 130 and CPA 140. CSAs are made up of full adders (FA) and half adders (HA). As shown in FIG. 1, each multiplicand multiple is denoted by MPij which is obtained as the logical-AND of bit i of the multiplier and bit j of the multiplicand. The first-level CSA 110 is used for the first three multiplicand multiples, and at the next level, CSA 120 receives the outputs of the first level CSA 110 and adds to them the fourth multiplicand multiples. The outputs of the second level CSA 120 are provided to the third level CSA 130 adds the fifth multiplicand multiples. The outputs of the third level CSA 130 are two numbers which are then summed together by CPA 140 to generate the final product. This type of parallel multiplier is also commonly referred to as parallel-array multipliers:
Another class of parallel multiplier, proposed by C.S. Wallace in 1963, and commonly referred to as Wallace tree multipliers or simultaneous multipliers, are also commonly used. A block diagram of an exemplary six-operand Wallace tree multiplier 200 is illustrated in FIG. 2. As shown in FIG. 2, Wallace tree multiplier 200 includes CSAs 210, 220, 230 and 240 and CPA 250. The first level CSAs 210 and 220 receive the six operands and reduce them two four outputs. The second level CSA 230 receives the shifted carry output and sum output of the CSA 220 and the sum output of CSA 210 and generates two outputs, which are added to the shifted carry output of CSA 210 to generate two numbers. The two numbers which are added together by CPA 250 to generate the final product. In general, a Wallace tree multiplier has fewer levels (or stages) of carry-save adders than a straight forward parallel-array multiplier of the same width, and thus, better performance. A discussion of the parallel array multipliers and Wallace tree multipliers can be found in A. R. Omondi, xe2x80x9cComputer Arithmetic Systems: Algorithms, Architecture and Implementation,xe2x80x9d pp. 160-170, 1994.
A modification of the Wallace tree multiplier, proposed by L. Dadda in 1965 and generally called a Dadda multiplier, has less logic. There also exist a number of parallel multipliers in which other counters are used to reduce the depth of the tree. This, however, does not necessarily result in proportionate improvement in performance because the increase in the complexity of the counters also means an increase delay through them.
As the power of computer systems, cellular phones, digital A/V equipment etc. continues to increase, the demand for faster multipliers is likely to rise. At the same time, systems are also becoming increasingly integrated. Designing fast multipliers that occupy smaller areas on the integrated circuit (IC) chip is critical to a commercially successful product. Therefore, what is needed is a multiplier circuit that is faster and smaller. What is further needed is a method of reducing reduction trees such that the resulting parallel multiplier circuit uses fewer levels of CSA components than conventional multipliers.
Accordingly, the present invention provides parallel multipliers that require one fewer stage of reduction than conventional multipliers. Accordingly, fewer adder components are required. The speed of the parallel multipliers of the present invention is also improved due to the fewer number of reduction stages.
The present invention also provides a method of constructing reduction trees in parallel multipliers to achieve fewer reduction stages. In one embodiment, the method of the present invention is applicable to signed multiplication and includes a step of generating an input matrix based on the operand sizes, data types, etc., of the multiplier; and determining a maximum number of nodes per column (K) of the input matrix. The method of the present embodiment also includes the step of generating a series of maximum stage capacity numbers xcex4N including xcex4L-2, xcex4L-1, and xcex4L where xcex4L-1xe2x89xa6Kxe2x89xa6xcex4L. The method of the present embodiment then performs a step of trial reduction on an input matrix. The trial reduction step is performed in an attempt to reduce the maximum number of nodes per column of the input matrix to less than or equal to xcex4L-2. If the trial reduction step is successful in reducing the input matrix, then reduction steps are performed to generate a reduction tree that has L-1 reduction stages. Otherwise, reduction steps are performed to generate a reduction tree that has L stages.
In comparison to conventional reduction methods, such as those proposed by Wallace and Dadda, the present invention generates reduction trees that may use one fewer level of reduction stages for the same operands. Multipliers having fewer reduction stages would therefore have a smaller delay and may also require a smaller die area.
Further, in one embodiment, the method of the present invention includes a step of xe2x80x9csqueezingxe2x80x9d as many LSBs (least significant bits) of the output matrix of the reduction tree to at most one node. According to the present invention, the single nodes at the output matrix can be directly provided to the output of the multiplier without going through the final adder. Thus, a smaller final adder would be needed. Benefits of the xe2x80x9csqueezingxe2x80x9d step include (1) speeding up the circuit and (2) reducing the size of the circuit. In furtherance of one embodiment of the invention, the bit adjacent to the xe2x80x9csqueezedxe2x80x9d LSBs may have up to three nodes. Significant savings in delay and die area can be achieved because the third node may be provided to a carry input of the final adder.
Embodiments of the present invention include the above and further include a computer-readable medium having contained therein computer-readable codes for causing a computer-implemented logic synthesis system to perform a method of constructing a parallel multiplier. The method of the present embodiment includes the steps of: (a) accessing an input matrix; (b) determining a depth K of the input matrix; (c) calculating maximum stage capacity numbers for the input matrix. In the present embodiment, the maximum stage capacity numbers includes {xcex4N} where xcex4N=[xcex40, . . . xcex4L-2, xcex4L-1, xcex4L]. In accordance with the present embodiment, xcex4L-1 denotes a maximum stage capacity at a stage (L-1) and xcex4L denotes a maximum stage capacity at a stage L. Further, in the present embodiment, xcex4L-1 less than Kxe2x89xa6xcex4L. The method of the present embodiment further includes the steps of: (d) performing a trial reduction using stage (L-2) as target stage; (e) provided that a depth of the trial reduction matrix does not exceed maximum stage capacity of the stage (L-2), generating a reduction tree with (L-1) reduction stages; and (f) generating a netlist of the parallel multiplier based on the reduction tree.