Multiplication is one of the primary operations in general-purpose microprocessors and special-purpose digital signal processors. The speed of the multiplication operation often directly determines how fast a processor can operate. Generally, a multiplier generates a plurality of partial products and then sums them to compute a product, so the performance of the summation strongly influences the performance of the multiplier. As technology progresses, users demand increasingly fast data processing systems. To satisfy this demand, the multiplier circuit grows in complexity and, as a result, occupies an increasingly large area of the central processing unit (CPU) embedded in the data processing system. According to the “IEEE-754 Floating Point specification”, which can be downloaded from the web site http://standards.ieee.org/, contemporary multipliers must be capable of performing a multiplication operation for 54 bits of input data. Thus, a significant amount of circuit area is required to implement a multiplier for input data of large bit width.
Several techniques have been developed to improve the performance of the multiplier. Generally, multipliers with the desired capability have array structures or Wallace tree structures. A fast array multiplier is composed of a Booth encoder, a partial-product summation tree (PPST), and a final adder. The PPST is responsible for a significant portion of the total multiplication delay, and is commonly implemented as a Wallace tree with 3-2 counters (i.e., full adders) and 4-2 compressors as basic cells. Hence, the design of these counters and compressors is a significant factor in improving the performance of the multiplier. The structure of the 4-2 compressor is as follows.
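Before turning to the compressor itself, the partial-product summation described above can be illustrated with a minimal behavioral sketch. It assumes unsigned operands and plain (non-Booth) partial products, and the function names `partial_products`, `counter_3_2`, `reduce_rows` and `multiply` are illustrative only; it is not a model of any particular circuit in the figures.

```python
def partial_products(a, b, width):
    """One shifted copy of a for every set bit of b (no Booth encoding)."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def counter_3_2(x, y, z):
    """Row-wise 3-2 counter: three rows in, a sum row and a carry row out."""
    sum_row = x ^ y ^ z
    carry_row = ((x & y) | (x & z) | (y & z)) << 1  # carries move up one bit
    return sum_row, carry_row


def reduce_rows(rows):
    """Apply 3-2 counters level by level until at most two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(counter_3_2(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fill a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows


def multiply(a, b, width):
    """Partial products -> summation tree -> final adder."""
    return sum(reduce_rows(partial_products(a, b, width)))
```

Each level of `reduce_rows` costs one counter delay regardless of operand width, which is why the delay of the basic cell matters so much to the overall multiplier.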
FIG. 1 is a block diagram illustrating a conventional 4-2 compressor 10. The 4-2 compressor 10 is described in “Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers”, by Shen-Fu Hsiao et al., Feb. 19, 1998, ELECTRONICS LETTERS, Vol. 34, No. 4, pp. 341-343. The 4-2 compressor 10 can be found in FIG. 3a of the paper.
Referring to FIG. 1, the 4-2 compressor 10 is composed of two cascaded 3-2 counters 12 and 14. Each 3-2 counter can be realized using two multiplexers. When four input data I1, I2, I3 and I4 are externally input to the 4-2 compressor 10, the first 3-2 counter 12 generates a carry-out bit Cout and a sum bit S in response to the three input data I1, I2 and I3. The carry-out bit Cout is output to a 4-2 compressor at the next stage (not shown) connected to the current 4-2 compressor 10, and the sum bit S is output to the second 3-2 counter 14. The second 3-2 counter 14 generates a carry bit Carry and a sum bit Sum as final outputs, in response to the sum bit S, the input data I4, and a carry-input bit Cin from a 4-2 compressor at the previous stage (not shown). Thus the compressor 10 receives the four input data I1, I2, I3 and I4 and generates two final outputs Carry and Sum, which is why it is referred to as a “4-2 compressor”. The 4-2 compressor 10 uses four multiplexers in total. Assuming that the delay of one multiplexer is 1 MUX, the 4-2 compressor 10 has a delay of 4 MUXs.
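The cascade of two 3-2 counters described above can be checked with a short behavioral model. This is a sketch of the logic function only: it assumes each 3-2 counter behaves as an ordinary full adder and does not model the multiplexer-level implementation or its delay. The function names are illustrative.

```python
from itertools import product


def counter_3_2(a, b, c):
    """Full adder: returns (carry, sum) for three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, s


def compressor_4_2(i1, i2, i3, i4, cin):
    """Two cascaded 3-2 counters, following the description of FIG. 1."""
    cout, s = counter_3_2(i1, i2, i3)          # first counter 12
    carry, sum_bit = counter_3_2(s, i4, cin)   # second counter 14
    return cout, carry, sum_bit


def check_all():
    """Verify I1+I2+I3+I4+Cin == Sum + 2*(Carry + Cout) for all 32 inputs."""
    for bits in product((0, 1), repeat=5):
        cout, carry, sum_bit = compressor_4_2(*bits)
        if sum(bits) != sum_bit + 2 * (carry + cout):
            return False
    return True
```

Note that `cout` depends only on I1, I2 and I3, not on `cin`, so carries do not ripple along a row of compressors at the same level.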
FIG. 2 is a circuit diagram for illustrating a conventional double-pass transistor-logic (DPL) 4-2 compressor 20 to reduce the delay of the 4-2 compressor 10, and FIG. 3 is a detailed circuit diagram for illustrating a dual-railed multiplexer shown in FIG. 2. The DPL 4-2 compressor 20 is described in FIG. 3b of the Hsiao paper cited above. Referring to FIGS. 2 and 3, the 4-2 compressor 20 reduces the delay from 4 MUXs to 3 MUXs, through the use of pass-transistor multiplexers (PTMs) 21-25 and 27 composed of CMOS transmission gates (TGs). The detailed structure of the 4-2 compressor 20 is as follows.
The 4-2 compressor 20 generates two carry bits Cout and Carry, and a sum bit Sum, in response to four externally supplied input data I1, I2, I3 and I4, and a carry-input bit Cin from the previous 4-2 compressor (not shown). The 4-2 compressor 20 is constructed with dual rails, which is suitable for high-speed operation. The 4-2 compressor 20 is referred to as a double-pass transistor-logic (DPL) 4-2 compressor.
At two internal nodes N1 and N2, the 4-2 compressor 20 has eight gates to be driven by the outputs of the pass-transistor multiplexers (PTMs) composed of CMOS transmission gates (TGs). For example, assuming that the gate load capacitance of a P-channel MOS transistor is Cgp and the gate load capacitance of an N-channel MOS transistor is Cgn, the internal load capacitance at the node N1 is 4Cgp+4Cgn, and the internal load capacitance at the node N2 is likewise 4Cgp+4Cgn. In this case, the total internal load capacitance of the 4-2 compressor 20 becomes 8Cgp+8Cgn. Generally, the PMOS transistor is twice the size of the NMOS transistor, so its load capacitance is approximately doubled, which is expressed as 1Cgp≈2Cgn. Therefore, expressed in terms of the NMOS transistor, the total internal load capacitance of the 4-2 compressor 20 is 24Cgn.
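The conversion to NMOS-equivalent units can be written out step by step. The symbols C_{N1}, C_{N2} and C_{total} are introduced here only for illustration, and the approximation C_{gp} ≈ 2C_{gn} is the sizing assumption stated above.

```latex
\begin{align*}
C_{N1} &= C_{N2} = 4C_{gp} + 4C_{gn} \\
C_{total} &= C_{N1} + C_{N2} = 8C_{gp} + 8C_{gn} \\
          &\approx 8\,(2C_{gn}) + 8C_{gn} = 24C_{gn}
\end{align*}
```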
Generally, pass-transistor logic is intended for passing a signal through the source of a transistor rather than for driving gates. Because the nodes N1 and N2 must drive gates, it is difficult for the 4-2 compressor 20 to achieve the low power and high speed expected of pass-transistor logic. Moreover, the power consumption and the area of the 4-2 compressor 20 increase in proportion to the additional wiring required by the dual-rail structure.
To solve these problems, a reduced internal capacitance (RIC) 4-2 compressor 30 is used, as illustrated in FIG. 4, which is described with reference to FIG. 3c of the Hsiao paper cited above. FIG. 5 is a detailed circuit diagram illustrating an XOR/XNOR logic element shown in FIG. 4. Referring to FIGS. 4 and 5, the 4-2 compressor 30 comprises multiplexers 33 and 37 composed of transmission gates, and XOR/XNOR logic elements 31, 32, 34 and 36. The multiplexers 33 and 37 are coupled to the output ports of the carry bits Cout and Carry, and the XOR/XNOR logic elements 31, 32, 34 and 36 are used for generating a sum bit Sum. Each of nodes N3 and N4 of the 4-2 compressor 30 has an internal load capacitance of 3Cgp+3Cgn. Thus, the total internal load capacitance of the 4-2 compressor 30 becomes 6Cgp+6Cgn, which is 18Cgn when expressed in terms of the NMOS transistor. This 4-2 compressor 30 is referred to as a reduced-internal-capacitance (RIC) 4-2 compressor.
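The two capacitance totals quoted for the DPL and RIC compressors can be compared with a few lines of arithmetic. This sketch assumes the same 1Cgp≈2Cgn sizing ratio used in the text; the helper name `nmos_equivalent` and its parameter names are illustrative.

```python
def nmos_equivalent(n_cgp, n_cgn, p_to_n_ratio=2):
    """Convert a load of n_cgp*Cgp + n_cgn*Cgn into units of Cgn.

    p_to_n_ratio is the assumed Cgp/Cgn ratio (2 in the text above).
    """
    return n_cgp * p_to_n_ratio + n_cgn


dpl_total = nmos_equivalent(8, 8)   # DPL compressor 20: 8Cgp + 8Cgn
ric_total = nmos_equivalent(6, 6)   # RIC compressor 30: 6Cgp + 6Cgn
reduction = (dpl_total - ric_total) / dpl_total  # fractional saving
```

Under this assumption the RIC structure carries one quarter less internal load capacitance than the DPL structure.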
However, the nodes N3 and N4 are used both for driving the gates of the multiplexers 33 and 37 and as sources of the XOR/XNOR logic elements 31, 32, 34 and 36, so that the gate-driving capability at these nodes is relatively decreased. In other words, although the internal load capacitance is reduced, the capability of driving gates is also reduced. Thus the RIC 4-2 compressor 30 is inferior to the DPL 4-2 compressor 20 in operating speed. In addition, the power consumption and the area of the RIC 4-2 compressor 30 increase in proportion to the additional wiring required by the dual-rail structure.