1. Field of the Invention
The invention relates to arithmetic circuits in computer and digital signal processing systems, and more specifically, to multiplier circuits used for performing high-speed multiplications.
2. Description of the Related Art
Multipliers are one of the basic circuits of digital arithmetic. The speed at which a multiplier can deliver the product of two binary numbers becomes critical in certain applications where repetitive multiplications are required. Applications requiring repetitive multiplications include various digital signal processing functions, such as Finite Impulse Response (FIR) filters, and 3D rendering. Such applications require both high throughput and fast response time. The design of multipliers employed in these applications can have a significant effect on overall application performance.
Since multiplication is essentially repeated addition, it stands to reason that digital multipliers rely heavily on adder circuits. Commonly used adder circuits include the half adder, the full adder, and the carry-lookahead adder. The half adder takes two 1-bit inputs and returns two outputs, a sum bit and a carry bit. A full adder returns the same outputs, but it has an extra input, known as a carry-in. The carry-in input is configured to receive a carry-out bit from an addition of lower-order bits. Because of the carry-in, full adders can be cascaded to allow the addition of numbers larger than one bit. An adder formed by cascading several full adders is known as a ripple carry adder.
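As an illustration only, the adder circuits described above can be modeled behaviorally in Python. This is a minimal sketch and not circuitry from the invention; the function names half_adder, full_adder, and ripple_carry_add are hypothetical, and operands are assumed to be unsigned integers.

```python
def half_adder(a, b):
    """Return (sum, carry) for two 1-bit inputs."""
    return a ^ b, a & b


def full_adder(a, b, carry_in):
    """Return (sum, carry_out) for two 1-bit inputs plus a carry-in."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2


def ripple_carry_add(x, y, width):
    """Add two unsigned integers by cascading `width` full adders.

    The carry-out of each bit position feeds the carry-in of the next,
    which is the sequential propagation that makes this adder slow.
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

In this model, the loop makes the sequential dependency explicit: bit position i cannot be resolved until the carry from position i-1 is known.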
One problem with ripple carry adders is that a carry generated at the lowest-order bit position must be propagated through each subsequent bit position in a sequential manner. Such propagation adds a significant amount of time to the addition process. One solution to this problem is the carry-lookahead adder (CLA). In a CLA, the carry-in bit is presented to each bit position in the adder, and is combined with the operand bits to either generate or propagate a carry. Therefore, the carry-in bit is not required to propagate through multiple stages sequentially as in a ripple carry adder. A CLA requires more circuitry than a ripple carry adder. However, since the carry is not required to ripple through each stage sequentially, a CLA can perform additions at a significantly greater speed.
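The generate and propagate logic of a 4-bit CLA may be sketched as follows. This is a behavioral illustration under the usual textbook formulation (generate = a AND b, propagate = a XOR b), not the specific circuit of any embodiment; the name cla4 is hypothetical.

```python
def cla4(a, b, carry_in=0):
    """Behavioral model of a 4-bit carry-lookahead adder.

    Returns (4-bit sum, carry_out). Every carry is written directly in
    terms of the operand bits and carry_in, so no carry depends on the
    carry of the previous bit position.
    """
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]  # propagate
    c0 = carry_in
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i  # sum bit = propagate XOR carry
    return total, c4
```

The widening expressions for c1 through c4 show the extra circuitry that a CLA trades for speed.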
Parallel array multipliers are a commonly used multiplier circuit in systems where increased performance is required. In one type of parallel array multiplier, the first step performed is the formation of a bit-product matrix. A bit-product matrix is simply an array of bit-products formed by multiplication of the individual bits of the two numbers being multiplied, a multiplicand and a multiplier. Formation of a bit-product matrix may become complicated in certain situations, such as multiplying signed numbers. In such cases, a specialized method for bit-product matrix formation may be required. Two common methods of bit-product matrix formation are the Baugh-Wooley method (as described in U.S. Pat. No. 3,866,030), and the Hatamian-Cash method. FIG. 1 is an illustration of a bit-product matrix formed by an 8-bit multiplicand and a 4-bit multiplier using the Baugh-Wooley method.
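For unsigned operands, the formation of a bit-product matrix may be sketched as below. This is a simplified illustration only: it does not implement the Baugh-Wooley or Hatamian-Cash methods, which add handling for signed numbers. The function names and the row-list representation are assumptions made for the example.

```python
def bit_product_matrix(multiplicand, multiplier, m_bits, n_bits):
    """Form an unsigned bit-product matrix.

    Row j holds the AND of every multiplicand bit with multiplier bit j.
    Entry rows[j][i] carries the binary weight 2**(i + j).
    """
    rows = []
    for j in range(n_bits):
        y_j = (multiplier >> j) & 1
        rows.append([((multiplicand >> i) & 1) & y_j for i in range(m_bits)])
    return rows


def sum_matrix(rows):
    """Add every bit-product at its weight to recover the product."""
    return sum(bit << (i + j)
               for j, row in enumerate(rows)
               for i, bit in enumerate(row))
```

With an 8-bit multiplicand and a 4-bit multiplier, this unsigned sketch yields four rows of eight bit-products each, and summing the weighted entries reproduces the ordinary product.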
After the formation of a bit-product matrix, many multipliers simply add the rows of the matrix to obtain the final product. However, the efficiency of this process suffers as the number of bits in the multiplier and multiplicand becomes larger. One solution to this problem is to use a reduction scheme. Luigi Dadda proposed several such schemes in his paper entitled "Some Schemes for Parallel Multipliers" (1965). Each of these schemes, referred to as Dadda reduction schemes, employs combinational parallel counter circuits (not to be confused with sequential, or clocked, counter circuits). These parallel counters are used to reduce the number of rows until only two rows remain, a sum row and a carry row. The sum row and carry row are then added to form the final product of the multiplication. A multiplier employing a reduction scheme will typically be significantly faster than one that simply adds the rows of the bit-product matrix.
The reduction of a bit-product matrix is accomplished in a number of steps. For example, in one reduction scheme, a bit-product matrix formed from two 8-bit numbers using the Baugh-Wooley method will produce a matrix having ten rows. Reduction of this matrix will require five steps using a Dadda reduction scheme. The first step of the reduction will involve receiving the ten-row matrix and reducing it to nine rows. The second step will reduce nine rows to six rows. The third step of the reduction reduces the matrix from six rows to four rows, the fourth step from four rows to three rows, and the fifth step from three rows to two rows. In a typical reduction unit, the entire reduction is performed in one action using combinational logic.
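The row counts in the example above follow the commonly cited Dadda height sequence 2, 3, 4, 6, 9, 13, and so on, in which each term is the previous term multiplied by 1.5 and rounded down. The following sketch computes the target row count after each step for a given starting height; the function name dadda_stage_heights is hypothetical.

```python
def dadda_stage_heights(initial_rows):
    """Return the row count remaining after each Dadda reduction step.

    The targets come from the sequence d(1) = 2, d(j+1) = floor(1.5 * d(j)),
    using every term that is strictly smaller than the starting height.
    """
    if initial_rows <= 2:
        return []  # already two rows or fewer, so no reduction is needed
    heights = [2]
    while heights[-1] * 3 // 2 < initial_rows:
        heights.append(heights[-1] * 3 // 2)
    return heights[::-1]
```

For a ten-row matrix, dadda_stage_heights(10) returns [9, 6, 4, 3, 2], which matches the five steps described above.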
The final two rows are then added to form the final product of the multiplication.
Typically, the addition of the final two rows is performed by cascading several adders together. For example, if the final product is to be 16 bits wide, the final two rows may be added by cascading four 4-bit CLAs. The CLA circuits will add the two rows, four bits at a time, from the lowest order bits to the highest order bits.
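The cascading of four 4-bit adders to add two 16-bit rows may be sketched as follows. Here each 4-bit block is modeled arithmetically rather than at the gate level, which is an assumption made for brevity; the names add4 and cascaded_add16 are hypothetical.

```python
def add4(a, b, carry_in):
    """Behavioral stand-in for one 4-bit adder block: (sum, carry_out)."""
    total = a + b + carry_in
    return total & 0xF, total >> 4


def cascaded_add16(row_a, row_b):
    """Add two 16-bit rows four bits at a time, lowest-order block first.

    The carry-out of each 4-bit block becomes the carry-in of the next.
    """
    carry = 0
    result = 0
    for block in range(4):
        shift = 4 * block
        nibble, carry = add4((row_a >> shift) & 0xF,
                             (row_b >> shift) & 0xF,
                             carry)
        result |= nibble << shift
    return result, carry
```

Each pass of the loop corresponds to one of the four cascaded adders, so a carry may cross up to three block boundaries before the highest-order bits are final.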
FIG. 2 is a block diagram of a parallel array multiplier employing a reduction scheme. The multiplier is configured to receive, at the bit-product matrix unit, two binary numbers: a multiplier and a multiplicand. These numbers can have any number of bits, but will typically have 8 bits, 16 bits, or another integral power of two. The multiplier circuit forms a bit-product matrix, which is then reduced to a two-row matrix in the reduction unit. The two rows of this matrix are then added in the addition unit. The final output of the addition unit is the product of the multiplication.
Although the bit-product matrix formation, reduction, and addition are shown as separate blocks in the figure, the internal logic of the multiplier in FIG. 2 is combinational logic, so the entire multiplication is performed in one clock cycle.
Since the multiplier of FIG. 2 performs the entire multiplication in one clock cycle, the clock cycle must be long enough to allow all operations to complete before beginning a new multiplication. This can have a limiting effect on the clock speed due to the large amount of combinational logic used. This problem is compounded for larger operands, as additional steps of reduction require additional levels of logic, resulting in a higher gate delay. The fact that the multiplier can perform only one multiplication at a time limits throughput even further. It would be desirable to create a multiplier circuit that would allow for increased throughput, and thus higher performance. One way to achieve higher throughput is with a faster clock cycle. Thus, it would also be desirable to create a multiplier circuit with a faster clock cycle.
The problems outlined above may in large part be solved by a fully pipelined parallel multiplier with a fast clock cycle, as described herein. In one embodiment, a pipelined parallel multiplier circuit utilizes each step of both the reduction process and the addition process as pipeline stages. Circuits within the multiplier include a d-type latch circuit, a half-adder circuit, a full adder circuit, and a 4-bit carry-lookahead adder (CLA) circuit. Each of these circuits is configured to generate and/or receive required logic signals and their corresponding complements. The use of these circuits enables the individual stages of a reduction scheme and an addition scheme to be used as pipeline stages. The d-type latch circuits are particularly important, as they are used to latch results from stage to stage within the multiplier, and thus dominate the hardware complexity of the multiplier. The overall scheme is generally applicable to any combination of a bit-product matrix formation and reduction scheme.
In one particular embodiment, the fully pipelined parallel multiplier includes a bit-product matrix unit, a reduction unit, and an addition unit. The first stage of the multiplier pipeline is the bit-product matrix unit. This unit is configured to receive two binary numbers, a multiplicand and a multiplier. Typically, these two numbers will be of the same bit size (i.e. the same number of bits), although other embodiments may multiply two numbers of different bit sizes. The bit-product matrix unit then forms a matrix of partial products based on these two numbers, known as a bit-product matrix. A reduction unit, configured to receive the bit-product matrix, will then begin to reduce the matrix until only two rows remain.
The reduction of the bit-product matrix is accomplished in stages, and each of these stages is utilized as an individual pipeline stage. When the intermediate results of a first multiplication operation complete a particular stage, the stage may begin to process the intermediate results of a second multiplication operation. The number of stages in the reduction is dependent upon the particular scheme used and the size of the operands that the bit-product matrix unit is configured to receive. For example, if the bit-product matrix unit is configured to receive two 8-bit numbers, the reduction will require five different stages for one particular reduction scheme. In general, the reduction scheme will include M stages of reduction, where M is logarithmically proportional to the word size of the operands, assuming both the multiplier and multiplicand are of the same bit size. The final stage of the reduction results in a matrix of two rows, typically one row of sum bits and one row of carry bits. This two-row matrix is then received by the addition unit.
The addition unit adds the two rows of the received matrix, utilizing one or more cascaded adders. Each stage of the addition is also used as a pipeline stage. The number of required stages is dependent upon both the size of the operands and the size of the adders used. For example, if two 8-bit numbers are multiplied, they will produce a 16-bit product. By cascading four 4-bit CLAs, the final addition can be performed in four stages. The output of the final stage of the addition is the final product of the multiplication.
Thus, in various embodiments, the fully pipelined parallel multiplier may provide an advantageous increase in performance. By performing each step of the reduction and addition processes as pipeline stages, the clock cycle for each stage can be shortened. Furthermore, multiplications can be pipelined. This may result in an advantageous increase in throughput. The multiplier need not complete a multiplication before beginning another.
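The throughput benefit of pipelining may be illustrated with a simple cycle-count model. This sketch is an assumption-laden illustration, not a model of the actual circuit: it treats each stage as taking exactly one clock cycle, accepts one new operand pair per cycle, and computes the product arithmetically when an operation leaves the last stage. The function name simulate_pipeline is hypothetical. For the 8-bit example, ten stages are assumed (one bit-product matrix stage, five reduction stages, and four addition stages).

```python
def simulate_pipeline(operand_pairs, n_stages):
    """Return a list of (completion_cycle, product) for each operation.

    stages[k] holds the operand pair occupying pipeline stage k during
    the current cycle, or None if that stage is idle.
    """
    stages = [None] * n_stages
    pending = list(operand_pairs)
    finished = []
    cycle = 0
    while pending or any(s is not None for s in stages):
        cycle += 1
        # Advance every operation by one stage and admit a new one.
        stages = [pending.pop(0) if pending else None] + stages[:-1]
        done = stages[-1]
        if done is not None:
            finished.append((cycle, done[0] * done[1]))
            stages[-1] = None  # the result has left the pipeline
    return finished
```

In this model, the first product appears after n_stages cycles and each later product appears one cycle after the previous one, so N multiplications complete in n_stages + N - 1 short cycles rather than N long ones.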