1. Technical Field
The present invention relates in general to multiplier/accumulator (MAC) circuits, and in particular to a system and method for simplifying the reduction process in a MAC architecture. More particularly, the present invention relates to a MAC architecture having a pre-reduction mechanism for reducing the number of processing stages in the critical MAC path.
2. Description of the Related Art
Multiply-accumulate (MAC) operations are frequently undertaken in computer systems because they lend themselves to multiplying matrices, which is common in digital signal processing and video/graphics applications. An accumulator within a MAC design is a binary computation device comprising an adder/subtractor and an accumulation register. A multiplier is added to the accumulator to form the MAC architecture. Most advanced digital systems employ MAC circuits to perform high-speed parallel multiply/add operations.
Due to its computation-intensive nature and the overhead hardware required, MAC is an costly operation in terms of design expense and speed. Any improvement in the delay for performing a MAC operation would have a positive impact on allowable clock speed, instruction time, and processor performance.
With reference to FIG. 1, there is depicted a conventional MAC circuit that includes multiply and accumulate functionality for performing addition and multiplication operations in parallel. Specifically, FIG. 1 illustrates a MAC circuit 10 for multiplying a 16-bit multiplicand, Y, by a 16-bit multiplicator, X. A pair of registers, 14 and 12, store multiplicand Y and multiplicator X respectively, until MAC circuit 10 is given a xe2x80x9cmultiplyxe2x80x9d instruction whereupon registers 12 and 14 deliver the operands to the multiplication circuitry within MAC circuit 10.
The multiplication functionality within MAC circuit 10 may be divided into two stages. The first stage includes a partial product generator 16 and a reduction circuit 25, wherein partial product generation and carry-save addition (reduction) are performed. The second stage includes a carry propagate adder 26 that adds a redundant-form accumulated sum within an accumulation register 33, to a final, non-redundant accumulated sum within a final accumulation register 28.
Partial product generator 16 processes multiplicand Y and multiplicator X utilizing a modified Booth algorithm (MBA) to produce a partial product matrix (PPM) such as PPM 32 illustrated in FIG. 3A. MBA is widely adopted for partial product generation in large (16-bits and greater) multipliers.
As further depicted in FIG. 1, partial product generator 16 includes a Booth encoder 20 and Booth multiplexors 18 for implementing the MBA. A major benefit of using a MBA for producing a PPM from partial product generator 16 is that instead of generating n partial products for an n-bit multiplication, the MBA generates approximately half of that.
In accordance with well-known MBA principles, a signed binary number in its 2""-complement form is partitioned into overlapping groups of three bits. By coding each of these groups, an n-bit signed binary number is represented as a sum of n/2 signed digits. Since each signed digit assumes the possible values of 0, xc2x11, or xc2x12, the required partial products are all power-of-two multiples of the multiplicand (Y in the depicted example), which are readily available. The logic required to implement the functionality of partial product generator 16 is well-known in the art and is therefore not explicitly depicted in FIG. 1.
The individual partial product terms (represented as dots within PPM 32) are generated by partial product generator 16. With reference to FIG. 3A, the result of the Booth recoding performed by partial product generator 16 is evidenced by the eight partial products within PPM 32 for a 16-bit multiplication.
As shown in FIG. 3A, PPM 32 forms the upper portion of a reduction array 30 that is applied as an input to a reduction tree within reduction circuit 25. The lower portion of reduction array 30 comprises the redundant (carry-save) form of an accumulated sum row 38 and an accumulated carry row 37 that were formed in the previous accumulation step and which are added to the partial products within PPM 32 from the current multiplication cycle to form the current redundant accumulator result.
Between PPM 32 and accumulated rows 38 and 37, an additional row 36 consists of constant values to ensure proper sign extensions for each row of partial products within PPM 32, and also to set the sign bit of the most significant partial product that is introduced to support two""s complementation required for negative partial products in the MBA.
Turning back to FIG. 1, reduction circuit 25 includes a compressor array 24 for compressing the bit rows constituting reduction array 30 such that PPM 32 and constant value row 36 are compressed into newly accumulated sum row 38 and carry row 37. For notational and illustrative convenience with reference to FIG. 3A, constant value row 36, together with the portions of sum and carry rows 38 and 37 enclosed by block 39 may be referred to collectively as block 39.
With reference to FIG. 1 and FIG. 3A, after PPM 32 has been produced by partial product generator 16, an array of compressor circuits within compressor array 24 are employed to reduce the height of reduction array 30 at its highest point at column 34, from eleven to two. The compressor array includes circuits for performing carry-save additions in order to compress the eight rows of PPM 32 and constant value row 36 into sum row 38 and carry row 37. Counter circuits or compressor circuits are utilized to add the rows of PPM 32, constant value row 36, and sum-carry rows 38 and 37 into a newly accumulated sum-carry redundant form.
For the examples presented in FIGS. 1 and 3A, three concatenated levels of 4:2 compressor circuits are required within compressor array 24 to reduce the eleven rows in reduction array 30 to a final redundant sum-carry result. Such 4:2 compressor circuits can be constructed with three levels of two-input XOR logic gates in the critical path. Each level of logic within compressor array 24 adds to the overhead and delay of MAC circuit 10.
Compressor array 24 is a hardware intensive component within the architecture of MAC 10. The number of rows (column depth) is a major limiting factor in minimizing the required number of logic gate levels within compressor array 24 and maximizing the overall speed of MAC 10. It would therefore be useful to provide a system and method for reducing hardware overhead and accelerating the reduction process in a MAC circuit by pre-reducing the reduction array.
An apparatus and method for compressing a reduction array into an accumulated carry-save sum are disclosed herein. The reduction array includes a partial product matrix, a carry-save sum, and a constant value row. A compressor array generates a previous accumulated carry-save sum. A three-input/two-output carry-save adder pre-reduces the constant value row and the previously accumulated carry-save sum into a two-row intermediate carry-save sum that is added to the partial product matrix to form a current accumulated carry-save sum.
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.