This invention relates generally to digital signal processing and more particularly to fast multiply accumulation for use in digital signal processing.
As is known, high-speed digital signal processing architectures require high-speed, high-throughput multiply accumulation hardware. This is especially true for broadband communication systems, where high-speed digital filters with wide dynamic range must be efficiently implemented. Typically, current multiply accumulation hardware represents a trade-off between computation latency, chip area complexity and throughput. Such multiply accumulation designs have four main elements: a partial product generator, a partial product reduction circuit, a carry propagation adder, and an accumulator (e.g., an adder). The partial product generator is operably coupled to receive first and second multiplicands and to produce a plurality of partial products. The partial product reduction circuit receives the plurality of partial products and reduces them into first and second reduced partial products. The carry propagation adder receives the first and second reduced partial products and produces therefrom a first resultant. The first resultant is accumulated with a previous resultant via the accumulator to produce an accumulated result. The accumulated result is stored in a register and is, in turn, accumulated with a subsequent resultant from the carry propagation adder.
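The four elements described above can be illustrated with a short behavioral sketch in Python. It is not taken from this document and is not a hardware description: the function names, the unsigned AND-array partial product scheme, the 8-bit default width, and the simple grouping of operands into 3:2 compressors are all assumptions made for illustration.

```python
def partial_products(a, b, width):
    """Partial product generator (assumed unsigned AND-array):
    one shifted copy of a for each set bit of b, zero otherwise."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def carry_save_add(x, y, z):
    """Word-wide 3:2 compressor: x + y + z == s + c, with no carry rippling."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_to_two(operands):
    """Partial product reduction: compress until two operands remain."""
    ops = list(operands)
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3      # operands that form complete triples
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[full:])              # leftovers pass to the next level
        ops = nxt
    ops += [0] * (2 - len(ops))
    return ops[0], ops[1]


def conventional_mac(pairs, width=8):
    """Multiply accumulate with a separate accumulator adder."""
    acc = 0                                  # accumulator register
    for a, b in pairs:
        r1, r2 = reduce_to_two(partial_products(a, b, width))
        resultant = r1 + r2                  # carry propagation adder
        acc = acc + resultant                # accumulator (a second adder)
    return acc
```

In this sketch every multiply accumulate step performs two carry propagating additions, one to form the resultant and one to accumulate it, which is the cost that the arrangements discussed in the following paragraphs try to reduce.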
To improve the performance capabilities of a multiply accumulator, an array multiplier may be used to produce the partial products and sum the partial products simultaneously. Such an architecture provides the advantages of efficiency in integrated circuit layout and ease of pipelining by using registers at optimal points in the array multiplier. As is known, pipelining provides a high-throughput multiply accumulate circuit, but at the expense of area and pipeline start-up delays.
The partial product reduction circuit may be implemented as a Dadda or Wallace reduction tree module, which receives the plurality of partial products and the current multiply accumulate resultant. From these inputs, the partial product reduction circuit produces two output operands, which are summed by an adder and stored in an output register. As such, by feeding the accumulated output back into the reduction tree circuit, the multiply accumulate function occurs within the reduction circuit. This allows a single-cycle multiply accumulate to occur. While this implementation allows a multiply accumulate resultant to be produced in a single clock cycle, the overall speed of the multiply accumulator is limited by the speed of the partial product generation circuit and the carry propagate adder. As such, the larger the multiply accumulate function, the slower the operating cycle needs to be for the multiply accumulate circuit.
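The feedback arrangement described above can be sketched behaviorally as follows. This is an illustrative model only, under stated assumptions: the names `csa`, `reduce_tree`, and `merged_mac` are hypothetical, the partial products are an unsigned AND-array, and the tree groups operands into 3:2 compressors in simple list order rather than following an actual Dadda or Wallace wiring. The level count returned by `reduce_tree` is a rough stand-in for tree delay, not a timing model.

```python
def csa(x, y, z):
    """Word-wide 3:2 compressor: x + y + z == s + c."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def reduce_tree(operands):
    """Compress operands down to two; also return the number of
    compressor levels used (a proxy for the depth of the tree)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3      # operands that form complete triples
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[full:])              # leftovers pass to the next level
        ops = nxt
        levels += 1
    ops += [0] * (2 - len(ops))
    return ops[0], ops[1], levels


def merged_mac(pairs, width=8):
    """Multiply accumulate with the accumulated output fed back into the tree."""
    acc = 0                                  # output register
    for a, b in pairs:
        pps = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
        # The previous accumulated result is just one more tree operand.
        r1, r2, _ = reduce_tree(pps + [acc])
        acc = r1 + r2                        # the only carry propagate add per cycle
    return acc
```

Under these assumptions, an 8-bit multiply with feedback presents nine operands to the tree and a 16-bit multiply presents seventeen, and the number of compressor levels grows accordingly. That growth, together with the wider final adder, gives a feel for why a larger multiply accumulate function calls for a slower operating cycle.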
Therefore, a need exists for a method and apparatus for a fast multiply accumulation circuit that is not limited by the speed of the partial product reduction circuit as in existing multiply accumulators.