1. Field of the Invention
The invention relates to the field of logic devices. More specifically, the invention relates to the field of programmable logic devices.
2. Background Information
Fast multiplication and addition are key arithmetic operations in digital signal processing (DSP), as well as other forms of computer data processing. In DSP especially, it is often necessary to multiply several pairs of numbers and accumulate the results by addition into a single number. Mathematically, this operation is called a xe2x80x9cdot productxe2x80x9d or xe2x80x9csum of products.xe2x80x9d It can be written a1*b1+a2*b2+ . . . +an*bn, where the ai and bi sequences are paired up, and each corresponding element is multiplied, with the results accumulated. In a typical digital filter, the first sequence may be a fixed sequence of filter coefficients, while the second sequence may be a contiguous set of data samples from a longer input sequence.
Some application-specific examples of constant coefficient sums of products over low resolution data include: multi-tap constant coefficient FIR filters such as pulse shaping filters, especially over low precision data; fixed-pattern correlators (matched filters), despreaders, etc. Other applications that involve higher precision data include: fast block transforms, such as DCT, FFT, wavelets, etc.; IP checksums (reduction of 64 bits/cycle in the main body of an IP packet), etc.
The precision requirements for these multiplication and addition operations can vary tremendously, as can the desired representations of the numbers involved. For example, in some applications it is desired to use floating point number representations; in others, the fixed point representation is sufficient and is more cost effective. Among fixed point representations, the number of integral and fractional digits can vary, as can the total number of digits. Additionally, the numbers may be signed or unsigned. Beyond the data representations themselves, certain details of the processing operations are important. For example, multiplication and addition operations produce outputs with a greater number of digits than their inputs. Thus, when such operations are composed, the number of digits in the results can grow dramatically. Commonly, the exact results include digits that do not represent useful information, so some digits are discarded using truncation and rounding. The art of discarding digits that are not useful is both important and complex.
The precision requirements for the multiplication and addition operations are generally related to: the precision of the input data; the precision of the coefficients; the type of processing algorithm; and certain parameters of that algorithm such as how truncation and rounding are performed. The analysis of these requirements is sufficiently complex that a whole branch of mathematics, known as Numerical Analysis, has been developed for them.
In response to the widespread need for fast multiplication and addition with a variety of precisions and data representations, an extensive literature has been created and many hardware and software implementations have been developed. For most implementations, the complexity increases roughly as N*M where N and M are the number of bits of the two input operands. Thus, for N by N multiplication, the complexity increases as N2. Algorithms are known that reduce this complexity for very large operands, but for most applications, the operand sizes are not large enough to make these algorithms practically useful. On the other hand, many ideas have been developed that do effectively exploit properties of hardware technologies and multiplication algorithms to speed up implementations having a particular precision and numerical representation.
The straightforward approach to multiplication is adding up a set of appropriately shifted partial products, each generated by multiplying the multiplicand by one of the digits of the multiplier. The only difficulty about addition is carries between digits, since the carry out from a particular digit depends on the carry into that digit, so that the carry propagation aspect of addition is inherently sequential. Since it is possible that a carry may propagate across all the digits of a sum, the number of sequential steps required for the addition is equal to the number of digits being added. Many techniques are known for reducing the maximum number of sequential steps requires for the addition; however these techniques generally require more hardware.
Many hardware designs for fast multiplication embody an extended version of the straightforward multiplication algorithm, consisting of a first part that generates partial products, a second part that sums the partial products to two numbers (referred to as xe2x80x9ccarryxe2x80x9d or xe2x80x9cCxe2x80x9d and xe2x80x9csavexe2x80x9d or xe2x80x9cSxe2x80x9d) whose sum is the correct answer, and a third part that adds together C and S to produce the answer. The partial product generation may include any form of multiplicand preprocessing, such as Booth encoding. The numbers C and S are developed in such a way that carry propagation is largely or completely avoided during the second part. The apparatus implementing the second part is generally known as a xe2x80x9cCarry Save Adder,xe2x80x9d sometimes abbreviated xe2x80x9cCSA.xe2x80x9d Carry propagation is unavoidable during the third part of the multiplication algorithm, but only two numbers are then involved, and any of the known techniques can be used to speed up the addition. The third part of this multiplication algorithm is also called the xe2x80x9cCarry Propagate Adder,xe2x80x9d sometimes abbreviated xe2x80x9cCPAxe2x80x9d.
The variations among hardware multiplier designs of this type generally involve one or more of the following: the method for generating partial products, the method for reducing them to numbers C and S, the method for performing the final addition of numbers C and S, and the method for modifying the partial products and/or carry save adder to accommodate signed number representations.
More generally, systems applications may use several of the DSP algorithms that were just briefly described, and may use other algorithms involving multiplications and additions as well. Depending on the total throughput required by the application, it may be necessary to provide dedicated hardware multiplication and addition circuits for each operation through which data flows in fixed connection patterns, or on the other hand, it may be possible to reuse one or more hardware multiplication and addition circuits with data flows directed by a control element.
When designing dedicated hardware for performing sums of products, the hardware designer will exploit certain simplifications resulting from zero bits in the (constant) coefficients. A designer/architect will use other techniques to minimize the number of non-zero bits in the coefficients. For example, with 1-bit data interpreted as 1 and xe2x88x921, an FIR filter reduces to adding and subtracting the coefficients according to the state of each data bit. The result is that multiplications involving constant coefficients are much more efficient in dedicated hardware as opposed to xe2x80x9cgeneral purposexe2x80x9d multiplication hardware. However, programmable CPU and DSP hardware typically provides only general purpose multiplies.
In particular, CPU and DSP circuits usually implement complex numerical algorithms by the sequential composition of simpler operations into and out of register files that store intermediate results, coefficients, and so on. For example, a sum of squares of differences algorithm may be implemented by a first operation that takes the difference of two numbers, a second operation that squares the result, and a third operation that accumulates the result of the second operation into a running sum. In case each operation takes a single cycle, the algorithm would then be completed in 3 cycles.
However, since the inherent complexity of multiplication is higher than that of addition or subtraction, hardware designers often optimize the clock speed of their designs by pipelining the multiplication operation, so that it completes after more than one cycle. For example, a particular design might complete in 3 cycles but allow a new multiply operation to be started on every cycle. In such a case, the sum of squares of differences algorithm discussed earlier might complete in 5 cycles. Depending on certain details of the hardware design, it might be possible to overlap the calculation for the next pair of vector elements so that on average, each pair of elements would be subtracted, squared, and summed in 3 cycles.
While many design variations are possible that involve more or less parallel hardware, it is generally desirable for designs to require as few cycles as possible to complete an algorithm; this is especially important for the most widely used algorithms, including those mentioned in the foregoing. If a first design uses K cycles while a second uses L greater than K, then the first design is also more cost-effective if it uses less than L/K as much hardware. Implementers of systems applications are generally desirous of designs that are fast, cost-effective, and reconfigurable.
A method and apparatus for a sums of products datapath is described. According to one embodiment of the invention, an apparatus has a number of inputs and a number generation units. Each of the generation units is coupled to the inputs. Each of the generation units includes a separate selection circuit coupled to each one of the inputs to selectively pass the signal provided on that input. In addition, each of the generation units includes a number of reduction circuits having inputs coupled to mutually exclusive pluralities of the selection circuits and each having an output. The apparatus also includes a first and second summation circuit coupled to the output of the plurality of reduction circuits in mutually exclusive pluralities of generation units. Additionally, the apparatus includes a subtraction circuit coupled to an output of the first and second summation circuit.