1. Field of the Invention
The invention relates to the field of hardware designs for performing arithmetic operations. More specifically, the invention relates to hardware designs for performing multiplication, sums of products, and the like.
2. Background Information
Fast multiplication and addition are key arithmetic operations in digital signal processing (DSP), as well as other forms of computer data processing. In DSP especially, it is often necessary to multiply several pairs of numbers and accumulate the results by addition into a single number. Mathematically, this operation is called a "dot product." It can be written $a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$, where the $a_i$ and $b_i$ sequences are paired up, and each corresponding element is multiplied, with the results accumulated. In a typical digital filter, the first sequence may be a fixed sequence of filter coefficients, while the second sequence may be a contiguous set of data samples from a longer input sequence. For example, the finite impulse response (FIR) filtering operation consists of performing the dot product operation with these sequences to produce an output sample, then "shifting" the input sequence in time by one, so that the earliest sample is no longer included while one later sample is appended. The dot product operation is then repeated on the new pair of sequences to form the next output sample, and so on.
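The FIR operation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dot` and `fir_filter` are hypothetical, the coefficients are paired with each window of samples in the order given (no time reversal), and outputs are produced only where a full window of samples is available.

```python
def dot(a, b):
    """Dot product a_1*b_1 + ... + a_n*b_n of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x * y for x, y in zip(a, b))


def fir_filter(coeffs, samples):
    """One dot product per output sample; the window shifts by one sample each time."""
    n = len(coeffs)
    return [dot(coeffs, samples[i:i + n]) for i in range(len(samples) - n + 1)]


print(fir_filter([1, 2, 3], [1, 0, 0, 2, 1]))  # prints [1, 6, 7]
```

Each output costs n multiplications and n - 1 additions, which is why the speed of the multiply and accumulate hardware dominates the throughput of such a filter.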
Many useful variations of this basic idea exist. For example, the filter coefficients may be changed at regular intervals, including as often as every output sample. This is commonly done, for example, in adaptive filtering, where an adaptation algorithm such as "least mean squared" (or "LMS") is used to change the filter coefficients. As another example, the input sequence may be shifted by more than one input sample between each output sample. When the filter coefficients are properly chosen, this technique can be used to reduce the sampling rate of a digitally sampled signal. As still another example, two or more sets of filter coefficients may be applied to the same input sequence in turn between each shift of the input sequence. When the filter coefficients are properly chosen, this technique can be used to increase the sampling rate of a digitally sampled signal.
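Two of the variations above, shifting by more than one sample and applying several coefficient sets per shift, can be sketched as follows. The names `decimating_fir` and `interpolating_fir` are hypothetical, all coefficient sets are assumed to have the same length, and the coefficient values in the usage lines are chosen only to make the arithmetic easy to follow, not as properly designed filters.

```python
def decimating_fir(coeffs, samples, step):
    """Shift the window by `step` samples between outputs (reduces the sampling rate)."""
    n = len(coeffs)
    return [sum(c * s for c, s in zip(coeffs, samples[i:i + n]))
            for i in range(0, len(samples) - n + 1, step)]


def interpolating_fir(coeff_sets, samples):
    """Apply every coefficient set in turn before each shift (raises the sampling rate)."""
    n = len(coeff_sets[0])
    out = []
    for i in range(len(samples) - n + 1):
        window = samples[i:i + n]
        for coeffs in coeff_sets:
            out.append(sum(c * s for c, s in zip(coeffs, window)))
    return out


print(decimating_fir([1, 1], [1, 2, 3, 4, 5, 6], step=2))      # prints [3, 7, 11]
print(interpolating_fir([[1, 0], [0.5, 0.5]], [2, 4, 6]))      # prints [2, 3.0, 4, 5.0]
```

In both cases the inner operation is still a dot product; only the schedule of shifts and coefficient sets changes.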
Infinite impulse response (IIR) filters are also widely used in DSP. These filters employ feedback, whereby the values of previous outputs are multiplied by filter coefficients and summed with other results to produce each successive output sample. While such filters are not described as a single dot product operation, they may often be described using several dot products.
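As a sketch of how an IIR filter may be described using several dot products, the following Python function forms each output from two of them: one over current and past inputs, and one over past outputs. The name `iir_filter`, the parameter names, and the sign convention (feedback terms are added) are assumptions made for illustration only; samples before the start of the input are treated as zero.

```python
def iir_filter(feedforward, feedback, samples):
    """y[n] = sum_k feedforward[k]*x[n-k] + sum_k feedback[k]*y[n-1-k]."""
    outputs = []
    for n in range(len(samples)):
        # First dot product: coefficients against the current and past inputs.
        acc = sum(c * samples[n - k]
                  for k, c in enumerate(feedforward) if n - k >= 0)
        # Second dot product: coefficients against previous outputs (the feedback path).
        acc += sum(c * outputs[n - 1 - k]
                   for k, c in enumerate(feedback) if n - 1 - k >= 0)
        outputs.append(acc)
    return outputs


print(iir_filter([1], [0.5], [1, 0, 0, 0]))  # prints [1, 0.5, 0.25, 0.125]
```

The single input impulse keeps producing nonzero outputs through the feedback path, which is the behavior that gives the filter its name.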
Still other important DSP algorithms use sums of products. For example, the "sum of squares of differences" operation is written as $(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2$. This is used, for example, to measure the amount of difference between the vectors a and b, each of length n. When b represents a desired or known vector, it is common to search among a set of candidate vectors for the vector that minimizes this difference. In this case, the sum of squares of differences operation is repeated many times during the operation of the complete system.
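The search described above can be sketched as an exhaustive comparison of each candidate against the known vector. The names `sum_sq_diff` and `best_match` are hypothetical, and a practical system might use a more selective search strategy than trying every candidate.

```python
def sum_sq_diff(a, b):
    """Sum of squares of differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(target, candidates):
    """Return the candidate vector that minimizes the difference from `target`."""
    return min(candidates, key=lambda c: sum_sq_diff(c, target))


print(sum_sq_diff([1, 2, 3], [1, 0, 5]))                              # prints 8
print(best_match([1, 2, 3], [[0, 0, 0], [1, 2, 4], [3, 3, 3]]))       # prints [1, 2, 4]
```

The search performs one subtraction, one multiplication, and one addition per vector element per candidate, so the operation count grows with both the vector length and the number of candidates.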
The precision requirements for these multiplication and addition operations can vary tremendously, as can the desired representations of the numbers involved. For example, in some applications it is desired to use floating point number representations; in others, the fixed point representation is sufficient and is more cost effective. Among fixed point representations, the number of integral and fractional digits can vary, as can the total number of digits. Additionally, the numbers may be signed or unsigned. Beyond the data representations themselves, certain details of the processing operations are important. For example, multiplication and addition operations produce outputs with a greater number of digits than their inputs. Thus, when such operations are composed, the number of digits in the results can grow dramatically. Commonly, the exact results include digits that do not represent useful information, so some digits are discarded using truncation and rounding. The art of discarding digits that are not useful is both important and complex.
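The growth in digits and the effect of discarding them can be illustrated with a small fixed point sketch. The format (unsigned, 8 fractional bits), the constant `FRAC_BITS`, and the function names are assumptions chosen for illustration; rounding here means adding half of the last kept digit before discarding, which is only one of several possible rounding rules.

```python
FRAC_BITS = 8  # assumed format: unsigned fixed point with 8 fractional bits


def mul_truncate(a, b):
    """Multiply two fixed point values, then discard the low FRAC_BITS digits."""
    return (a * b) >> FRAC_BITS


def mul_round(a, b):
    """Same product, but add half of the last kept digit before discarding."""
    return (a * b + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


# Digit growth: 8-bit inputs can give a 16-bit product and a 9-bit sum.
print((255 * 255).bit_length(), (255 + 255).bit_length())  # prints 16 9

a, b = 384, 77  # 1.5 and about 0.3008 when scaled by 2**FRAC_BITS
print(mul_truncate(a, b), mul_round(a, b))  # prints 115 116
```

The exact product here is 115.5 in units of the last kept digit, so the two rules give results that differ by one unit; over a long accumulation such differences can add up, which is part of why the choice of rule matters.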
The precision requirements for the multiplication and addition operations are generally related to: the precision of the input data; the precision of the coefficients; the type of processing algorithm; and certain parameters of that algorithm such as how truncation and rounding are performed. The analysis of these requirements is sufficiently complex that a whole branch of mathematics, known as Numerical Analysis, has been developed for them.
In response to the widespread need for fast multiplication and addition with a variety of precisions and data representations, an extensive literature has been created and many hardware and software implementations have been developed. For most implementations, the complexity increases roughly as $N \times M$, where $N$ and $M$ are the number of bits of the two input operands. Thus, for $N$ by $N$ multiplication, the complexity increases as $N^2$. Algorithms are known that reduce this complexity for very large operands, but for most applications, the operand sizes are not large enough to make these algorithms practically useful. On the other hand, many ideas have been developed that do effectively exploit properties of hardware technologies and multiplication algorithms to speed up implementations having a particular precision and numerical representation.
The straightforward approach to multiplication is adding up a set of appropriately shifted partial products, each generated by multiplying the multiplicand by one of the digits of the multiplier. The only difficulty in addition is the handling of carries between digits: since the carry out from a particular digit depends on the carry into that digit, the carry propagation aspect of addition is inherently sequential. Since it is possible that a carry may propagate across all the digits of a sum, the number of sequential steps required for the addition is equal to the number of digits being added. Many techniques are known for reducing the maximum number of sequential steps required for the addition; however, these techniques generally require more hardware.
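The straightforward approach can be modeled in software as follows. This is a behavioral sketch for non-negative binary integers only, with hypothetical function names; it models the sequential carry chain of a simple adder rather than any particular circuit.

```python
def partial_products(multiplicand, multiplier):
    """One shifted partial product per multiplier bit (zero where the bit is 0)."""
    return [(multiplicand << i) if (multiplier >> i) & 1 else 0
            for i in range(multiplier.bit_length())]


def ripple_carry_add(x, y, width):
    """Add bit by bit; each position must wait for the carry from the one below."""
    carry, total = 0, 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        total |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (carry & (xi ^ yi))
    return total | (carry << width)


def shift_add_multiply(a, b, width):
    """Accumulate the partial products of two `width`-bit operands one at a time."""
    acc = 0
    for pp in partial_products(a, b):
        acc = ripple_carry_add(acc, pp, 2 * width)
    return acc


print(partial_products(13, 11))        # prints [13, 26, 0, 104]
print(shift_add_multiply(13, 11, 4))   # prints 143
print(ripple_carry_add(255, 1, 8))     # prints 256
```

The last line is the worst case mentioned above: adding 1 to 255 makes the carry travel through all eight bit positions before the result is known.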
Many hardware designs for fast multiplication embody an extended version of the straightforward multiplication algorithm, consisting of a first part that generates partial products, a second part that sums the partial products to two numbers (referred to as "carry" or "C" and "save" or "S") whose sum is the correct answer, and a third part that adds together C and S to produce the answer. The partial product generation may include any form of multiplicand preprocessing, such as Booth encoding. The numbers C and S are developed in such a way that carry propagation is largely or completely avoided during the second part. The apparatus implementing the second part is generally known as a "Carry Save Adder," sometimes abbreviated "CSA." Carry propagation is unavoidable during the third part of the multiplication algorithm, but only two numbers are then involved, and any of the known techniques can be used to speed up the addition. The third part of this multiplication algorithm is also called the "Carry Propagate Adder," sometimes abbreviated "CPA".
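The three parts described above can be sketched in software as follows. The sketch assumes non-negative integers, uses hypothetical function names, and folds the partial products into the S and C words one at a time; hardware designs typically arrange the same reduction as an array or tree, and may preprocess the operands first.

```python
def carry_save_step(x, y, z):
    """Reduce three numbers to two, (s, c), with s + c == x + y + z.

    Every bit position is computed independently, so no carry propagates
    from one position to the next within this step.
    """
    s = x ^ y ^ z                                  # per-bit sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1         # per-bit carries, moved up one place
    return s, c


def csa_multiply(a, b):
    # Part 1: generate the nonzero shifted partial products.
    pps = [a << i for i in range(b.bit_length()) if (b >> i) & 1]
    # Part 2: reduce all partial products to the two numbers S and C.
    s, c = 0, 0
    for pp in pps:
        s, c = carry_save_step(s, c, pp)
    # Part 3: a single carry-propagate addition produces the answer.
    return s + c


print(carry_save_step(5, 3, 6))   # prints (0, 14)
print(csa_multiply(13, 11))       # prints 143
```

Only the final `s + c` involves carry propagation, which mirrors the point made above that the third part handles just two numbers regardless of how many partial products were generated.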
The variations among hardware multiplier designs of this type generally involve one or more of the following: the method for generating partial products, the method for reducing them to numbers C and S, the method for performing the final addition of numbers C and S, and the method for modifying the partial products and/or carry save adder to accommodate signed number representations.
Because of the inherent complexity of multiplication, fast multiplication hardware has commonly been developed for a single number representation at a single precision. Certain variations cause few design changes, such as signed versus unsigned numbers. For this reason, such variations are commonly found within a single hardware multiplier. Other variations can be provided by appropriate modifications of the inputs and/or outputs. For example, the position of integral and fractional parts in a fixed point multiplier can be varied by shifting inputs and outputs; smaller operands can be accommodated by padding the inputs with zero or sign digits as appropriate. This padding of input digits has a significant drawback: since the inherent complexity of an $N$ by $N$ multiplier increases as $N^2$, the use of half-length operands reduces the inherent complexity by a factor of 4, which corresponds intuitively to using 1/4 of the multiplier hardware. When half-length operands are padded to full length, the remaining 3/4 of the hardware therefore performs no useful work.
More generally, systems applications may use several of the DSP algorithms that were just briefly described, and may use other algorithms involving multiplications and additions as well. Depending on the total throughput required by the application, it may be necessary to provide dedicated hardware multiplication and addition circuits for each operation through which data flows in fixed connection patterns, or on the other hand, it may be possible to reuse one or more hardware multiplication and addition circuits with data flows directed by a control element. An example of the second approach is a programmable DSP chip or RISC CPU chip containing hardware multiplication circuitry. These programmable circuits usually implement complex numerical algorithms by the sequential composition of simpler operations into and out of register files that store intermediate results, coefficients, and so on. For example, a sum of squares of differences algorithm may be implemented by a first operation that takes the difference of two numbers, a second operation that squares the result, and a third operation that accumulates the result of the second operation into a running sum. In case each operation takes a single cycle, the algorithm would then be completed in 3 cycles.
However, since the inherent complexity of multiplication is higher than that of addition or subtraction, hardware designers often optimize the clock speed of their designs by pipelining the multiplication operation, so that it completes after more than one cycle. For example, a particular multiplier design might complete each multiplication in 3 cycles but allow a new multiply operation to be started on every cycle. In such a case, the sum of squares of differences algorithm discussed earlier might complete in 5 cycles: one for the subtraction, three for the multiplication, and one for the accumulation. Depending on certain details of the hardware design, it might be possible to overlap the calculation for the next pair of vector elements so that on average, each pair of elements would be subtracted, squared, and summed in 3 cycles.
While many design variations are possible that involve more or less parallel hardware, it is generally desirable for designs to require as few cycles as possible to complete an algorithm; this is especially important for the most widely used algorithms, including those mentioned in the foregoing. If a first design uses K cycles while a second uses L > K cycles, then the first design is also more cost-effective if it uses less than L/K times as much hardware as the second. Implementers of systems applications generally desire designs that are fast, cost-effective, and reconfigurable.