In signal processing, and especially in digital signal processing, many of the necessary operations take the form of a finite impulse response (FIR) filter, also known as a weighted average. In this well-known operation, a finite set of values called filter coefficients or tap weights, h(k), for k = 0, ..., N−1, and the values of an input data sequence, x(k), are used to create output sequence values, y(n), by the rule $y(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$. Because the selected set of input values shifts by one position each time n is incremented by 1, this process is also called a sliding window sum. To calculate each y(n), each coefficient is multiplied by its paired input value and the product is added to a running sum, a process termed multiply-accumulate (MAC).
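The rule for y(n) can be illustrated with a short software sketch. It is not taken from the described system: the function name `fir_filter` is hypothetical, and the sketch assumes that input values before the start of the sequence (x(m) for m < 0) are treated as zero, which the text above does not specify.

```python
def fir_filter(h, x):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k).

    Assumption (not from the source text): x(m) is taken as 0 for m < 0.
    """
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0  # running sum for this output value
        for k in range(n_taps):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate (MAC)
        y.append(acc)
    return y


# Feeding a unit impulse returns the tap weights themselves.
impulse_response = fir_filter([1, 2, 3], [1, 0, 0, 0])
# A two-tap filter with equal weights averages neighboring inputs.
smoothed = fir_filter([0.5, 0.5], [2, 4, 6])
```

Each pass through the inner loop is one MAC, so an N-tap filter needs N MAC operations per output value once the window is full.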
FIR operations are used extensively in signal processing to select desired frequencies, remove noise, and detect radar signals, among other applications. As the form of the equation shows, FIR filtering operations are well suited to implementation on computer hardware. In one such implementation, the filter coefficients are loaded into a dedicated memory array; then, for each value y(n), the corresponding portion of the inputs is loaded into a second memory array, and the MAC operation is performed pairwise on the aligned coefficients and inputs.
Although FIR operations can be, and often are, implemented in software on a general-purpose computer, many signal processing applications require very fast computation of the FIR operations. These cases often require dedicated implementation on special-purpose digital hardware, such as digital signal processors (DSPs), on reconfigurable platforms such as field programmable gate arrays (FPGAs), or on application specific integrated circuits (ASICs). At this level, the specific details of the hardware implementation, such as how the values are represented and internally stored, their data types, and the data bus sizes, become important for obtaining very high speed FIR operations. One goal for efficient hardware implementation is to have a MAC operation occur on every cycle. Achieving even higher MAC rates is especially worthwhile.
A general method and system, known in the art, for achieving fast FIR operations is shown in FIG. 1. Signal data or coefficients are moved from the system's memory through an address generator (AG) and stored in the system's quickly accessible memory locations, called the register file (Reg File). On each cycle, two values are moved from the Reg File into the MAC unit, their product is calculated and summed into the accumulated value, and the result is written back to the accumulation register location.
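The per-cycle behavior of the MAC unit can be modeled in software as follows. This is only an illustrative sketch: the name `mac_unit_run` and the use of two Python lists to stand in for register file contents are assumptions, and the model ignores the address generator and all timing details of FIG. 1.

```python
def mac_unit_run(coeff_regs, data_regs):
    """Model a MAC unit that consumes one operand pair per cycle.

    coeff_regs and data_regs are hypothetical stand-ins for values
    already held in the register file. Returns (accumulator, cycles).
    """
    acc = 0     # accumulation register
    cycles = 0
    for h_k, x_k in zip(coeff_regs, data_regs):
        product = h_k * x_k  # multiply the two values read this cycle
        acc = acc + product  # sum into the accumulated value, write back
        cycles += 1          # one MAC operation per cycle
    return acc, cycles


result, cycles_used = mac_unit_run([1, 2, 3], [4, 5, 6])
```

In this idealized model every cycle performs useful work, so a three-tap sum completes in exactly three cycles.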
For normal ongoing operation, the amount of data read into the register file must balance the amount consumed by the MAC unit. Further, data values going into the MAC must be complete; if there is a delay in accessing a data value needed by the MAC, then the MAC must wait a cycle (or more) until it obtains a complete data value for the multiply and accumulate calculation. Such a pause is called a bubble cycle. It represents an inefficiency in the overall operation of the system. Preventing such inefficiency is an overall goal of the present invention. Another goal of the present invention is to achieve a rate of more than one MAC operation per cycle.
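The effect of bubble cycles can be made concrete with a toy cycle count. The sketch below is an assumption-laden model, not the mechanism of the invention: the function name `count_bubbles` and the idea of describing memory traffic as a list of operand-pair arrivals per cycle are both hypothetical, and the MAC unit is assumed to consume at most one complete pair per cycle.

```python
def count_bubbles(pairs_arriving, total_macs):
    """Count the cycles and bubble cycles needed for total_macs MACs.

    pairs_arriving[c] is the (hypothetical) number of complete operand
    pairs that reach the register file at the start of cycle c.
    """
    available = 0  # complete operand pairs waiting in the register file
    done = 0       # MAC operations completed so far
    bubbles = 0    # cycles in which the MAC unit had to wait
    cycle = 0
    while done < total_macs:
        if cycle < len(pairs_arriving):
            available += pairs_arriving[cycle]
        if available > 0:
            available -= 1
            done += 1
        elif cycle >= len(pairs_arriving):
            # No data left to arrive, so the work can never finish.
            raise ValueError("not enough operand pairs supplied")
        else:
            bubbles += 1  # nothing complete to consume: a bubble cycle
        cycle += 1
    return cycle, bubbles


steady = count_bubbles([1, 1, 1, 1], 4)         # balanced supply
starved = count_bubbles([1, 0, 1, 0, 1, 1], 4)  # supply stalls twice
buffered = count_bubbles([2, 0, 2, 0], 4)       # bursts absorbed by buffering
```

In this model the balanced and the buffered supplies both finish four MACs in four cycles with no bubbles, while the stalled supply takes six cycles, two of them bubbles.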