This invention relates generally to methods and apparatus for accelerating complex, processor-intensive signal-processing algorithms, in particular algorithms in which the evaluation depends upon a single final data point.
Some complex signal-processing algorithms depend upon a single final data point to produce a processed result. One such algorithm is the finite-impulse-response (FIR) filter, which is commonly found among the algorithms evaluated by a digital signal processor (DSP).
FIG. 1 is a flowchart 10 of a direct form of a conventional FIR filter. A series of N input data samples is shifted into shift registers 111 through 11N. Thus, register 111 contains a current data sample DN and registers 112 through 11N contain a set of previous data samples D4, D3, D2, and D1.
Registers 111 through 11N present their corresponding data samples D1 through DN on like-named register output lines. Data samples D1 through DN are then multiplied in a set of multiply steps 121 through 12N by a respective set of weighting coefficients C1 through CN. Finally, an adder 13 sums the resulting weighted samples to provide a filtered output sample DF, where DF = D1C1 + D2C2 + . . . + DNCN. Output sample DF is then loaded into an output register 14.
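The weighted sum described above can be illustrated with a minimal sketch. The function name `fir_output` and the numeric values are assumptions chosen only for illustration; they do not appear in the figures.

```python
def fir_output(samples, coeffs):
    """Return DF = D1*C1 + D2*C2 + ... + DN*CN for one output sample."""
    if len(samples) != len(coeffs):
        raise ValueError("need one coefficient per data sample")
    # Each product corresponds to one multiply step; sum() plays the role of adder 13.
    return sum(d * c for d, c in zip(samples, coeffs))


# Hypothetical values, chosen only for illustration.
samples = [1, 2, 3, 4]                  # D1..D4
coeffs = [0.5, 0.25, 0.125, 0.125]      # C1..C4
df = fir_output(samples, coeffs)        # value that would be loaded into output register 14
```

The sketch computes all N products at once, which mirrors the flowchart rather than any particular hardware timing.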
FIG. 2 depicts a typical hardware implementation 20 of the flowchart of FIG. 1, like-numbered elements being the same in both Figures. For ease of illustration, FIG. 2 illustrates a four-tap filter employing weighting coefficients C1 through C4. The depicted example is limited to five input samples D1 through D5, sample D5 being the newest and sample D1 being the oldest. A register 11, including five individual registers 111 through 115, connects to a multiplier 22 via a multiplexer 24. A register block 26 stores weighting coefficients C1 through C4 in a series of registers 261 through 264 and presents the coefficients to multiplier 22 via a second multiplexer 28.
As depicted below in Table 1, the example begins with the first (oldest) data sample D1 stored in register 115, the second data sample D2 stored in register 112, the third data sample D3 stored in register 113, and the fourth and most recent data sample D4 stored in registers 111 and 114. A new data sample D5 is then received and latched into input register 111 during the first machine cycle (Cycle 1). Multiplexers 24 and 28 then provide the respective contents of registers 111 and 261 (i.e., D5 and C1) to multiplier 22. Multiplier 22 outputs the product D5C1 to an adder 25, which stores the product D5C1 in an accumulation register 29.
Registers 112 through 115 operate as shift registers. Data sample D5 is shifted into register 112 during the time that data sample D5 is presented to multiplier 22. Each data sample in shift register 11 is similarly shifted, so that for the second machine cycle (Cycle 2) data sample D1 is replaced with data sample D4, data sample D4 is replaced with data sample D3, data sample D3 is replaced with data sample D2, and data sample D2 is replaced with data sample D5 (see Table 1).
Following the foregoing multiply and shift sequence, multiplexer 24 selects the D output DOUT of register 11 while multiplexer 28 selects coefficient C2. Multiplier 22 thus supplies the product D4C2 to adder 25, which sums the product D4C2 with the product D5C1 already in accumulation register 29 and stores the sum (i.e., D4C2+D5C1) in accumulation register 29. As with data sample D5, data sample D4 is shifted into register 112 while data sample D4 is presented to multiplier 22. Each remaining register 113-115 is similarly updated, so that the contents of registers 111-115 are as depicted above for cycle three of Table 1.
The foregoing multiply, accumulate, and shift process continues until each data/coefficient pair has been presented to multiplier 22 and the resulting products have been summed in accumulation register 29. The accumulated sum is then stored in output register 14. Upon completion of the filtering of data sample D5, the contents of registers 111-115 are as depicted above for cycle four of Table 1. The filter is then prepared to receive the next data sample D6.
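The sequential multiply-accumulate process for the four-tap example can be sketched as follows. The sketch abstracts away the shift register and the multiplexers and models only the arithmetic and the cycle count. The function name `sequential_mac` and the numeric values are assumptions for illustration, and the pairing of D3 with C3 and D2 with C4 in the third and fourth cycles is inferred from the pattern of the first two cycles.

```python
def sequential_mac(samples_newest_first, coeffs):
    """Filter one sample with a single multiplier and a single adder.

    Returns the filtered value and the number of machine cycles used.
    """
    accumulator = 0           # stands in for accumulation register 29
    cycles = 0
    for d, c in zip(samples_newest_first, coeffs):
        accumulator += d * c  # multiplier 22 feeds adder 25
        cycles += 1           # one multiply-accumulate per machine cycle
    return accumulator, cycles


# Hypothetical values for D5, D4, D3, D2 and for coefficients C1..C4.
result, cycles = sequential_mac([5, 4, 3, 2], [1, 2, 3, 4])
```

With four taps the loop runs four times, one machine cycle per data/coefficient pair.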
Filter implementation 20 requires N clock cycles to filter each data sample, that is, one clock cycle for each multiply-accumulate operation performed by multiplier 22 and adder 25. Since many DSP-optimized microprocessors can produce the same result in N clock cycles, such an embodiment cannot be used to accelerate the microprocessor.
Some conventional systems employ multiple multiplier/adder pairs operating in parallel to reduce the requisite number of clock cycles and thereby improve speed performance. Unfortunately, such parallel systems are larger and more expensive than their sequential counterparts, and they require more power. There is therefore a need for a means of reducing the time required to complete the evaluation of the FIR-filter algorithm without incurring significant increases in power usage, size, and cost.
The present invention is directed to methods and apparatus for accelerating complex signal-processing tasks, such as FIR filtering. In one embodiment, an FIR-filter accelerator is connected in parallel with a data path in a conventional DSP. The accelerator calculates and maintains a number of partial results based on a selected number of prior data samples. Each time the DSP receives a new data sample for filtering, the DSP makes use of one or more partial results from the accelerator to speed the calculation of the filtered result. The accelerator then recalculates the partial results using the new data sample in preparation for a subsequent data sample.
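One possible reading of the partial-result scheme summarized above is sketched below. It is a hypothetical model and not the disclosed circuit: it assumes a single partial result that holds the weighted sum of every tap except the newest, so that the DSP needs only one multiply and one add when a new sample arrives. The names `PartialResultAccelerator` and `filter_sample` are assumptions introduced for illustration.

```python
from collections import deque


class PartialResultAccelerator:
    """Hypothetical model: keeps the weighted sum of all taps except the newest."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)                 # C1..CN
        prior = len(self.coeffs) - 1               # number of prior samples kept
        self.history = deque([0] * prior, maxlen=prior)  # newest prior sample first
        self.partial = 0

    def recalculate(self, new_sample):
        """Fold the new sample into the history and rebuild the partial result."""
        self.history.appendleft(new_sample)        # oldest sample drops off the right
        self.partial = sum(c * d for c, d in zip(self.coeffs[1:], self.history))


def filter_sample(accel, new_sample):
    """DSP side: one multiply and one add, using the precomputed partial result."""
    result = accel.coeffs[0] * new_sample + accel.partial
    accel.recalculate(new_sample)   # prepares for the subsequent data sample
    return result
```

Under these assumptions, calling `filter_sample` once per incoming sample yields the same values as a direct weighted sum over the most recent N samples, while the recalculation of the partial result happens after the filtered value is already available.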
The filter accelerator can improve the performance of the DSP even if the accelerator hardware operates at a rate slower than that of the DSP. The accelerator can therefore be produced inexpensively by exploiting proven, mass-produced, economical technologies and materials. Moreover, the accelerator can be made relatively small, as the accelerator does not require massively parallel processing means.