This invention relates generally to the field of digital signal processors. More specifically, this invention relates to a circuit architecture and method for implementing a delayed adaptive least-mean-square digital filter in a general purpose, programmable digital signal processor.
Adaptive digital filters may be used to perform many different tasks, including system identification, equalization, echo cancellation, active noise control, adaptive beamforming, and adaptive reception (e.g. in smart antennas). One method of adjusting the coefficients of an adaptive digital filter is by way of a least-mean-square ("LMS") procedure, in which the filter coefficients are updated based on the error between the LMS filter output and a desired filter output.
More specifically, the error desired to be minimized is the difference between the filter's calculated output, which is calculated by convolving the most recent known input signal sequence with the filter transfer function, and the filter's desired output. The desired output may be based on the measured output of the system. A digital filter whose transfer function is based on a finite number of data samples is called a finite impulse response ("FIR") filter.
For a filter with n coefficients, each coefficient corresponding to a tap, the system retains the most recent n samples of a data sequence and multiplies them by the corresponding n coefficients of the filter to get the calculated output. The data sequence xm includes the last n data samples x0, x1, x2, . . . , xk−1, xk, xk+1, . . . , xn−1, the most recent retained data sample being x0, and the FIR filter includes coefficients h0, h1, h2, . . . , hk−1, hk, hk+1, . . . , hn−1. Every time a data sample is taken (in a telephone system that samples a data signal at 8 kHz, this occurs every 125 μs), the LMS procedure requires two main steps that involve each data sample and coefficient: (1) calculating the filter output and (2) updating the coefficients. (Hereinafter, the time period between data sample acquisitions will be referred to as a "frame.") The filter output is calculated by multiplying the data sequence samples by the FIR coefficients and summing the products, i.e.

$$y = \sum_{k=0}^{n-1} x_k h_k .$$
This requires n multiplications and n additions (x0*h0+x1*h1+x2*h2, etc.). The updating of coefficients requires two substeps. First, an update term is calculated by multiplying each data sample xk by a fraction β of the error e (i.e. xk*βe). Next, the corresponding coefficient is updated by adding the update term to the old coefficient (e.g. hk(new)=hk(old)+xk*βe). This coefficient updating also requires n multiplications and n additions. Because the calculation to determine the βe term can be performed independently of the updating routine, this multiplication does not need to be performed for each individual coefficient.
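By way of illustration only, the two steps described above may be sketched in Python for a single frame. The function name `lms_frame` and its parameter names are hypothetical and do not appear in the text; the error is taken here as the desired output minus the calculated output, which is an assumption about sign convention.

```python
def lms_frame(x, h, desired, beta):
    """One frame of the (non-delayed) LMS procedure.

    x       -- the n retained data samples, x[0] being the most recent
    h       -- the n filter coefficients
    desired -- the desired filter output for this frame
    beta    -- the fraction of the error used in the update
    Returns the filter output y and the list of updated coefficients.
    """
    # Step 1: filter output, n multiplications and n additions.
    y = sum(xk * hk for xk, hk in zip(x, h))

    # The beta*e term is formed once per frame, not once per coefficient.
    beta_e = beta * (desired - y)

    # Step 2: coefficient update, hk(new) = hk(old) + xk * beta*e.
    h_new = [hk + xk * beta_e for xk, hk in zip(x, h)]
    return y, h_new
```

In this sketch the product β·e is computed a single time outside the per-coefficient loop, which reflects the observation that this multiplication need not be repeated for each tap.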
In an attempt to simplify memory accesses and minimize power, some conventional implementations perform a "delayed" version of the LMS procedure, in which the data sample acquired during the previous frame and the error based on the data samples retained during the previous frame are used to update the coefficients (e.g. hk(new)=hk(old)+xk+1*βeprev). Conventional digital signal processor (DSP) filter architectures that perform this LMS procedure include an arithmetic logic unit ("ALU") and a multiply and accumulate unit ("MAC"). The ALU is capable of performing addition, subtraction, or boolean algebra on two numbers and placing the result in an accumulator. The MAC is capable of multiplying two numbers, adding this result to another number, and placing the result in an accumulator. To calculate the filter output as well as to update the coefficients, two multiplications and two additions are required to be performed for each tap. Because there is only one multiplier available, two cycles of the clock must be used for each tap. For example, in an n-tap filter, xk is kept in the data memory buffer and hk is kept in the coefficient memory buffer. The error term, βeprev, is calculated based on the previous frame's data samples and is stored in a temporary register because its value is constant for all n taps. The first cycle of the LMS procedure takes xk+1 from the data memory buffer and uses the multiplier of the MAC to calculate the update term, xk+1*βeprev. That update term is stored in a first accumulator. The other cycle of the LMS procedure uses the multiplier of the MAC to calculate the part of the FIR output due to data sample xk, xk*hk, and that result is stored in a second accumulator. This cycle also uses the ALU to add the contents of the first accumulator (which holds the update term) to the coefficient hk, and the result is put back into the first accumulator.
Then, at the beginning of the first cycle of the LMS procedure for the next tap, the contents of the first accumulator are stored in the coefficient memory buffer, writing over the old hk and leaving the first accumulator to store the update term corresponding to xk and hk−1.
Thus, the LMS procedure requires two clock cycles for each coefficient: one for the coefficient update term multiplication and one for the FIR output multiplication and coefficient update addition. Because this LMS procedure is constantly being performed, any savings in the number of clock cycles that it takes could result in significant time and power savings.
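The conventional two-cycle sequence described above may be modeled, purely as a behavioral sketch, as follows. The function name `delayed_lms_two_cycle`, the variable names, and the cycle counter are hypothetical. The sketch assumes that the data buffer holds n+1 samples so that x[k+1] is available for the last tap, and that taps are processed from the highest index to the lowest, as the reference to hk−1 as the next tap suggests.

```python
def delayed_lms_two_cycle(x, h, beta_e_prev):
    """Behavioral model of a single-MAC, single-ALU delayed LMS frame.

    x           -- n+1 data samples, x[0] newest (assumed buffer layout)
    h           -- n coefficients
    beta_e_prev -- error term from the previous frame, held in a register
    Returns (FIR output, updated coefficients, clock cycles used).
    """
    n = len(h)
    h = list(h)          # work on a copy of the coefficient buffer
    acc2 = 0.0           # second accumulator: running FIR output
    cycles = 0
    for k in range(n - 1, -1, -1):
        # First cycle: the MAC multiplier forms the update term.
        acc1 = x[k + 1] * beta_e_prev
        cycles += 1
        # Second cycle: the MAC accumulates xk*hk; the ALU adds the
        # update term to the old coefficient.
        acc2 += x[k] * h[k]
        acc1 = acc1 + h[k]
        cycles += 1
        # Write-back, modeled here immediately; the text places it at
        # the start of the next tap's first cycle.
        h[k] = acc1
    return acc2, h, cycles
```

For an n-tap filter this model reports 2n cycles per frame, which is the cost the text attributes to the single-multiplier arrangement.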
Although application-specific LMS filter implementations may reduce the number of clock cycles below two cycles per coefficient, a need has arisen for an improved adaptive LMS digital filter which performs the LMS procedure in a programmable digital signal processor in one clock cycle per coefficient. In accordance with the present invention, a method for implementing a delayed adaptive LMS filter in a programmable DSP, in which the filter has one filter coefficient per tap and acquires a new data sample each frame, includes calculating an FIR filter output and updating the filter coefficients using an error term based on the FIR filter output calculated during the preceding frame. The calculations for each tap are performed in a single clock cycle.
Preferably, the FIR filter output is calculated by multiplying, in each clock cycle, a data sample and a corresponding coefficient and accumulating the products. Each filter coefficient is preferably updated by multiplying, in each clock cycle, a data sample and the error term to form an update term, and adding the update term to the coefficient. Preferably, the error term includes an adaptation gain. Preferably, the error term is the difference between a desired output and the FIR filter output calculated during the preceding frame. The desired output is preferably based on a system output value measured during the preceding frame.
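As a non-limiting sketch, the frame-to-frame behavior of the delayed method may be written as follows. The function name `run_delayed_lms`, the zero initial coefficients, and the zero initial error term are assumptions made for illustration. The sample window keeps one sample beyond the n used for the FIR output so that the update can use the data as it stood during the preceding frame.

```python
def run_delayed_lms(samples, desired, n_taps, beta):
    """Run a delayed LMS filter over a sequence of frames.

    samples -- input data samples, one per frame
    desired -- desired output for each frame
    n_taps  -- number of filter coefficients
    beta    -- adaptation gain folded into the error term
    Returns (list of FIR outputs, final coefficients).
    """
    h = [0.0] * n_taps                 # assumed initial coefficients
    window = [0.0] * (n_taps + 1)      # window[0] is the newest sample
    beta_e_prev = 0.0                  # no error exists before frame 0
    outputs = []
    for s, d in zip(samples, desired):
        window = [s] + window[:-1]     # acquire this frame's sample
        # FIR output from the current samples and current coefficients.
        y = sum(window[k] * h[k] for k in range(n_taps))
        # Update with the preceding frame's error term and the sample
        # that occupied position k during the preceding frame.
        h = [h[k] + window[k + 1] * beta_e_prev for k in range(n_taps)]
        # Error term to be used in the following frame.
        beta_e_prev = beta * (d - y)
        outputs.append(y)
    return outputs, h
```

With a one-tap filter, a constant input of 1, a constant desired output of 2, and a gain of 0.5, the coefficient does not move until the second frame, since the first frame has no preceding error to apply.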
Also in accordance with the present invention is a method for implementing a one-clock-cycle-per-tap delayed adaptive least-mean-square filter in a programmable DSP, in which the filter acquires a new data sample each frame. This method includes reading a coefficient from a coefficient buffer, reading from a data buffer a first data sample which corresponds to the coefficient, multiplying the coefficient by the first data sample and accumulating the product in a register to form an FIR filter output, updating the coefficient by adding to the coefficient the product of an error term, calculated during the preceding frame, and a second data sample, acquired during the frame preceding the frame in which the first data sample was acquired, and writing the immediately preceding coefficient to the coefficient buffer. Preferably, the error term includes an adaptation gain. Preferably, the error term is the difference between a desired output and the FIR filter output calculated during the preceding frame. The desired output is preferably based on a system output value measured during the preceding frame.
In another embodiment of this method, in addition to reading the first data sample from the data buffer, a second data sample, acquired during the frame preceding the frame in which the first data sample was acquired, is also read. The updated coefficient is then formed by adding to the coefficient the product of the second data sample and the error term.
Also in accordance with the present invention is a circuit architecture in a programmable DSP for implementing a delayed adaptive LMS filter in one clock cycle per tap, in which the filter acquires a new data sample each frame. The circuit includes two multiply and accumulate circuits (MACs) and an arithmetic logic unit (ALU). The first MAC multiplies a data sample and a corresponding coefficient to generate an FIR filter output. The second MAC multiplies the data sample and an error term, calculated during the preceding frame, to generate a current clock cycle update term. The ALU sums the previous cycle's update term and the coefficient in order to update the coefficient during the next clock cycle. Preferably, the circuit architecture also includes a data buffer to hold data samples and a coefficient buffer to hold the current values of filter coefficients. The coefficient buffer is preferably a random access memory (RAM) that can be accessed at least twice in one clock cycle. Such a RAM could be a dual-access RAM (DARAM), a dual-port RAM, or banked memory.
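The data flow among the two MACs and the ALU may be illustrated by the following behavioral sketch, which is not a description of any particular hardware. The function name `delayed_lms_dual_mac` and the variable names are hypothetical. The sketch assumes a data buffer of n+1 samples, taps processed from the highest index to the lowest, and a prologue step that primes the first update term before the per-tap loop begins.

```python
def delayed_lms_dual_mac(x, h, beta_e_prev):
    """Behavioral model of the two-MAC plus ALU arrangement.

    x           -- n+1 data samples, x[0] newest (assumed buffer layout)
    h           -- n coefficients
    beta_e_prev -- error term calculated during the preceding frame
    Returns (FIR output, updated coefficients, per-tap clock cycles).
    """
    n = len(h)
    h = list(h)
    fir_acc = 0.0
    upd_prev = x[n] * beta_e_prev      # assumed prologue: prime pipeline
    pending = None                     # coefficient awaiting write-back
    cycles = 0
    for k in range(n - 1, -1, -1):
        if pending is not None:
            idx, val = pending
            h[idx] = val               # write previous tap's coefficient
        coeff = h[k]                   # second buffer access: read hk
        fir_acc += x[k] * coeff        # first MAC: FIR product
        upd_cur = x[k] * beta_e_prev   # second MAC: term for tap k-1
        pending = (k, coeff + upd_prev)  # ALU: hk + previous cycle's term
        upd_prev = upd_cur
        cycles += 1
    if pending is not None:            # assumed epilogue: final write
        idx, val = pending
        h[idx] = val
    return fir_acc, h, cycles
```

Each pass through the loop reads one coefficient and writes another, which is consistent with the stated preference for a coefficient buffer that can be accessed at least twice in one clock cycle. The loop count is n, exclusive of the assumed prologue and epilogue steps.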
In another embodiment of this circuit architecture, the circuit includes only the two MACs and does not include an ALU. The first MAC operates as before, i.e. multiplying the first data sample and the corresponding coefficient to generate the FIR filter output. The second MAC multiplies a second data sample, acquired during the frame preceding the frame in which the first data sample was acquired, and the error term to generate the update term and then sums the update term and the coefficient in order to update the coefficient during the next clock cycle. This embodiment preferably includes both a coefficient buffer and a data buffer, and both of these buffers are preferably RAM that is able to be accessed at least twice in one clock cycle.
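The ALU-free embodiment may likewise be sketched behaviorally. The function name `delayed_lms_two_macs_no_alu` is hypothetical, and the sketch assumes the same n+1 sample buffer layout and high-to-low tap order used for illustration elsewhere. The write-back of the updated coefficient is shown immediately for simplicity, whereas the text places it in the next clock cycle.

```python
def delayed_lms_two_macs_no_alu(x, h, beta_e_prev):
    """Behavioral model of the two-MAC embodiment without an ALU.

    x           -- n+1 data samples, x[0] newest (assumed buffer layout)
    h           -- n coefficients
    beta_e_prev -- error term calculated during the preceding frame
    Returns (FIR output, updated coefficients, per-tap clock cycles).
    """
    h = list(h)
    fir_acc = 0.0
    cycles = 0
    for k in range(len(h) - 1, -1, -1):
        coeff = h[k]
        # First MAC: multiply the first data sample by its coefficient
        # and accumulate into the FIR output.
        fir_acc += x[k] * coeff
        # Second MAC: multiply the second (one frame older) data sample
        # by the error term and add the coefficient in the same unit.
        h[k] = coeff + x[k + 1] * beta_e_prev
        cycles += 1
    return fir_acc, h, cycles
```

Because each cycle reads both x[k] and x[k+1], the data buffer is accessed twice per cycle in this model, which is consistent with the stated preference that both buffers support at least two accesses in one clock cycle.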
The present invention provides various advantages. One advantage is that the LMS filter uses the two MACs in a general purpose, programmable DSP architecture to perform the LMS procedure in a single clock cycle. As compared with conventional devices, which included only one MAC, the procedure is performed approximately twice as efficiently. Higher efficiency leads to lower power consumption. Another advantage is that one embodiment of the present invention does not require an ALU, leading to savings in hardware space and/or power over conventional devices which required an ALU. Moreover, implementing this filter in general purpose, programmable DSP modules saves money over implementations using application-specific integrated circuits.
Other technical advantages of the present invention will be readily apparent to one skilled in the art from the following figures, description, and claims.