The present invention relates generally to the field of adaptive systems, and, more particularly, to an efficient coefficient adaptation process for adaptive filters.
An adaptive system is a system which searches for improved performance using a computational algorithm to adjust certain parameters or weights. An adaptive filter is a computational device that attempts to model the input-output relationship between two signals in real time in an interactive manner. Adaptive filters are used, for example, in communication systems for echo cancellation and line equalization. An adaptive filter is also suitable for use in real-time control systems for different kinds of applications related to real-time optimization. Adaptive signal processing is expanding in other fields as well, such as in radar, sonar, seismology, and biomedical electronics.
An adaptive filter is defined by four aspects: the input signal x(n) being processed by the filter; the structure that defines how the filter output y(n) is computed from its input signal x(n); the filter parameters that can be iteratively changed to alter the input-output relationship of the filter; and the adaptive algorithm that describes how parameters are adjusted from one time instant to the next.
An adaptive filter may be implemented as an open-loop filter or as a closed-loop filter with a performance feedback feature. The algorithm operates in an iterative manner to update the adjustable parameters with the arrival of new data and current signal performance feedback. With each iteration, the system learns more and more about the characteristics of the input signal x(n), and the signal processor makes adjustments to the current set of parameters based on the latest system performance through an error signal e(n). The optimal set of values of the adjustable parameters is then approached sequentially.
Adaptive filters are often realized as a set of program instructions running on a digital signal processor (DSP). FIG. 1 shows a general adaptive filtering process. Generally, any system with a finite number of parameters affecting how the output y(n) is computed from the input x(n) could be used as the adaptive filter 10 in FIG. 1.
The coefficient vector T(n) of the filter is defined by the equation T(n)=[t1(n)t2(n) . . . tN(n)]T. As mentioned above, the input of the filter is x(n), while the output of the filter is y(n). The desired response signal is d(n). The error signal e(n) represents the difference between the desired signal d(n) and the real output y(n). The most frequently used structure of an adaptive filter is the finite-impulse-response (FIR) filter, which is shown in FIG. 2.
In FIG. 2, the unit zxe2x88x921 is called the delay unit. The filter itself is based on convolution. A complete computing step is called a xe2x80x9ctap.xe2x80x9d FIG. 2 shows a suitable architecture because of its linearity. When performing the filtering process, the coefficient set is used to find the output through convolution using the following equation 1:                               Y          ⁡                      (            n            )                          =                              ∑                          i              =              1                        n                    ⁢                                    t              ⁡                              (                i                )                                      xc3x97                          (                              n                -                i                            )                                                          Eq        .                  xe2x80x83                ⁢        1            
The output y(n) might not be the desired output. That is, y(n) may be very close to the desired signal d(n), but not close enough. In that case, the adaptation algorithm would be executed to correct the coefficient set so that the output y(n) will gradually approach the desired signal d(n). The desired signal d(n) is unknown and changes all the time. Therefore, the adaptive filter is a real-time closed-loop feedback system which is adapting all the time in order to follow the desired signal of d(n).
In a high quality adaptive filter, the coefficient set is constantly adapting, which requires a lot of computing power, making the adaptive filter expensive. The most popular adaptation algorithm is called the normal least-mean-square algorithm (LMS or NLMS). The LMS algorithm makes use of the so-called xe2x80x9csteepest descentxe2x80x9d approach, deriving an estimation of the gradient vector based on a limited number of data samples.
This adaptation algorithm includes the processes of convergence control and coefficient adaptation. Convergence control will not be performed in every tap, meaning that the computing power associated therewith is not very high. For this reason, the present invention does not deal with optimization of the convergence control process. Coefficient adaptation, however, should be performed for all taps during each sample in order to achieve a high-performance adaptive filter. Normally, then, most of the computing power is consumed when performing coefficient adaptation.
A tap of adaptive computing includes a convolution step and a coefficient adaptation step. The task involved in tap i is shown by the following equations:
y(i)=y(ixe2x88x921)+x(i)t(nxe2x88x92i)xe2x80x83xe2x80x83eq. 2
tnew(i+1)=told(i+1)xc2x1(convergence factor)*x(i)xe2x80x83xe2x80x83eq. 3
In equation 3 above, tnew(i+1) is the new coefficient after adaptation, and told(i+1) is the old coefficient which is going to be adapted.
Because there are many taps in a sample of processing along with variable coefficients, a digital signal processor (DSP) is needed to execute the so-called LMS algorithm. FIG. 3 shows the architecture of a typical digital signal processor.
According to the algorithm associated with a typical digital signal processor, a tap of the LMS computing process includes seven steps. First, the data is loaded from the data memory DM 20 to data register DR 28. Next, the coefficient is loaded from the tap memory or coefficient memory TM 22 to coefficient register TR 30. The register ACR 36 computes coefficient adaptation using the equation tnew(i)=told(i)+(convergence factor) * DR. The new coefficient is then moved to a buffer BR 26 (BR=ACR). In the next step, ACB is moved to ACR and convolution is performed such that ACR=ACR+data*BR. Next, the contents of ACR are moved to ACB, and finally tnew(n) is moved to the coefficient memory TM 22.
The number of executing clock cycles for one tap may vary for different architectures. For example, seven steps are necessary when using a standard central processing unit (CPU). When using advanced digital signal processors based on multiplier-accumulator hardware, the number of clock cycles for one tap of convolution and coefficient adaptation could be between two and four.
The coefficient memory has two memory accesses: one read and one write. Because both memory accesses are required for coefficient adaptation, it is difficult to execute all seven steps in one clock cycle. A dual port memory cannot be used because the memory read and write addresses could depend on each other. An asynchronous memory may be used as the coefficient memory, but problems may result including relatively slow speed and high power consumption.
In an advanced available digital processor, a special double-speed clock (compared to the process clock) is applied just to the coefficient memory, and therefore the coefficient memory can be accessed twice in one process clock cycle. However, there is an obvious drawback. The memory speed was previously slower than the logic speed; if twice the memory is used per process cycle, the system speed will not be able to be half of the memory speed. Therefore, the system clock speed cannot be high.
It would be beneficial to reduce the computing power in an adaptive system. One essential problem preventing the reduction of the computing power is that a synchronous memory cannot be physically implemented to read from and write to any position in a single clock cycle. An asynchronous memory can read from and write to any position in a single CPU clock cycle, but only when the CPU clock is much slower than the memory access time.
Therefore, when using a synchronous memory, multiple computing steps must be used to execute one tap of adaptive filtering. And, when using an asynchronous memory, although one clock cycle can be used to execute all the steps in a tap, this can only be done when the system speed is low. Another possibility is to double the memory clock so that the memory can be accessed twice in one CPU clock cycle. The drawback, however, is that the double-speed memory becomes a xe2x80x9cbottleneckxe2x80x9d which limits the performance and which will be the main source of power consumption.
In modern communication and control systems, high performance adaptive filters with very low power consumption are needed. An object of the present invention is to move all computing steps into one clock cycle without introducing extra constraints or bottlenecks. The present invention accomplishes higher performance and lower power than prior art systems while providing sufficient flexibility. The present invention accomplishes the goal of one cycle solution for a complete one-tap computing process. The present invention may be implemented in hardware through either an adaptive filter accelerator in a general purpose digital signal processor (DSP) chip or through a stand-alone chip for both adaptive filtering and other normal filtering tasks.
The present invention comprises three main parts: the hardware partition, the simplified multiplication operation, and the novel method of memory read-and-write in one clock cycle using a conventional single port static random access memory (SRAM).
In summary, the present invention in one embodiment provides a system for accelerating least-mean-square algorithm-based coefficient adaptation, comprising: a data memory for storing an input signal; a coefficient memory for storing a coefficient vector; a multiplication and accumulation unit for reading the input signal from the data memory and the coefficient vector from the coefficient memory to perform convolution; and a coefficient adaptation unit separate from the multiplication and accumulation unit for reading the input signal from the data memory and for reading the coefficient vector from the coefficient memory to perform coefficient adaptation at the same time that the multiplication and accumulation unit performs the reading to produce an adapted coefficient vector which is written back into the coefficient memory for use by the multiplication and accumulation unit during a next iteration of convolution to produce an output signal, wherein each tap is executed in one machine clock cycle.
The coefficient adaptation unit may perform multiplication using an arithmetic right shift controlled by a sign-magnitude coded shift control vector. The coefficient memory may comprise an even coefficient memory and an odd coefficient memory, each storing half of the coefficient vector. The coefficient memory may have a normal memory access mode and a coefficient adaptation mode, and during the coefficient adaptation mode the coefficient memory may be read from and written to in one clock cycle. For each iteration of convolution, an updated coefficient vector is used during each tap.
The present invention in another embodiment provides a method for accelerating least-mean-square algorithm-based coefficient adaptation, comprising the steps of: (a) storing an input signal in a data memory; (b) storing a coefficient vector in a coefficient memory; (c) reading the input signal from the data memory and the coefficient vector from the coefficient memory to perform convolution; and (d) reading the input signal from the data memory and reading the coefficient vector from the coefficient memory to perform coefficient adaptation at the same time as the reading of step (c) to produce an adapted coefficient vector which is written back into the coefficient memory for use in a next iteration of convolution to produce an output signal, wherein each tap is executed in one machine clock cycle.