Speech is an acoustic signal produced by the human vocal apparatus. Physically, speech is a longitudinal sound pressure wave. A microphone converts the sound pressure wave into an electrical signal. The electrical signal can be sampled and stored in a digital format.
Currently, sample rates used for speech applications are increasing due to the transition from “conventional” transmission systems, such as ISDN or GSM, to so-called “wideband” or even “super-wideband” transmission systems. Furthermore, more and more multi-channel approaches (in terms of more than one loudspeaker and/or more than one microphone) are entering the market (e.g., voice-controlled TV or home stereo systems). As a consequence, hardware requirements of such systems, mainly in terms of computational complexity, will increase tremendously, and a need for efficient implementations arises.
In many applications, the signal waveform of an audio or speech signal is converted into a time series of signal parameter vectors. Each parameter vector represents a sequence of the signal (signal waveform). This sequence is often weighted by means of a window. Consecutive windows generally overlap. The sequences of the signal samples have a predetermined sequence length and a certain amount of overlapping. The overlapping is predetermined by a sub-sampling rate often expressed in a number of samples. The overlapping signal vectors are transformed by means of a discrete Fourier transform (DFT) into modified signal vectors (e.g., complex spectra). The discrete Fourier transform can be replaced by another transform, such as a cosine transform, a polyphase filter bank or any other appropriate transform.
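The analysis step described above can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the function names `hann`, `dft` and `analyze`, the frame length of 16 and the sub-sampling rate (hop) of 4 are assumptions chosen for the example, and a naive DFT stands in for the FFT that would be used in practice.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(x):
    """Naive O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def analyze(signal, frame_len, hop):
    """Cut the signal into overlapping frames, window each frame, transform it."""
    window = hann(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectra.append(dft(frame))
    return spectra

# Illustrative input: a sinusoid with exactly 2 periods per 16-sample frame.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
spectra = analyze(signal, frame_len=16, hop=4)

# Index of the strongest sub-band (bins 0..8) in the first frame.
peak_bin = max(range(9), key=lambda k: abs(spectra[0][k]))
```

With a hop of 4 and a frame length of 16, consecutive frames overlap by 12 samples, and each signal sample contributes to four spectra.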
The reverse process of signal analysis, called signal synthesis, generates a signal waveform from a sequence of signal description vectors, where the signal description vectors are transformed to signal subsequences that are used to reconstitute the signal waveform. In the analysis, the extraction of waveform samples is followed by a transformation applied to each vector. A well-known transformation is the discrete Fourier transform (DFT), whose efficient implementation is the fast Fourier transform (FFT). The DFT projects the input vector onto an ordered set of orthogonal basis vectors. The output vector of the DFT corresponds to the ordered set of inner products between the input vector and the ordered set of orthogonal basis vectors. The standard DFT uses orthogonal basis vectors that are derived from a family of complex exponentials. To reconstruct the input vector from the DFT output vector, one must sum over the projections along the set of orthogonal basis vectors, scaled by the appropriate normalization factor.
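The view of the DFT as a set of inner products can be made concrete with a short sketch. The helper names `basis`, `inner`, `dft` and `idft` are hypothetical, and the normalization (division by the vector length on reconstruction) is the common convention, assumed here for illustration.

```python
import cmath
import math

def basis(n):
    """Ordered set of n complex-exponential basis vectors of length n."""
    return [[cmath.exp(2j * math.pi * k * t / n) for t in range(n)]
            for k in range(n)]

def inner(a, b):
    """Inner product <a, b>, conjugating the second argument."""
    return sum(x * y.conjugate() for x, y in zip(a, b))

def dft(x):
    """DFT output: inner products of x with each basis vector, in order."""
    return [inner(x, b) for b in basis(len(x))]

def idft(spectrum):
    """Reconstruction: sum of the projections along the basis vectors, scaled by 1/n."""
    n = len(spectrum)
    vectors = basis(n)
    return [sum(spectrum[k] * vectors[k][t] for k in range(n)) / n
            for t in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
spectrum = dft(x)
rebuilt = idft(spectrum)

vectors = basis(4)
cross = inner(vectors[1], vectors[2])   # distinct basis vectors: approximately 0
norm = inner(vectors[1], vectors[1])    # squared length of a basis vector: 4
```

Because each basis vector has squared length n rather than 1, the basis is orthogonal but not orthonormal, which is why the reconstruction divides by n.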
If the magnitude and phase spectra are well defined, it is possible to construct a complex spectrum that can be converted to a short-time speech waveform representation by means of an inverse fast Fourier transform (IFFT). The final speech waveform is then generated by overlapping and adding (OLA) the short-time speech waveforms.
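A minimal overlap-add sketch is given below. It assumes a periodic Hann analysis window with a hop of half the frame length and no synthesis window, a combination for which overlapping windows sum to one; all names and sizes are illustrative, and a naive inverse DFT stands in for the IFFT.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window; shifted copies at hop n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Inverse DFT, keeping the real part (the input frames were real)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def overlap_add(frames, hop):
    """Add each short-time waveform into the output at its frame offset."""
    n = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for i, frame in enumerate(frames):
        for t, value in enumerate(frame):
            out[i * hop + t] += value
    return out

frame_len, hop = 8, 4
window = hann(frame_len)
signal = [math.sin(0.3 * t) for t in range(32)]

spectra = [dft([signal[s + i] * window[i] for i in range(frame_len)])
           for s in range(0, len(signal) - frame_len + 1, hop)]
frames = [idft_real(s) for s in spectra]
rebuilt = overlap_add(frames, hop)
```

Only the interior samples are reconstructed exactly; the first and last `hop` samples are covered by a single window and remain attenuated.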
Signal and speech enhancement describes a set of methods or techniques that are used to improve one or more speech related perceptual aspects for a human listener. A very basic system for speech enhancement, in terms of reducing echo and background noise, consists of an adaptive echo cancellation filter and a so-called post filter for noise and residual echo suppression. Both filters operate in the time domain.
A basic structure of such a system is depicted in FIG. 1. A loudspeaker 100 plays a signal 102 of a remote communication partner or signals (prompts) of a speech dialog system (not shown). A microphone 104 records a speech signal of a local speaker 106. Besides the speech components of the local speaker 106, the microphone 104 also picks up echo components originating from the loudspeaker 100 and background noise.
To get rid of the undesired components (echo and noise), adaptive filters are used. An echo cancellation filter 108 is excited with the same signal 102 that drives the loudspeaker 100, and its coefficients are adjusted such that the filter's impulse response models the loudspeaker-room-microphone system 109. If the model fits the real system 109, the filter output 110 is a good estimate of the echo components in the microphone signal 112, and echo reduction can be achieved by subtracting the estimated echo components 110 from the microphone signal 112.
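The adaptation of a time-domain echo cancellation filter can be sketched with a normalized least mean square (NLMS) update. The sketch is not the system of FIG. 1: the three-tap room response, the step size `mu`, and the function names are assumptions for illustration, and no local speech or background noise is simulated.

```python
import random

def simulate_echo(far_end, room):
    """Convolve the loudspeaker signal with a hypothetical room impulse response."""
    return [sum(h * far_end[n - k] for k, h in enumerate(room) if n - k >= 0)
            for n in range(len(far_end))]

def nlms_echo_canceller(far_end, mic, num_taps, mu=0.5, eps=1e-8):
    """Adapt FIR coefficients so that the filter output estimates the echo."""
    coeffs = [0.0] * num_taps
    history = [0.0] * num_taps          # most recent loudspeaker samples
    errors = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(c * h for c, h in zip(coeffs, history))
        e = d - echo_estimate           # microphone signal minus estimated echo
        power = sum(h * h for h in history) + eps
        coeffs = [c + mu * e * h / power for c, h in zip(coeffs, history)]
        errors.append(e)
    return coeffs, errors

rng = random.Random(0)
far_end = [rng.gauss(0.0, 1.0) for _ in range(2000)]
room = [0.5, -0.3, 0.1]                 # assumed loudspeaker-room-microphone path
mic = simulate_echo(far_end, room)
coeffs, errors = nlms_echo_canceller(far_end, mic, num_taps=3)
```

In this noise-free setting the coefficients converge to the assumed room response and the error signal decays towards zero. Real rooms require filters with hundreds or thousands of taps, which is the source of the computational load of pure time-domain processing.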
Afterwards, a filter 114 in the signal path of the speech enhancement system can be used to reduce the background noise as well as remaining echo components. The filter adjusts its filter coefficients periodically and needs, therefore, estimated power spectral densities of the background noise and of the residual echo components. Finally, some further signal processing 116 might be applied, such as automatic gain control or a limiter.
The speech enhancement system with all components operating in the time domain has the advantage of introducing only very little delay, mainly caused by the noise and residual echo suppression filter 114. The drawback of this system is the very high computational load that is caused by pure time domain processing.
The computational complexity can be reduced by a large amount (reductions of 50 to 75 percent are possible, depending on the individual setup) by using frequency domain or sub-band domain processing, as shown in FIG. 2. For such systems, all input signals 200 and 202 are transformed periodically into, e.g., the short-term Fourier domain by means of analysis filter banks 204 and 206, and all output signals are transformed back into the time domain by means of a synthesis filter bank 208. Echo reduction can be achieved by estimating echo portions 210 (filter coefficients) in the frequency domain and by subtracting (removing) the estimated echo 212 from the spectra 214 of the input signal 202 (microphone). Sub-band components of the spectra 212 of the echo signal can be estimated by weighting the (adaptively adjusted) filter coefficients with the sub-band components in the spectra 216 of the loudspeaker signal 200. Typical adaptation algorithms for adaptively adjusted filter coefficients are the least mean square algorithm (LMS), normalized least mean square algorithm (NLMS), recursive least squares algorithm (RLS) or affine projection algorithms (see E. Hänsler, G. Schmidt: Acoustic Echo and Noise Control, Wiley, 2004, hereinafter referred to as “Hänsler”). Echo reduction is achieved by subtracting the estimated echo sub-band components 212 from the microphone sub-band components 214. Finally, the echo reduced spectra are transformed 208 back into the time domain, where overlapping of the calculated time series depends on the overlapping (sub-sampling) applied to the original signal waveform when the spectra were created.
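The sub-band estimation and subtraction can be sketched per frame as follows. This is a simplified illustration, not the structure of FIG. 2: it assumes a single complex coefficient per sub-band, an NLMS update, and synthetic spectra in which the echo path is a fixed complex gain per sub-band; practical systems typically use several coefficients per sub-band and are affected by aliasing. The name `subband_nlms_step` and all parameter values are hypothetical.

```python
import random

def subband_nlms_step(weights, ref_spec, mic_spec, mu=0.5, eps=1e-12):
    """Process one frame: estimate the echo per sub-band, subtract, adapt."""
    new_weights, error_spec = [], []
    for w, x, y in zip(weights, ref_spec, mic_spec):
        echo_estimate = w * x                 # coefficient weighted with loudspeaker sub-band
        e = y - echo_estimate                 # echo-reduced microphone sub-band
        w = w + mu * e * x.conjugate() / (abs(x) ** 2 + eps)
        new_weights.append(w)
        error_spec.append(e)
    return new_weights, error_spec

rng = random.Random(1)
num_bands = 4
# Assumed echo path: one fixed complex gain per sub-band.
true_path = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
             for _ in range(num_bands)]

weights = [0j] * num_bands
for _ in range(100):
    ref = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_bands)]
    mic = [h * x for h, x in zip(true_path, ref)]   # echo only, no local speech
    weights, error_spec = subband_nlms_step(weights, ref, mic)
```

Each sub-band is adapted independently and only once per frame, which is where the saving relative to a sample-by-sample time-domain filter comes from.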
The complexity reduction comes from sub-sampling that is applied within the analysis filter banks. The highest reduction is achieved if the so-called sub-sampling rate is equal to the number of frequency supporting points (sub-bands) that are generated by the filter bank. However, as described by Hänsler, larger sub-sampling rates cause larger so-called aliasing terms that limit performance of echo cancellation filters. In digital signal processing and related disciplines, aliasing refers to an effect that causes different spectral components to become indistinguishable (or aliases of one another) when a corresponding time signal is sampled or sub-sampled.
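The aliasing effect itself can be illustrated with a small sketch. The rates and frequencies are arbitrary values chosen for the example: two cosines whose frequencies differ by exactly the sampling rate yield the same sample values and can no longer be told apart.

```python
import math

def sample_cosine(freq_hz, rate_hz, count):
    """Sample cos(2*pi*f*t) at t = n / rate_hz for n = 0 .. count-1."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

rate_hz = 8.0
low = sample_cosine(1.0, rate_hz, 16)
high = sample_cosine(9.0, rate_hz, 16)   # 9 Hz = 1 Hz + sampling rate

# Largest difference between the two sampled sequences (rounding error only).
max_diff = max(abs(a - b) for a, b in zip(low, high))
```

Sub-sampling a sub-band signal lowers its effective sampling rate in the same way, so components outside the retained band fold back onto the desired ones.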
Due to sub-sampling, an echo cancellation filter is excited with several shifted and weighted versions of a spectrum, where only one of them is the desired one. The undesired spectra hinder the adaptation of the filter. To demonstrate that behavior, two measurements are presented in FIG. 3. The loudspeaker emits white noise for these measurements (signal 300). A Hann-windowed FFT of size 256 was used in both measurements. The microphone output (the output without echo cancellation) was normalized to have a short-term power of about 0 dB. Since no local signals are used during the measurements, the aim of echo cancellation is to reduce the output signal after subtracting the estimated echo component (this signal is called the error signal) as much as possible.
If the sub-sampling rate is chosen to be 64 (a quarter of the FFT size), good echo cancellation performance can be measured (signal 304 of FIG. 3). After convergence, about 40 dB of echo reduction can be achieved, which is usually more than sufficient (about 30 dB is typically enough). This setup is able to reduce the computational complexity by a large amount; however, for several applications, even higher reductions are necessary. If the sub-sampling rate is increased to 128 (half of the FFT size), the computational complexity of the system is reduced by a factor of 2 compared to the setup with a sub-sampling rate of 64. However, the performance (signal 302 in FIG. 3) is then not sufficient (only about 8 dB of echo reduction can be achieved). The reason for that limitation is the increased aliasing terms, as noted by Hänsler.
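The two quantities compared above, echo reduction in dB and the number of frames to be processed, can be computed as in the following sketch. The function names and the test signals are assumptions for illustration; the signals are synthetic and are not the measurements of FIG. 3.

```python
import math

def echo_reduction_db(mic, error):
    """Ratio of microphone power to error-signal power, in dB."""
    p_mic = sum(v * v for v in mic) / len(mic)
    p_err = sum(v * v for v in error) / len(error)
    return 10.0 * math.log10(p_mic / p_err)

def frame_count(num_samples, fft_size, hop):
    """Number of complete frames obtained with the given sub-sampling rate."""
    return (num_samples - fft_size) // hop + 1

# Synthetic example: the error amplitude is 1 percent of the microphone amplitude.
mic = [1.0, -1.0] * 50
error = [0.01 * v for v in mic]
reduction = echo_reduction_db(mic, error)        # approximately 40 dB

frames_hop_64 = frame_count(25600, 256, 64)
frames_hop_128 = frame_count(25600, 256, 128)
```

Doubling the sub-sampling rate roughly halves the number of frames, and with it the number of transforms and filter updates per unit of time.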
Up to now, two extensions are known that allow the aliasing terms to be reduced and thus the sub-sampling rate to be increased. The first extension is to use better filter banks, such as polyphase filter banks. Instead of using a simple window, such as a Hann or a Hamming window, a longer so-called low-pass prototype filter can be applied. The order of this filter is a multiple of the FFT size, and the filter can achieve arbitrarily small aliasing components (depending on the filter length). As a result, very high sub-sampling rates (they can be chosen close to the FFT order) and thus also a very low computational complexity can be achieved. However, the drawback of this solution is an increase in the delay that the analysis and the synthesis filter banks introduce. This delay is usually much higher than recommended by ITU-T and ETSI. As a result, polyphase filter banks are able to reduce the computational complexity but, because of the increased delay they introduce, they can be applied in only a few selected applications.
The second extension is to perform the FFT of the reference signal more often, compared to all other FFTs and IFFTs. This also helps to reduce the aliasing terms, now without any additional delay. With this method, the performance of the echo cancellation is not as good as with a conventional setup, i.e., with a small sub-sampling rate, but a sufficient echo reduction can be achieved, as disclosed in EP 1936939 A1.
A comparison of the conventional method as well as of the two extensions can be found in P. Hannon, M. Krini, G. Schmidt, A. Wolf: Reducing the Complexity or the Delay of Adaptive Sub-band Filtering, Proc. ESSV 2010, Berlin, Germany, 2010.
EP 1927981 A1 describes a second method which also has some relevance. With a standard short-term frequency analysis, such as a 256-FFT using a Hann window in applications such as hands-free telephone systems, a frequency resolution of about 43 Hz (distance between two adjacent (neighboring) sub-bands/frequency supporting points) can be achieved at a sampling rate of 11,025 Hz. Due to the windowing, adjacent sub-bands are not independent of each other, and the real resolution is much lower. With the described refinement method, it is possible to achieve an enhanced frequency resolution of windowed speech signals, either by reducing the spectral overlap of adjacent sub-bands or by inserting additional frequency supporting points in between. As an example, a 512-FFT short-term spectrum (high FFT order) is determined out of a few previous 256-FFT short-term spectra (low FFT order). Computing additional frequency supporting points can improve, e.g., pitch estimation schemes or noise suppression algorithms. For echo cancellation purposes, this method improves neither the speed of convergence nor the steady state performance.
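The quoted spacing of frequency supporting points follows from dividing the sampling rate by the FFT size, as the short sketch below shows; the function name is hypothetical.

```python
def bin_spacing_hz(sample_rate_hz, fft_size):
    """Distance between adjacent frequency supporting points."""
    return sample_rate_hz / fft_size

spacing_256 = bin_spacing_hz(11025, 256)   # about 43.07 Hz
spacing_512 = bin_spacing_hz(11025, 512)   # about 21.53 Hz
```

Doubling the FFT order from 256 to 512 halves the spacing, which corresponds to inserting one additional supporting point between each pair of adjacent sub-bands.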
In view of the foregoing, a need exists to reduce the computational complexity of frequency domain or sub-band domain based speech enhancement systems that include echo cancellation filters.