In speech coding systems, reducing background noise in speech signals is a primary means of improving the quality of the processed speech. This is particularly true at low signal-to-background-noise ratios. A typical speech coding system comprises an encoder, a transmission channel, and a decoder. Parameters for synthesizing speech signals are transmitted from the encoder over the transmission channel to the decoder. The decoder then uses the parameters to synthesize the desired speech signal.
In wireless communications systems, the most common speech coders use linear predictive methods. One example of a linear predictive method is Code Excited Linear Prediction (CELP). A general diagram of a CELP encoder 100 is shown in FIG. 1A. A CELP encoder uses a model of the human vocal tract in order to reproduce a speech input signal. The parameters for the model are extracted from the speech signal being reproduced, and it is these parameters that are sent to a decoder 112, which is illustrated in FIG. 1B. Decoder 112 uses the parameters in order to reproduce the speech signal. Referring to FIG. 1A, synthesis filter 104 is a linear predictive filter and serves as the vocal tract model for CELP encoder 100. Synthesis filter 104 takes an input excitation signal μ(n) and synthesizes an estimate of speech input s(n) by modeling the correlations introduced into speech by the vocal tract and applying them to the excitation signal μ(n).
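The role of synthesis filter 104 can be illustrated with a minimal Python sketch. It assumes an all-pole recursion of the form s′(n) = μ(n) + Σ a_k·s′(n−k); the sign convention, the function name `synthesize`, and the coefficient list `lpc_coeffs` are illustrative assumptions and are not taken from the figures.

```python
def synthesize(excitation, lpc_coeffs):
    """All-pole synthesis sketch: s'(n) = u(n) + sum_k a_k * s'(n - k).

    excitation -- samples of the excitation signal for one frame
    lpc_coeffs -- assumed predictor coefficients a_1..a_p (hypothetical values)
    """
    out = []
    for n, u in enumerate(excitation):
        acc = u
        # Add the contribution of previously synthesized samples.
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


# A unit impulse through a first-order filter decays geometrically.
impulse_response = synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

With the single coefficient 0.5, each output sample is half the previous one, which shows how the filter imposes sample-to-sample correlation on an otherwise uncorrelated excitation.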
In CELP encoder 100, speech is broken up into frames, usually 20 ms each, and parameters for synthesis filter 104 are determined for each frame. Once the parameters are determined, an excitation signal μ(n) is chosen for that frame. The excitation signal is then passed through synthesis filter 104, producing a synthesized speech signal s′(n). The synthesized frame s′(n) is then compared to the actual speech input frame s(n), and a difference or error signal e(n) is generated by subtractor 106. The subtraction function is typically accomplished via an adder or similar functional component, as those skilled in the art will be aware. In practice, excitation signal μ(n) is generated from a predetermined set of possible signals by excitation generator 102. In CELP encoder 100, all possible signals in the predetermined set are tried in order to find the one that produces the smallest error signal e(n). Once this particular excitation signal μ(n) is found, the signal and the corresponding filter parameters are sent to decoder 112 (FIG. 1B), which reproduces the synthesized speech signal s′(n). Signal s′(n) is reproduced in decoder 112 by taking an excitation signal μ(n), as generated by decoder excitation generator 114, and passing it through decoder synthesis filter 116.
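The exhaustive search over the predetermined set of excitation signals can be sketched as follows. This is a simplified illustration only: it assumes the error is measured as the sum of squared differences over one frame, it omits any gain term or error weighting, and the names `search_codebook`, `codebook`, and `lpc_coeffs` are hypothetical.

```python
def synthesize(excitation, lpc_coeffs):
    """Assumed all-pole synthesis: s'(n) = u(n) + sum_k a_k * s'(n - k)."""
    out = []
    for n, u in enumerate(excitation):
        acc = u
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc_coeffs):
    """Try every candidate excitation; return (index, error energy) of the best."""
    best_index, best_energy = -1, float("inf")
    for index, candidate in enumerate(codebook):
        synth = synthesize(candidate, lpc_coeffs)
        # Energy of e(n) = s(n) - s'(n) over the frame.
        energy = sum((s - y) ** 2 for s, y in zip(target, synth))
        if energy < best_energy:
            best_index, best_energy = index, energy
    return best_index, best_energy


# Hypothetical three-entry codebook and a target built from its last entry.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
target = synthesize(codebook[2], [0.5])
best = search_codebook(target, codebook, [0.5])
```

Because the target frame here was produced from the third codebook entry, the search selects that entry with zero error energy; with real speech the minimum is generally nonzero.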
By choosing the excitation signal that produces the smallest error signal e(n), a very good approximation of speech input s(n) can be reproduced in decoder 112. The spectrum of error signal e(n), however, will be very flat, as illustrated by curve 204 in FIG. 2. The flatness can create problems in that the signal-to-noise ratio (SNR), with regard to synthesized speech signal s′(n) (curve 202), may become too small for effective reproduction of speech signal s(n). This problem is especially prevalent in the higher frequencies where, as illustrated in FIG. 2, there is typically less energy in the spectrum of s′(n). In order to combat this problem, CELP encoder 100 includes a feedback path that incorporates error weighting filter 108. The function of error weighting filter 108 is to shape the spectrum of error signal e(n) so that the noise spectrum is concentrated in areas of high voice content. In effect, the shape of the noise spectrum associated with the weighted error signal ew(n) tracks the spectrum of the synthesized speech signal s′(n), as illustrated in FIG. 2 by curve 206. In this manner, the SNR is improved and the perceptual quality of the reproduced speech is increased.
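One commonly used form of error weighting filter, given here as an assumption since the exact form of filter 108 is not specified above, is W(z) = A(z/γ1)/A(z/γ2), where A(z) = 1 − Σ a_k·z^(−k) is derived from the synthesis filter coefficients. The Python sketch below applies such a filter to an error frame; the function name `weight_error` and the default values of `gamma_num` and `gamma_den` are illustrative choices.

```python
def weight_error(error, lpc_coeffs, gamma_num=1.0, gamma_den=0.8):
    """Apply an assumed weighting filter W(z) = A(z/gamma_num) / A(z/gamma_den).

    With A(z) = 1 - sum_k a_k z^-k, the recursion is
    ew(n) = e(n) - sum_k a_k*gamma_num^k * e(n-k)
                 + sum_k a_k*gamma_den^k * ew(n-k).
    """
    num = [a * gamma_num ** k for k, a in enumerate(lpc_coeffs, start=1)]
    den = [a * gamma_den ** k for k, a in enumerate(lpc_coeffs, start=1)]
    out = []
    for n, e in enumerate(error):
        acc = e
        for k in range(1, len(lpc_coeffs) + 1):
            if n - k >= 0:
                acc -= num[k - 1] * error[n - k]  # zeros from A(z/gamma_num)
                acc += den[k - 1] * out[n - k]    # poles from 1/A(z/gamma_den)
        out.append(acc)
    return out


# Impulse response of the weighting filter for one hypothetical coefficient.
weighted = weight_error([1.0, 0.0, 0.0], [0.5], gamma_num=1.0, gamma_den=0.5)
```

When the two bandwidth-expansion factors are equal, the numerator and denominator cancel and the error passes through unchanged; moving them apart de-emphasizes the error near the spectral peaks of the speech, so that more of the noise is allowed to fall where the speech energy is high.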
If, however, speech input s(n) is noisy, then some type of noise reduction must be performed on speech input s(n) to maintain an adequate quality of voice reproduction in decoder 112. Traditional noise suppressors can reduce the background noise considerably, but they also distort the speech signal because they substantially modify its spectral envelope. As a result, the perceptual naturalness of the voiced speech signal is reduced, sometimes significantly. The requirement for noise suppression and the requirement for perceptually natural voiced signals are therefore difficult to satisfy simultaneously.