In linear predictive speech coders such as Code Excited Linear Prediction (CELP) speech coders, the incoming original speech signal is typically divided into blocks called frames. A typical frame length is 20 milliseconds, or 160 samples at an 8 kHz sampling rate, and this frame length is commonly used in, for example, conventional telephony bandwidth cellular applications. The frames are typically divided further into subframes, which often have a length of 5 milliseconds, or 40 samples.
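The frame and subframe arithmetic above can be illustrated with a minimal sketch. The 8 kHz sampling rate is inferred from the stated figures (160 samples in 20 milliseconds); the function name `split_into_frames` and the choice to drop a trailing partial frame are assumptions made only for illustration.

```python
# Illustrative sketch only: frame/subframe segmentation for a telephony-band coder.
SAMPLE_RATE_HZ = 8000   # assumed: 160 samples / 20 ms
FRAME_MS = 20
SUBFRAME_MS = 5

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000        # 160 samples
SUBFRAME_LEN = SAMPLE_RATE_HZ * SUBFRAME_MS // 1000  # 40 samples


def split_into_frames(signal):
    """Split a sequence of samples into frames, each a list of subframes.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(signal) // FRAME_LEN
    frames = []
    for f in range(n_frames):
        frame = signal[f * FRAME_LEN:(f + 1) * FRAME_LEN]
        subframes = [frame[s:s + SUBFRAME_LEN]
                     for s in range(0, FRAME_LEN, SUBFRAME_LEN)]
        frames.append(subframes)
    return frames
```

With these lengths, each frame holds four subframes, so per-frame parameters are updated every 20 milliseconds and per-subframe parameters every 5 milliseconds.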
In conventional speech coders such as those mentioned above, parameters describing the vocal tract, pitch, and other features are extracted from the original speech signal during the speech encoding process. Parameters that vary slowly are computed on a frame-by-frame basis. Examples of such slowly varying parameters include the so-called short term predictor (STP) parameters that describe the vocal tract. The STP parameters define the filter coefficients of the synthesis filter in linear predictive speech coders. Parameters that vary more rapidly, for example the pitch and the innovation shape and innovation gain parameters, are typically computed for every subframe.
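By way of illustration, one common way to obtain short term predictor coefficients for a frame is the autocorrelation method with the Levinson-Durbin recursion. The text above does not specify how the STP parameters are computed, so the following is a sketch under that assumption; the function names and the predictor convention x[n] ~ sum over k of a[k] * x[n-k] are likewise illustrative.

```python
# Illustrative sketch only: per-frame short term predictor (STP) analysis
# using the autocorrelation method and the Levinson-Durbin recursion.

def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order].

    Convention (assumed): x[n] is predicted as sum_k a[k] * x[n - k].
    Returns (coefficients, residual_energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame: stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

In a coder of the kind described, such an analysis would run once per frame, for example as `levinson_durbin(autocorrelation(frame, 10), 10)`, where the order of 10 is an assumed value typical of telephony-band coders and is not taken from the text.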
After the parameters have been computed, they are then quantized. The STP parameters are often transformed to a representation more suitable for quantization such as a line spectrum frequency (LSF) representation. The transformation of STP parameters into LSF representation is well known in the art.
Once the parameters have been quantized, error control coding and checksum information are added prior to interleaving and modulation of the parameter information. The parameter information is then transmitted across a communication channel to a receiver, wherein a speech decoder performs essentially the inverse of the above-described speech encoding procedure in order to synthesize a speech signal which closely resembles the original speech signal. In the speech decoder, postfiltering is commonly applied to the synthesized speech signal to enhance the perceived quality of the signal.
Speech coders which use linear predictive models such as the CELP model are typically very carefully adapted to the coding of speech, so the synthesis or reproduction of non-speech signals such as background noise is often poor in such coders. Even under clean channel conditions, background noise is often perceived by the listener at the receiver as a fluctuating and unsteady noise. In CELP coders, the reason for this problem is mainly the mean squared error (MSE) criterion conventionally used in the analysis-by-synthesis loop in combination with poor correlation between the target and synthesized signals. Under poor channel conditions, for example when the quantized parameter information is distorted by channel errors, the reproduction of background noise deteriorates even more, because the level of the background noise fluctuates greatly. This is perceived by the listener as very annoying because the background noise level is expected to vary quite slowly.
One solution for improving the perceived quality of background noise in both clean and noisy channel conditions could include the use of voice activity detectors (VADs) which make a hard (e.g., yes or no) decision regarding whether the signal that is being coded is speech or non-speech. Based on the hard decision, different processing techniques can be applied in the decoder. For example, if the decision is non-speech, then the decoder can assume that the signal is background noise, and can operate to smooth out the spectral variations in the background noise. However, this hard decision technique disadvantageously permits the listener to hear the decoder switch between speech processing actions and non-speech processing actions.
In addition to the aforementioned problems, the reproduction of background noise is degraded even more at lowered bit rates (for example, below 8 kb/s). Under bad channel conditions at lowered bit rates, the background noise is often heard as a fluttering effect caused by unnatural variations in the level of the decoded background noise.
It is therefore desirable to provide for reproduction of background noise in a linear predictive speech decoder such as a CELP decoder, while avoiding the aforementioned undesirable listener perceptions of the background noise.
The present invention provides improved reproduction of background noise. The decoder is capable of gradually (or softly) increasing or decreasing the application of energy contour smoothing to the signal that is being reconstructed. Thus, the problem of background noise reproduction can be addressed by smoothing the energy contour without the disadvantage of a perceptible activation/deactivation of the energy contour smoothing operations.
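The contrast between a hard speech/non-speech decision and a gradual application of energy contour smoothing can be sketched as follows. This is a simplified illustration and not the claimed method: the per-frame mixing factor `k`, the first-order running average, and the constant `alpha` are all hypothetical choices. With `k` restricted to 0 or 1 the sketch behaves like a hard switch; intermediate values blend the raw and smoothed energies.

```python
# Illustrative sketch only: soft (gradual) application of energy contour
# smoothing. The mixing factor k and the running-average smoother are
# assumptions, not details taken from the text.

def frame_energy(frame):
    """Mean energy of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def smooth_energy_contour(energies, mix_factors, alpha=0.9):
    """Blend each frame energy with a running average of past energies.

    mix_factors[i] = 0.0 leaves frame i untouched; 1.0 replaces its energy
    with the running average; values in between apply smoothing partially.
    """
    out = []
    running = energies[0]
    for e, k in zip(energies, mix_factors):
        running = alpha * running + (1.0 - alpha) * e
        out.append((1.0 - k) * e + k * running)
    return out


def apply_gain(frame, target_energy):
    """Scale a frame so that its mean energy equals target_energy."""
    e = frame_energy(frame)
    if e == 0.0:
        return list(frame)  # silent frame: nothing to scale
    gain = (target_energy / e) ** 0.5
    return [gain * x for x in frame]
```

Because `k` may change by a small amount from one frame to the next, the transition between unsmoothed and fully smoothed output is spread over several frames instead of occurring at a single audible switching point.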