Speech coders and decoders are conventionally provided in radio transmitters and radio receivers, respectively, and are cooperable to permit speech communications between a given transmitter and receiver over a radio link. The combination of a speech coder and a speech decoder is often referred to as a speech codec. A mobile radiotelephone (e.g., a cellular telephone) is an example of a conventional communication device that typically includes a radio transmitter having a speech coder, and a radio receiver having a speech decoder.
In conventional block-based speech coders, the incoming speech signal is divided into blocks called frames. For common 4 kHz telephony bandwidth applications, typical frame lengths are 20 ms, which corresponds to 160 samples at an 8 kHz sampling rate. The frames are further divided into subframes, typically of length 5 ms (40 samples).
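The frame and subframe division described above can be sketched as follows. This is a minimal illustration only: the 8 kHz sampling rate is inferred from 160 samples per 20 ms, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # assumed: 160 samples per 20 ms implies 8 kHz sampling
FRAME_MS = 20
SUBFRAME_MS = 5

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000        # samples per frame
SUBFRAME_LEN = SAMPLE_RATE_HZ * SUBFRAME_MS // 1000  # samples per subframe


def split_into_frames(signal, frame_len=FRAME_LEN):
    """Return consecutive full frames; a trailing partial frame is dropped."""
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def split_into_subframes(frame, subframe_len=SUBFRAME_LEN):
    """Split one frame into consecutive subframes."""
    return [frame[i:i + subframe_len] for i in range(0, len(frame), subframe_len)]
```

For example, splitting 400 samples yields two full frames, and each frame yields four subframes.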
Conventional linear predictive analysis-by-synthesis (LPAS) coders use models related to speech production. From the input speech signal, model parameters describing the vocal tract, the pitch, and so on are extracted. Parameters that vary slowly are typically computed for every frame. Examples of such parameters include the STP (short-term prediction) parameters that describe the vocal tract in the apparatus that produced the speech. One example of STP parameters is linear prediction coefficients (LPC) that represent the spectral shape of the input speech signal. Examples of parameters that vary more rapidly include the pitch and innovation shape/gain parameters, which are typically computed every subframe.
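As an illustration of how linear prediction coefficients can be obtained for one frame, the following sketch uses the autocorrelation method with the Levinson-Durbin recursion. This is one well-known technique, not necessarily the one used by any particular coder; windowing and bandwidth expansion are omitted, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the (unnormalized) autocorrelation of a frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients.

    Returns (a, err), where the prediction is
    s_hat(n) = sum over k = 1..order of a[k - 1] * s(n - k)
    and err is the residual prediction error energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) input; keep remaining coefficients at 0
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err  # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

A typical order for 8 kHz speech is around 10, so a call might look like `levinson_durbin(autocorrelation(frame, 10), 10)`.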
The extracted parameters are quantized using suitable well-known scalar and vector quantization techniques. The STP parameters, for example linear prediction coefficients, are often transformed to a representation more suited for quantization such as Line Spectral Frequencies (LSFs). After quantization, the parameters are transmitted over the communication channel to the decoder.
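The simplest form of scalar quantization can be sketched as a uniform quantizer over a fixed range. This is an illustrative assumption only: practical coders typically use trained scalar or vector codebooks, and the range and bit allocation below are hypothetical.

```python
def quantize_uniform(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer index using 2**bits uniform levels."""
    levels = (1 << bits) - 1
    x = min(max(x, lo), hi)  # clamp out-of-range values
    return round((x - lo) / (hi - lo) * levels)


def dequantize_uniform(index, lo, hi, bits):
    """Reconstruct the value represented by a quantizer index."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)
```

Only the index is transmitted; the decoder applies `dequantize_uniform` with the same range and bit count, so the reconstruction error is at most half a quantization step for in-range inputs.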
In a conventional LPAS decoder, the above operations are generally reversed: the received parameters are decoded and the speech signal is synthesized from them. Postfiltering techniques are usually applied to the synthesized speech signal to enhance the perceived quality.
For many common background noise types a much lower bit rate than is needed for speech provides a good enough model of the signal. Existing mobile systems make use of this fact by adjusting the transmitted bit rate accordingly during background noise. In conventional systems using continuous transmission techniques, a variable rate (VR) speech coder may use its lowest bit rate. In conventional Discontinuous Transmission (DTX) schemes, the transmitter stops sending coded speech frames when the speaker is inactive. At regular or irregular intervals (typically every 500 ms), the transmitter sends speech parameters suitable for generation of comfort noise in the decoder. These parameters for comfort noise generation (CNG) are conventionally coded into what is sometimes called Silence Descriptor (SID) frames. At the receiver, the decoder uses the comfort noise parameters received in the SID frames to synthesize artificial noise by means of a conventional comfort noise injection (CNI) algorithm.
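The transmit-side DTX behavior described above can be sketched as a per-frame decision. This is a simplified illustration: the voice activity flags are assumed to be given, the 25-frame SID interval assumes 20 ms frames and a 500 ms update period, and hangover logic found in practical systems is omitted. The labels and function name are hypothetical.

```python
SID_INTERVAL_FRAMES = 25  # assumed: 500 ms update period / 20 ms frames


def dtx_schedule(vad_flags, sid_interval=SID_INTERVAL_FRAMES):
    """Return one of "SPEECH", "SID" or "NO_DATA" for each frame."""
    decisions = []
    frames_since_sid = None  # None while the speaker is active
    for active in vad_flags:
        if active:
            decisions.append("SPEECH")
            frames_since_sid = None
        elif frames_since_sid is None or frames_since_sid + 1 >= sid_interval:
            # First inactive frame, or the update period has elapsed.
            decisions.append("SID")
            frames_since_sid = 0
        else:
            decisions.append("NO_DATA")  # transmitter stays silent
            frames_since_sid += 1
    return decisions
```

With this schedule, a long inactive period produces one SID frame at its start and one more every `sid_interval` frames, with nothing transmitted in between.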
When comfort noise is generated in the decoder in a conventional DTX system, the noise is often perceived as being very static and much different from the background noise generated in active (non-DTX) mode. The reason for this perception is that DTX SID frames are not sent to the receiver as often as normal speech frames. In LPAS codecs having a DTX mode, the spectrum and energy of the background noise are typically estimated (for example, averaged) over several frames, and the estimated parameters are then quantized and transmitted over the channel to the decoder. FIG. 1 illustrates an exemplary prior art comfort noise encoder that produces the aforementioned estimated background noise (comfort noise) parameters. The quantized comfort noise parameters are typically sent every 100 to 500 ms.
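The averaging of spectrum and energy over several frames can be sketched as follows. This is a minimal illustration assuming a plain arithmetic mean over a history of per-frame spectral parameter vectors (for example LSFs) and per-frame energies; the function names and the choice of averaging domain are assumptions, not details taken from FIG. 1.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def average_cn_parameters(spectrum_history, energy_history):
    """Average per-frame spectral vectors and energies over the history.

    spectrum_history: list of equal-length parameter vectors, one per frame.
    energy_history:   list of frame energies, one per frame.
    Returns (mean_spectrum, mean_energy), which would then be quantized
    and sent in a SID frame.
    """
    n = len(spectrum_history)
    order = len(spectrum_history[0])
    mean_spectrum = [sum(v[k] for v in spectrum_history) / n for k in range(order)]
    mean_energy = sum(energy_history) / len(energy_history)
    return mean_spectrum, mean_energy
```

Because a single averaged parameter set stands in for many frames, short-term variations of the actual background noise are lost before transmission.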
The benefit of sending SID frames with a low update rate instead of sending regular speech frames is twofold. The battery life in, for example, a mobile radio transceiver, is extended due to lower power consumption, and the interference created by the transmitter is lowered thereby providing higher system capacity.
In a conventional decoder, the comfort noise parameters can be received and decoded as shown in FIG. 2. Because the decoder does not receive new comfort noise parameters as often as it normally receives speech parameters, the comfort noise parameters which are received in the SID frames are typically interpolated at 23 to provide a smooth evolution of the parameters in the comfort noise synthesis. In the synthesis operation, shown generally at 25, the decoder inputs to the synthesis filter 27 a gain-scaled random noise (e.g., white noise) excitation and the interpolated spectrum parameters. As a result, the generated comfort noise sc(n) will be perceived as highly stationary (“static”), regardless of whether the background noise s(n) at the encoder end (see FIG. 1) is changing in character. This problem is more pronounced in backgrounds with strong variability, such as street noise and babble (e.g., restaurant noise), but is also present in car noise situations.
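The conventional decoder-side operations described above, parameter interpolation followed by synthesis filtering of gain-scaled white noise, can be sketched as follows. This is a simplified illustration, not the structure of FIG. 2: linear interpolation and an all-pole filter with predictor coefficients are assumed, and the function names are hypothetical. In practice, interpolation is usually done in a representation such as LSFs before conversion to filter coefficients.

```python
import random


def interpolate(prev, new, step, n_steps):
    """Linearly interpolate from prev (step 0) to new (step n_steps)."""
    w = step / n_steps
    return [(1.0 - w) * p + w * q for p, q in zip(prev, new)]


def synthesize_comfort_noise(lpc, gain, n_samples, state=None, rng=random):
    """Drive an all-pole synthesis filter with gain-scaled white noise.

    Implements s_c(n) = gain * e(n) + sum over k of lpc[k - 1] * s_c(n - k),
    where e(n) is unit-variance Gaussian noise.
    Returns (samples, filter_state) so consecutive frames can be chained.
    """
    order = len(lpc)
    history = list(state) if state is not None else [0.0] * order  # newest first
    out = []
    for _ in range(n_samples):
        excitation = gain * rng.gauss(0.0, 1.0)
        sample = excitation + sum(a * h for a, h in zip(lpc, history))
        out.append(sample)
        history = ([sample] + history)[:order]
    return out, history
```

Since the excitation is stationary noise and the parameters evolve smoothly between SID updates, the output has no short-term variation beyond that of the noise itself, which is consistent with the "static" perception described above.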
One conventional approach to solving this “static” comfort noise problem is simply to increase the update rate of DTX comfort noise parameters (e.g., use a higher SID frame rate). Exemplary problems with this solution are that battery consumption (e.g., in a mobile transceiver) will increase because the transmitter must be operated more often, and system capacity will decrease because of the increased SID frame rate. Thus, it is common in conventional systems to accept the static background noise.
It is therefore desirable to avoid the aforementioned disadvantages associated with conventional comfort noise generation.
According to the invention, conventionally generated comfort noise parameters are modified based on properties of actual background noise experienced at the encoder. Comfort noise generated from the modified parameters is perceived as less static than conventionally generated comfort noise, and more similar to the actual background noise experienced at the encoder.