Audio encoders and decoders are used for a wide variety of applications in communication, multimedia and storage systems. An audio encoder is used for encoding audio signals, like speech, in particular for enabling an efficient transmission or storage of the audio signal, while an audio decoder constructs a synthesized signal based on a received encoded signal. A pair of an audio encoder and an audio decoder is referred to as an audio codec.
When implementing an audio codec, it is thus an aim to save transmission and storage capacity while maintaining a high quality of the synthesized audio signal. Robustness with respect to transmission errors is also important, especially in mobile and voice over internet protocol (VoIP) applications. On the other hand, the complexity of the audio codec is limited by the processing power of the application platform.
A speech codec (including a speech encoder and a speech decoder) may be seen as an audio codec that is specifically tailored for encoding and decoding speech signals. In a typical speech encoder, the input speech signal is processed in segments, which are called frames. Typically the frame length is from 10 to 30 ms, and in addition a lookahead segment covering e.g. 5-15 ms at the beginning of the immediately following frame may be available to the coder. The frame length may be fixed (e.g. to 20 ms) or may be varied from frame to frame. A frame may further be divided into a number of subframes. For every frame, the speech encoder determines a parametric representation of the input signal. The parameters are quantized and transmitted through a communication channel or stored in a storage medium in a digital form. At the receiving end, the speech decoder constructs a synthesized signal based on the received parameters.
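The segmentation described above can be illustrated with a minimal sketch. The function name, the 8 kHz sampling rate, and the 20 ms / 5 ms frame and lookahead lengths are illustrative assumptions, not part of the original description:

```python
import numpy as np

def split_into_frames(signal, sample_rate=8000, frame_ms=20, lookahead_ms=5):
    """Split a signal into fixed-length frames, each paired with a
    lookahead segment taken from the beginning of the next frame."""
    frame_len = sample_rate * frame_ms // 1000      # e.g. 160 samples at 8 kHz
    look_len = sample_rate * lookahead_ms // 1000   # e.g. 40 samples at 8 kHz
    frames = []
    n_frames = len(signal) // frame_len
    for i in range(n_frames):
        start = i * frame_len
        frame = signal[start:start + frame_len]
        # Lookahead: the first look_len samples of the following frame
        # (empty for the last frame, where no following samples exist).
        lookahead = signal[start + frame_len:start + frame_len + look_len]
        frames.append((frame, lookahead))
    return frames

# 100 ms of a dummy signal at 8 kHz yields five 20 ms frames.
sig = np.arange(800, dtype=float)
frames = split_into_frames(sig)
print(len(frames), len(frames[0][0]), len(frames[0][1]))  # 5 160 40
```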
The construction of the parameters and the quantization are usually based on codebooks, which contain codevectors optimized for the respective quantization task. In many cases, high compression ratios require highly optimized codebooks. Often the performance of a quantizer can be improved for a given compression ratio by using prediction from one or more previous frames and/or from one or more following frames. Such a quantization will be referred to in the following as predictive quantization, in contrast to a non-predictive quantization which does not rely on any information from preceding frames. A predictive quantization exploits a correlation between a current audio frame and at least one neighboring audio frame for obtaining a prediction for the current frame so that for instance only deviations from this prediction have to be encoded. This requires dedicated codebooks.
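A rough sketch of the contrast between the two quantization styles described above, assuming a simple nearest-neighbor codebook search, a first-order prediction with a hypothetical coefficient `alpha`, and small random codebooks chosen purely for illustration:

```python
import numpy as np

def nearest(codebook, x):
    """Index of the codevector closest to x in Euclidean distance."""
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))

def encode_frame(x, prev_q, direct_cb, residual_cb, alpha=0.8):
    """Quantize parameter vector x non-predictively (full-range codebook)
    and predictively (only the deviation from the prediction is coded)."""
    i_direct = nearest(direct_cb, x)
    # Prediction from the previous quantized frame; the residual codebook
    # need only cover the (typically smaller) range of deviations.
    pred = alpha * prev_q
    i_res = nearest(residual_cb, x - pred)
    return i_direct, i_res, pred

def decode_predictive(i_res, pred, residual_cb):
    """Reconstruct the parameter vector as prediction plus residual."""
    return pred + residual_cb[i_res]

rng = np.random.default_rng(0)
direct_cb = rng.uniform(-1, 1, size=(64, 2))          # hypothetical full-range codebook
residual_cb = 0.1 * rng.uniform(-1, 1, size=(64, 2))  # codebook trained for small residuals
prev_q = np.array([0.5, -0.3])
x = 0.8 * prev_q + np.array([0.02, -0.01])            # current frame correlates with previous
i_d, i_r, pred = encode_frame(x, prev_q, direct_cb, residual_cb)
x_hat = decode_predictive(i_r, pred, residual_cb)
```

Because the residual codebook spans only the deviation range, the same number of codevectors covers it more densely than a full-range codebook could, which is why prediction improves quantizer performance at a given compression ratio.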
Predictive quantization, however, might result in problems in case of errors in transmission or storage. With predictive quantization, a new frame cannot be decoded perfectly, even when received correctly, if at least one preceding frame on which the prediction is based is erroneous or missing. It is therefore useful to apply a non-predictive quantization instead of a predictive one once in a while, e.g. at predefined intervals of a fixed number of frames, in order to prevent long runs of error propagation. For such an occasional non-predictive quantization, which is also referred to as "safety-net" quantization, one or more selection criteria may be applied to select one of predictive quantization and non-predictive quantization on a frame-by-frame basis to limit the error propagation in case of a frame erasure.
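The frame-by-frame selection could be sketched as follows. The interval of 8 frames, the distortion-comparison criterion, and the bias factor favoring the safety-net are hypothetical choices used only to illustrate the idea of combining a fixed interval with a per-frame criterion:

```python
def select_quantizer(frame_index, err_pred, err_safety, interval=8, bias=1.15):
    """Pick between predictive and safety-net quantization for one frame.

    A safety-net frame is forced at fixed intervals so that an erasure can
    propagate through the prediction memory for at most interval - 1 frames;
    otherwise the predictive mode is used only while its distortion stays
    clearly below that of the safety-net quantizer (bias > 1 favors the
    safety-net to improve error resilience).
    """
    if frame_index % interval == 0:
        return "safety-net"
    return "predictive" if err_pred * bias < err_safety else "safety-net"

# Frames 0 and 8 are forced safety-net frames regardless of distortion.
modes = [select_quantizer(i, err_pred=0.1, err_safety=1.0) for i in range(10)]
print(modes[0], modes[1], modes[8])  # safety-net predictive safety-net
```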