The demand for efficient digital speech and audio coding techniques with a good trade-off between subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communication.
A speech coder converts a speech signal into a digital bit stream which is transmitted over a communication channel or stored in a storage medium. The speech signal to be coded is digitized, that is, sampled and quantized using, for example, 16 bits per sample. The challenge for the speech coder is to represent the digital samples with a smaller number of bits while maintaining good subjective speech quality. A speech decoder or synthesizer converts the transmitted or stored bit stream back to a speech signal.
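To make the bit-reduction challenge concrete, the following sketch computes the bit rate of the uncoded digital signal and the compression ratio a coder would need to reach a given target. The 16 kHz sampling rate and the 12.8 kbit/s target used in the usage lines are illustrative assumptions, not values taken from the text; only the 16 bits per sample comes from the description above.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate (bit/s) of the digitized signal before any coding."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(sample_rate_hz, bits_per_sample, coded_bit_rate):
    """How many times smaller the coded stream is than the raw samples."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample) / coded_bit_rate


# Assumed values: 16 kHz sampling, hypothetical 12.8 kbit/s coded stream.
raw = pcm_bit_rate(16000, 16)
ratio = compression_ratio(16000, 16, 12800)
print(raw, ratio)
```

Under these assumed values the raw signal occupies 256 kbit/s, so the coder would have to represent it with one twentieth of the bits.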
Code-Excited Linear Prediction (CELP) coding is one of the best techniques for achieving a good compromise between subjective quality and bit rate. The CELP coding technique is the basis for several speech coding standards in both wireless and wireline applications. In CELP coding, the speech signal is sampled and processed in successive blocks of L samples, usually called frames, where L is a predetermined number of samples corresponding typically to 10-30 ms of speech. A linear prediction (LP) filter is computed and transmitted every frame; the LP filter is also known as the LPC (Linear Prediction Coefficients) filter. The computation of the LPC filter typically uses a lookahead, for example a 5-15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. In each subframe, an excitation signal is usually obtained from two components: a past excitation and an innovative, fixed-codebook excitation. The past excitation is often referred to as the adaptive-codebook or pitch-codebook excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the excitation signal is reconstructed and used as the input of the LPC filter.
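The decoder side of this description can be sketched for a single subframe as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular standard: the function names, the integer pitch lag, the two scalar gains, and the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the LPC filter are all hypothetical choices made for the example. The excitation is formed as a gain-scaled copy of the past excitation plus a gain-scaled fixed-codebook vector, and is then passed through the synthesis filter 1/A(z).

```python
def adaptive_codebook(past_exc, lag, n):
    """Build n samples by repeating the past excitation with period `lag`."""
    buf = list(past_exc)
    out = []
    for _ in range(n):
        sample = buf[-lag]   # sample located `lag` positions in the past
        out.append(sample)
        buf.append(sample)   # allows lags shorter than the subframe
    return out


def decode_subframe(past_exc, lag, pitch_gain, code_vec, code_gain, lpc, syn_mem):
    """Reconstruct the excitation and synthesize one subframe.

    lpc     -- coefficients a_1..a_p of the assumed A(z) = 1 + sum a_k z^-k
    syn_mem -- last p synthesized samples, oldest first
    """
    n = len(code_vec)
    adaptive = adaptive_codebook(past_exc, lag, n)
    # Total excitation: adaptive (pitch) part plus fixed (innovative) part.
    exc = [pitch_gain * a + code_gain * c for a, c in zip(adaptive, code_vec)]

    # Synthesis filter 1/A(z): s[n] = exc[n] - sum_k a_k * s[n-k].
    mem = list(syn_mem)
    speech = []
    for e in exc:
        s = e - sum(a * mem[-k] for k, a in enumerate(lpc, start=1))
        speech.append(s)
        mem.append(s)
    return exc, speech, mem[len(mem) - len(lpc):]


# Toy usage with made-up parameter values.
exc, speech, new_mem = decode_subframe(
    past_exc=[0.0, 0.0, 1.0, 0.0], lag=2, pitch_gain=0.5,
    code_vec=[0.0, 1.0, 0.0, 0.0], code_gain=2.0,
    lpc=[-0.5], syn_mem=[0.0],
)
print(exc, speech)
```

In a real coder the reconstructed excitation would be appended to the past-excitation buffer and the returned filter memory carried into the next subframe; the toy values here only exercise the data flow.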
In applications such as multimedia streaming and broadcast, it may be required to encode speech, music, and mixed content at a low bit rate. For that purpose, encoding models have been developed which combine CELP coding optimized for speech signals with transform coding optimized for audio signals. One example of such a model is AMR-WB+ [1], which switches between CELP and TCX (Transform Coded eXcitation). In order to improve the quality of music and mixed content, a long delay is used to allow for finer frequency resolution in the transform domain. In AMR-WB+, a so-called super-frame is used which consists of four CELP frames (typically 80 ms).
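The super-frame structure can be illustrated with a short framing sketch. An 80 ms super-frame of four frames implies 20 ms per frame; the 12.8 kHz sampling rate used below is an assumption made for the arithmetic, and the helper names are hypothetical. The sketch groups a sample buffer into super-frames of four frames each and drops any incomplete trailing super-frame.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


def split_into_superframes(samples, frame_len, frames_per_superframe=4):
    """Group samples into super-frames, each a list of equal-length frames."""
    sf_len = frame_len * frames_per_superframe
    superframes = []
    # Step through whole super-frames only; leftover samples are ignored.
    for start in range(0, len(samples) - sf_len + 1, sf_len):
        sf = samples[start:start + sf_len]
        frames = [sf[i:i + frame_len] for i in range(0, sf_len, frame_len)]
        superframes.append(frames)
    return superframes


# Assumed 12.8 kHz sampling rate; 20 ms frames follow from 80 ms / 4.
frame_len = samples_per_frame(12800, 20)
superframes = split_into_superframes(list(range(2100)), frame_len)
print(frame_len, len(superframes))
```

With these assumed values each frame holds 256 samples and each super-frame 1024, which is the longer analysis window that gives the transform coder its finer frequency resolution.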
A drawback is that, although the CELP coding parameters are transmitted once every 4 frames in AMR-WB+, quantization of the LPC filter is performed separately in each frame. Also, the LPC filter is quantized with a fixed number of bits per frame in the case of CELP frames.