The demand for efficient digital wideband speech/audio encoding techniques with a good subjective quality/bit rate trade-off is increasing for numerous applications such as audio/video teleconferencing, multimedia, and wireless applications, as well as Internet and packet network applications. Until recently, telephone bandwidths in the range of 200-3400 Hz were mainly used in speech coding applications. However, there is an increasing demand for wideband speech applications in order to increase the intelligibility and naturalness of the speech signals. A bandwidth in the range of 50-7000 Hz has been found sufficient for delivering face-to-face speech quality. For audio signals, this range gives an acceptable audio quality, but it is still lower than CD (Compact Disc) quality, which covers the range of 20-20000 Hz.
A speech encoder converts a speech signal into a digital bit stream that is transmitted over a communication channel (or stored in a storage medium). The speech signal is digitized (sampled and quantized, usually with 16 bits per sample), and the speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
One of the best available techniques capable of achieving a good subjective quality/bit rate trade-off is the so-called CELP (Code Excited Linear Prediction) technique. According to this technique, the sampled speech signal is processed in successive blocks of L samples usually called frames where L is some predetermined number (corresponding to 10-30 ms of speech). In CELP, an LP (Linear Prediction) synthesis filter is computed and transmitted every frame. The L-sample frame is further divided into smaller blocks called subframes of N samples, where L=kN and k is the number of subframes in a frame (N usually corresponds to 4-10 ms of speech). An excitation signal is determined in each subframe, which usually comprises two components: one from the past excitation (also called pitch contribution or adaptive codebook) and the other from an innovative codebook (also called fixed codebook). This excitation signal is transmitted and used at the decoder as the input of the LP synthesis filter in order to obtain the synthesized speech.
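The frame and subframe relationship L=kN and the two-component excitation described above can be illustrated with a minimal Python sketch. The function names, the millisecond-based interface, and the two gain parameters (`pitch_gain`, `code_gain`) are assumptions introduced here for illustration only; the text above does not specify how the two contributions are scaled.

```python
def frame_layout(sample_rate_hz, frame_ms, subframe_ms):
    """Return (L, N, k): frame length, subframe length, subframes per frame."""
    L = sample_rate_hz * frame_ms // 1000
    N = sample_rate_hz * subframe_ms // 1000
    if N == 0 or L % N != 0:
        raise ValueError("a frame must hold a whole number of subframes")
    return L, N, L // N


def build_excitation(adaptive_cv, fixed_cv, pitch_gain, code_gain):
    """Combine the adaptive and innovative contributions for one subframe.

    The gains are illustrative assumptions, not values taken from the text.
    """
    if len(adaptive_cv) != len(fixed_cv):
        raise ValueError("both codevectors must span one subframe")
    return [pitch_gain * v + code_gain * c
            for v, c in zip(adaptive_cv, fixed_cv)]


# A 20 ms frame with 5 ms subframes at 12800 samples per second.
layout_12k8 = frame_layout(12800, 20, 5)
# The same durations at 16000 samples per second.
layout_16k = frame_layout(16000, 20, 5)
```

With these example durations, the number of subframes k stays the same at both sampling rates while L and N change, which is one reason buffers sized in samples cannot simply be reused across a change of sampling rate.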
To synthesize speech according to the CELP technique, each block of N samples is synthesized by filtering an appropriate codevector from the innovative codebook through time-varying filters modeling the spectral characteristics of the speech signal. These filters comprise a pitch synthesis filter (usually implemented as an adaptive codebook containing the past excitation signal) and an LP synthesis filter. At the encoder end, the synthesis output is computed for all, or a subset, of the codevectors from the innovative codebook (codebook search). The retained innovative codevector is the one producing the synthesis output closest to the original speech signal according to a perceptually weighted distortion measure. This perceptual weighting is performed using a so-called perceptual weighting filter, which is usually derived from the LP synthesis filter.
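The codebook search described above can be sketched as follows. This is a deliberately simplified illustration and not the search of any particular codec: it assumes zero filter memory, omits the pitch synthesis filter, and uses a plain squared error with an optimal gain in place of the perceptually weighted distortion measure. The names `lp_synthesis` and `search_codebook`, and the convention A(z) = 1 + a_1 z^-1 + ... + a_M z^-M, are assumptions made for the example.

```python
def lp_synthesis(excitation, lp_coeffs):
    """Filter the excitation through 1/A(z), starting from zero memory.

    lp_coeffs holds a_1..a_M of A(z) = 1 + a_1 z^-1 + ... + a_M z^-M.
    """
    out = []
    for n, u in enumerate(excitation):
        acc = u
        for i, a in enumerate(lp_coeffs, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


def search_codebook(target, codebook, lp_coeffs):
    """Return (index, gain) of the codevector whose synthesis best fits target.

    For each candidate, the gain minimising the squared error is
    corr / energy, and the remaining error is smallest when
    corr**2 / energy is largest. No perceptual weighting is applied here.
    """
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for index, codevector in enumerate(codebook):
        synth = lp_synthesis(codevector, lp_coeffs)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue  # an all-zero synthesis cannot match anything
        corr = sum(t * v for t, v in zip(target, synth))
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


# Toy example: three unit-pulse codevectors and a first-order filter.
toy_codebook = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
toy_lp = [-0.5]
toy_target = [2.0 * v for v in lp_synthesis(toy_codebook[1], toy_lp)]
toy_choice = search_codebook(toy_target, toy_codebook, toy_lp)
```

In a practical encoder the target and the synthesized candidates would both pass through the perceptual weighting filter before the comparison, and the contribution of the adaptive codebook would be removed from the target first.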
In LP-based coders such as CELP, an LP filter is computed, then quantized and transmitted once per frame. However, in order to ensure smooth evolution of the LP synthesis filter, the filter parameters are interpolated in each subframe, based on the LP parameters from the past frame. The LP filter parameters themselves are not well suited to quantization because of filter stability issues, so another LP representation that is more efficient for quantization and interpolation is usually used. A commonly used LP parameter representation is the Line Spectral Frequency (LSF) domain.
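The per-subframe interpolation mentioned above can be sketched as a linear blend between the LSF vector of the past frame and that of the current frame. The linear weights j/k used here are an assumption chosen for simplicity; actual codecs may use other weights, and the function name `interpolate_lsf` is hypothetical.

```python
def interpolate_lsf(prev_lsf, curr_lsf, num_subframes):
    """Return one interpolated LSF vector per subframe.

    Subframe j (1-based) uses weight w = j / num_subframes on the current
    frame, so the last subframe uses the current frame's LSFs unchanged.
    The linear weighting is an illustrative assumption.
    """
    if len(prev_lsf) != len(curr_lsf):
        raise ValueError("LSF vectors must have the same order")
    result = []
    for j in range(1, num_subframes + 1):
        w = j / num_subframes
        result.append([(1.0 - w) * p + w * c
                       for p, c in zip(prev_lsf, curr_lsf)])
    return result


# Two-coefficient toy vectors, four subframes per frame.
toy_lsfs = interpolate_lsf([0.0, 1.0], [1.0, 2.0], 4)
```

Because each interpolated vector is a convex combination of two vectors, it stays in increasing order whenever both inputs are in increasing order, which is part of what makes the LSF domain convenient for interpolation. The sketch also presupposes that both vectors were computed at the same sampling rate.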
In wideband coding, the sound signal is sampled at 16000 samples per second and the encoded bandwidth is extended up to 7 kHz. However, in low bit rate wideband coding (below 16 kbit/s), it is usually more efficient to down-sample the input signal to a slightly lower rate, apply the CELP model to a lower bandwidth, and then use bandwidth extension at the decoder to generate the signal up to 7 kHz. This is because CELP models the high-energy lower frequencies better than the higher frequencies, so it is more efficient to focus the model on the lower bandwidth at low bit rates. The AMR-WB Standard (Reference [1] of which the full content is hereby incorporated by reference) is such a coding example, where the input signal is down-sampled to 12800 samples per second and the CELP model encodes the signal up to 6.4 kHz. At the decoder, bandwidth extension is used to generate a signal from 6.4 to 7 kHz. However, at bit rates higher than 16 kbit/s, it is more efficient to use CELP to encode the signal up to 7 kHz, since there are enough bits to represent the entire bandwidth.
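The relationship between the two sampling rates above can be checked with a short Python sketch. It only computes the rational resampling ratio and the resulting upper band edge; it does not implement a resampling filter. The function names and the 20 ms frame duration are assumptions made for the example.

```python
from math import gcd


def resampling_ratio(in_rate_hz, out_rate_hz):
    """Return (up, down) such that out_rate = in_rate * up / down."""
    g = gcd(in_rate_hz, out_rate_hz)
    return out_rate_hz // g, in_rate_hz // g


def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


up, down = resampling_ratio(16000, 12800)
# Samples in an assumed 20 ms frame before and after down-sampling.
frame_in = 16000 * 20 // 1000
frame_out = frame_in * up // down
```

Down-sampling from 16000 to 12800 samples per second is thus a 4/5 rate change, and the band edge of the lower rate is 6.4 kHz, which matches the bandwidth the text states for the CELP model in this configuration.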
Most recent coders are multi-rate coders covering a wide range of bit rates to enable flexibility in different application scenarios. Again, the AMR-WB Standard is such an example, where the encoder operates at bit rates from 6.6 to 23.85 kbit/s. In multi-rate coders, the codec should be able to switch between different bit rates on a frame basis without introducing switching artefacts. In AMR-WB, this is easily achieved since all the bit rates use CELP at 12.8 kHz internal sampling. However, in a recent coder using 12.8 kHz sampling at bit rates below 16 kbit/s and 16 kHz sampling at bit rates higher than 16 kbit/s, the issues related to switching the bit rate between frames using different sampling rates need to be addressed. The main issues are related to the LP filter transition, and to the memory of the synthesis filter and the adaptive codebook.
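The situation described above can be made concrete with a small Python sketch that maps each frame's bit rate to an internal sampling rate and reports the frames where that rate changes. The function names are hypothetical, the per-frame bit rates are made-up example values, and the handling of a bit rate of exactly 16 kbit/s (treated here as using 16 kHz) is an assumption, since the text only distinguishes rates below and above that value.

```python
def internal_rate_hz(bit_rate_bps, threshold_bps=16000):
    """Pick the internal sampling rate for one frame.

    Rates below the threshold use 12800 Hz, all others 16000 Hz.
    Treating the threshold itself as 16000 Hz is an assumption.
    """
    return 12800 if bit_rate_bps < threshold_bps else 16000


def switching_frames(bit_rates_bps):
    """Return indices of frames whose internal rate differs from the previous frame."""
    rates = [internal_rate_hz(b) for b in bit_rates_bps]
    return [i for i in range(1, len(rates)) if rates[i] != rates[i - 1]]


# Hypothetical per-frame bit rates in bit/s.
example_bit_rates = [13200, 13200, 24400, 24400, 9600]
example_switches = switching_frames(example_bit_rates)
```

At each reported frame, the LP parameters of the past frame, the synthesis filter memory, and the adaptive codebook contents all refer to a different sampling rate than the current frame, which is where the issues listed above arise.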
Therefore there remains a need for an efficient technique for switching LP-based codecs between two bit rates with different internal sampling rates.