The demand for efficient digital speech and audio coding techniques with a good trade-off between subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications.
A speech coder converts a speech signal into a digital bit stream that is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized, usually with 16 bits per sample. The role of the speech coder is to represent these digital samples with a smaller number of bits while maintaining good subjective speech quality. The speech decoder, or synthesizer, operates on the transmitted or stored bit stream and converts it back to a sound signal.
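As a rough illustration of the bit-rate reduction involved, the following minimal sketch computes the bit rate of uncompressed 16-bit samples and the resulting compression ratio. The 8 kHz sampling rate and the 8 kbit/s coded rate are assumptions chosen for illustration only; the text above does not specify either value, and the function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int = 16) -> int:
    """Bit rate, in bits per second, of the uncompressed digitized signal."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(sample_rate_hz: int, coded_bit_rate_bps: int,
                      bits_per_sample: int = 16) -> float:
    """Ratio between the uncompressed bit rate and the coder's bit rate."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample) / coded_bit_rate_bps


# Assumed values: 8 kHz sampling, hypothetical 8 kbit/s speech coder.
raw_rate = pcm_bit_rate(8000)
ratio = compression_ratio(8000, 8000)
print(raw_rate, ratio)
```

Under these assumed values, the coder would have to represent each 16-bit sample with one bit on average.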
Code-Excited Linear Prediction (CELP) coding is one of the best prior art techniques for achieving a good compromise between subjective quality and bit rate. The CELP coding technique is the basis of several speech coding standards in both wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of L samples, usually called frames, where L is a predetermined number of samples corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically uses a lookahead, for example a 5-15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three (3) or four (4), resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components: a past excitation and an innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive-codebook or pitch-codebook excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the excitation signal is reconstructed and used as the input of the LP filter.
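The decoder-side steps described above can be sketched as follows. This is a simplified illustration, not the procedure of any particular standard: the function names, the integer pitch lag, the gain parameters, and the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the LP filter are all assumptions introduced here.

```python
def split_frame(frame, num_subframes):
    """Divide an L-sample frame into equal-length subframes."""
    n = len(frame) // num_subframes
    return [frame[i * n:(i + 1) * n] for i in range(num_subframes)]


def adaptive_codebook_vector(past_excitation, pitch_lag, length):
    """Build the adaptive-codebook (pitch-codebook) component by copying the
    past excitation delayed by pitch_lag samples. When the lag is shorter
    than the subframe, the copied segment is repeated periodically."""
    vec = []
    for n in range(length):
        if n < pitch_lag:
            vec.append(past_excitation[len(past_excitation) - pitch_lag + n])
        else:
            vec.append(vec[n - pitch_lag])
    return vec


def build_excitation(past_excitation, pitch_lag, pitch_gain,
                     fixed_vector, fixed_gain):
    """Excitation = scaled past excitation + scaled fixed-codebook vector."""
    adaptive = adaptive_codebook_vector(past_excitation, pitch_lag,
                                        len(fixed_vector))
    return [pitch_gain * a + fixed_gain * c
            for a, c in zip(adaptive, fixed_vector)]


def lp_synthesis(excitation, lp_coeffs, memory):
    """Filter the excitation through the all-pole filter 1/A(z).
    `memory` holds the last len(lp_coeffs) outputs, most recent first."""
    state = list(memory)
    out = []
    for u in excitation:
        s = u - sum(a * y for a, y in zip(lp_coeffs, state))
        out.append(s)
        state = ([s] + state)[:len(lp_coeffs)]
    return out, state


# Assumed example: a 160-sample frame split into four 40-sample subframes.
subframes = split_frame(list(range(160)), 4)
excitation = build_excitation([1, 2, 3, 4], pitch_lag=2, pitch_gain=0.5,
                              fixed_vector=[1, 0, 0, 0, 0], fixed_gain=2.0)
speech, state = lp_synthesis([1.0, 0.0, 0.0], [-0.5], [0.0])
```

In a real codec the pitch lag, the gains, and the fixed-codebook index would be decoded from the bit stream for each subframe, and the filter memory would carry over from one subframe to the next.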
In some applications, such as music-on-hold, low bit rate speech-specific codecs are used to operate on music signals. This usually results in poor music quality, because such a codec relies on a speech production model.
In some music signals, the spectrum exhibits a tonal structure in which several tones are present (corresponding to spectral peaks) and are not harmonically related. These music signals are difficult to encode with a low bit rate speech-specific codec using an all-pole synthesis filter and a pitch filter. The pitch filter is capable of modeling voiced segments in which the spectrum exhibits a harmonic structure comprising a fundamental frequency and harmonics of this fundamental frequency. However, such a pitch filter fails to properly model tones which are not harmonically related. Furthermore, the all-pole synthesis filter fails to model the spectral valleys between the tones. Thus, when a low bit rate speech-specific codec using a speech production model such as CELP is used, music signals exhibit an audible quantization noise in the low-energy regions of the spectrum (inter-tone regions or spectral valleys).
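The limitation of the pitch filter can be illustrated with a short numerical sketch. It assumes the common single-tap form 1/(1 - b z^-T) for the pitch synthesis filter, which is a textbook simplification and not necessarily the form used in any given codec; the sampling rate, lag, gain, and tone frequencies are likewise assumed values.

```python
import cmath
import math


def pitch_filter_gain(freq_hz, lag_samples, pitch_gain, sample_rate_hz):
    """Magnitude of 1 / (1 - b z^-T) evaluated at z = exp(j 2 pi f / fs)."""
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    denom = 1.0 - pitch_gain * cmath.exp(-1j * omega * lag_samples)
    return 1.0 / abs(denom)


# Assumed setup: 8 kHz sampling and a lag of 40 samples, so the filter's
# response peaks at multiples of 8000 / 40 = 200 Hz.
FS, LAG, GAIN = 8000, 40, 0.8

for tone_hz in (200, 400, 330):
    print(tone_hz, round(pitch_filter_gain(tone_hz, LAG, GAIN, FS), 3))
```

With these values, the tones at 200 Hz and 400 Hz fall on peaks of the filter response, whereas a tone at 330 Hz, which is not a multiple of 200 Hz, falls between peaks and is attenuated. A single lag can therefore emphasize only one harmonic series at a time.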