Demand for efficient digital narrowband and wideband speech coding techniques with a good trade-off between the subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications. Until recently, telephone bandwidth constrained into a range of 200-3400 Hz has mainly been used in speech coding applications. However, wideband speech applications provide increased intelligibility and naturalness in communication compared to the conventional telephone bandwidth. A bandwidth in the range 50-7000 Hz has been found sufficient for delivering a good quality giving an impression of face-to-face communication. For general audio signals, this bandwidth gives an acceptable subjective quality, but is still lower than the quality of FM radio or CD that operate on ranges of 20-16000 Hz and 20-20000 Hz, respectively.
A speech encoder converts a speech signal into a digital bit stream, which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized with usually 16-bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
Code-Excited Linear Prediction (CELP) coding is a well-known technique allowing achieving a good compromise between the subjective quality and bit rate. This coding technique is a basis of several speech coding standards both in wireless and wired applications. In CELP coding, the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs look ahead, e.g. a 5-15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
In wireless systems using code division multiple access (CDMA) technology, the use of source-controlled variable bit rate (VBR) speech coding significantly improves the system capacity. In source-controlled VBR coding, the codec operates at several bit rates, and a rate selection module is used to determine the bit rate used for encoding each speech frame based on the nature of the speech frame (e.g. voiced, unvoiced, transient, background noise). The goal is to attain the best speech quality at a given average bit rate, also referred to as average data rate (ADR). The codec can operate at different modes by tuning the rate selection module to attain different ADRs at the different modes where the codec performance is improved at increased ADRs. The mode of operation is imposed by the system depending on channel conditions. This enables the codec with a mechanism of trade-off between speech quality and system capacity.
Typically, in VBR coding for CDMA systems, the eighth-rate is used for encoding frames without speech activity (silence or noise-only frames). When the frame is stationary voiced or stationary unvoiced, half-rate or quarter-rate are used depending on the operating mode. If half-rate can be used, a CELP model without the pitch codebook is used in unvoiced case and a signal modification is used to enhance the periodicity and reduce the number of bits for the pitch indices in voiced case. If the operating mode imposes a quarter-rate, no waveform matching is usually possible as the number of bits is insufficient and some parametric coding is generally applied. Full-rate is used for onsets, transient frames, and mixed voiced frames (a typical CELP model is usually used). In addition to the source controlled codec operation in CDMA systems, the system can limit the maximum bit-rate in some speech frames in order to send in-band signalling information (called dim-and-burst signalling) or during bad channel conditions (such as near the cell boundaries) in order to improve the codec robustness. This is referred to as half-rate max.
As can be seen from the above description, efficient low bit rate coding (at half-rates) is very essential for efficient VBR coding, to enable the reduction in the average data rate while maintaining good sound quality, and also to maintain a good performance when the codec is forced to operate in maximum half-rate.