Demand for efficient digital narrowband and wideband speech coding techniques with a good trade-off between the subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications. Until recently, telephone bandwidth constrained into a range of 200–3400 Hz has mainly been used in speech coding applications. However, wideband speech applications provide increased intelligibility and naturalness in communication compared to the conventional telephone bandwidth. A bandwidth in the range 50–7000 Hz has been found sufficient for delivering a good quality giving an impression of face-to-face communication. For general audio signals, this bandwidth gives an acceptable subjective quality, but is still lower than the quality of FM radio or CD that operate on ranges of 20–16000 Hz and 20–20000 Hz, respectively.
A speech encoder converts a speech signal into a digital bit stream, which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized with usually 16-bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
Code-Excited Linear Prediction (CELP) coding is a well-known technique allowing achieving a good compromise between the subjective quality and bit rate. This coding technique is a basis of several speech coding standards both in wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10–30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, a 5–15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4–10 ms subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
In wireless systems using code division multiple access (CDMA) technology, the use of source-controlled variable bit rate (VBR) speech coding significantly improves the system capacity. In source-controlled VBR coding, the codec operates at several bit rates, and a rate selection module is used to determine the bit rate used for encoding each speech frame based on the nature of the speech frame (e.g. voiced, unvoiced, transient, background noise). The goal is to attain the best speech quality at a given average bit rate, also referred to as average data rate (ADR). The codec can operate at different modes by tuning the rate selection module to attain different ADRs at the different modes where the codec performance is improved at increased ADRs. The mode of operation is imposed by the system depending on channel conditions. This enables the codec with a mechanism of trade-off between speech quality and system capacity.
Typically, in VBR coding for CDMA systems, an eighth-rate is used for encoding frames without speech activity (silence or noise-only frames). When the frame is stationary voiced or stationary unvoiced, half-rate or quarter-rate are used depending on the operating mode. If half-rate can be used, a CELP model without the pitch codebook is used in unvoiced case and a signal modification is used to enhance the periodicity and reduce the number of bits for the pitch indices in voiced case. If the operating mode imposes a quarter-rate, no waveform matching is usually possible as the number of bits is insufficient and some parametric coding is generally applied. Full-rate is used for onsets, transient frames, and mixed voiced frames (a typical CELP model is usually used). In addition to the source controlled codec operation in CDMA systems, the system can limit the maximum bit-rate in some speech frames in order to send in-band signalling information (called dim-and-burst signalling) or during bad channel conditions (such as near the cell boundaries) in order to improve the codec robustness. This is referred to as half-rate max. When the rate-selection module chooses the frame to be encoded as a full-rate frame and the system imposes for example HR frame, the speech performance is degraded since the dedicated HR modes are not capable of efficiently encoding onsets and transient signals. Another HR (or quarter-rate (QR)) coding model can be provided to cope with these special cases.
As can be seen from the above description, signal classification and rate determination are very essential for efficient VBR coding. Rate selection is the key part for attaining the lowest average data rate with the best possible quality.
An adaptive multi-rate wideband (AMR-WB) speech codec was recently selected by the ITU-T (International Telecommunications Union—Telecommunication Standardization Sector) for several wideband speech telephony and services and by 3 GPP (third generation partnership project) for GSM and W-CDMA third generation wireless systems. AMR-WB codec consists of nine bit rates, namely 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, and 23.85 kbit/s. Interoperation between CDMA-WB and AMR-WB codec is thus desirable.