Demand for efficient digital narrowband and wideband speech coding techniques with a good trade-off between the subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications. Until recently, telephone bandwidth constrained into a range of 200-3400 Hz has mainly been used in speech coding applications. However, wideband speech applications provide increased intelligibility and naturalness in communication compared to the conventional telephone bandwidth. A bandwidth in the range 50-7000 Hz has been found sufficient for delivering a good quality giving an impression of face-to-face communication. For general audio signals, this bandwidth gives an acceptable subjective quality, but is still lower than the quality of FM radio or CD that operate in the ranges of 20-16000 Hz and 20-20000 Hz, respectively.
A speech encoder converts a speech signal into a digital bit stream that is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized with usually 16-bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
Code-Excited Linear Prediction (CELP) coding is one of the best prior art techniques for achieving a good compromise between the subjective quality and bit rate. This coding technique constitutes a basis for several speech coding standards both in wireless and wire line applications. In CELP coding, the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, i.e. a 5-15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
In wireless systems using Code Division Multiple Access (CDMA) technology, the use of source-controlled variable bit rate (VBR) speech coding significantly improves the capacity of the system. In source-controlled VBR coding, the codec operates at several bit rates, and a rate selection module is used to determine which bit rate is used for encoding each speech frame based on the nature of the speech frame (e.g. voiced, unvoiced, transient, background noise, etc.). The goal is to attain the best speech quality at a given average bit rate, also referred to as average data rate (ADR). The codec can operate with different modes by tuning the rate selection module to attain different ADRs in the different modes of operation where the codec performance is improved at increased ADRs. The mode of operation is imposed by the system depending on channel conditions. This enables the codec with a mechanism of trade-off between speech quality and system capacity. In CDMA systems (e.g. CDMA-one and CDMA2000), typically 4 bit rates are used and they are referred to as full-rate (FR), half-rate (HR), quarter-rate (QR), and eighth-rate (ER). In this system two rate sets are supported referred to as Rate Set I and Rate Set II. In Rate Set II, a variable-rate codec with rate selection mechanism operates at source-coding bit rates of 13.3 (FR), 6.2 (HR), 2.7 (QR), and 1.0 (ER) kbit/s, corresponding to gross bit rates of 14.4, 7.2, 3.6, and 1.8 kbit/s (with some bits added for error detection).
Typically, in VBR coding for CDMA systems, the eighth-rate is used for encoding frames without speech activity (silence or noise-only frames). When the frame is stationary voiced or stationary unvoiced, half-rate or quarter-rate are used depending on the mode of operation. When half-rate is used for the stationary unvoiced frames, a CELP model without the pitch codebook is used. When the half-rate is used in case of stationary voiced frames, signal modification is used to enhance the periodicity and reduce the number of bits for the pitch indices. If the mode of operation imposes a quarter-rate, no waveform matching is usually possible as the number of bits is insufficient and some parametric coding is generally applied. Full-rate is used for onsets, transient frames, and mixed voiced frames (a typical CELP model is usually used). In addition to the source controlled codec operation in CDMA systems, the system can limit the maximum bit rate in some speech frames in order to send in-band signaling information (called dim-and-burst signaling) or during bad channel conditions (such as near the cell boundaries) in order to improve the codec robustness. This is referred to as half-rate max. When the rate selection module chooses the frame to be encoded as a full-rate frame and the system imposes for example HR frame, the speech performance is degraded since the dedicated HR modes are not capable of efficiently encoding onsets and transient signals. Another generic HR coding model is designed to cope with these special cases.
An adaptive multi-rate wideband (AMR-WB) speech codec was adopted by the ITU-T (International Telecommunications Union—Telecommunication Standardization Sector) for several wideband speech telephony and services and by 3GPP (Third Generation Partnership Project) for GSM and W-CDMA third generation wireless systems. AMR-WB codec consists of nine bit rates, namely 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, and 23.85 kbit/s. Designing an AMR-WB-based source controlled VBR codec for CDMA systems has the advantage of enabling the interoperation between CDMA and other systems using the AMR-WB codec. The AMR-WB bit rate of 12.65 kbit/s is the closest rate that can fit in the 13.3 kbit/s full-rate of Rate Set II. This rate can be used as the common rate between a CDMA wideband VBR codec and AMR-WB to enable the interoperability without the need for transcoding (which degrades the speech quality). Lower rate coding types must be designed specifically for the CDMA VBR wideband solution to enable an efficient operation in the Rate Set II framework. The codec then can operate in few CDMA-specific modes using all rates but it will have a mode that enables interoperability with systems using the AMR-WB codec.
In VBR coding based on CELP, typically all classes, except for the unvoiced and inactive speech classes, use both a pitch (or adaptive) codebook and an innovation (or fixed) codebook to represent the excitation signal. Thus the encoded excitation consists of the pitch delay (or pitch codebook index), the pitch gain, the innovation codebook index, and the innovation codebook gain. Typically, the pitch and innovation gains are jointly quantized, or vector quantized, to reduce the bit rate. If individually quantized, the pitch gain requires 4 bits and the innovation codebook gain requires 5 or 6 bits. However, when jointly quantized, 6 or 7 bits are sufficient (saving 3 bits per 5 ms subframe is equivalent to saving 0.6 kbit/s). In general, the quantization table, or codebook, is trained using all types of speech segments (e.g. voiced, unvoiced, transient, onset, offset, etc.). In the context of VBR coding, the half-rate coding models are usually class-specific. So different half-rate models are designed for different signal classes (voiced, unvoiced, or generic). Thus new quantization tables need to be designed for these class-specific coding models.