The demand for efficient digital wideband speech/audio encoding techniques with a good subjective quality/bit rate trade-off is increasing for numerous applications such as audio/video teleconferencing, multimedia, and wireless applications, as well as Internet and packet network applications. Until recently, telephone bandwidths filtered in the range of 200-3400 Hz were mainly used in speech coding applications. However, there is an increasing demand for wideband speech applications in order to increase the intelligibility and naturalness of the speech signals. A bandwidth in the range 50-7000 Hz was found sufficient for delivering a face-to-face speech quality. For audio signals, this range gives an acceptable audio quality, but is still lower than the CD (Compact Disk) quality which operates in the range 20-20000 Hz.
A speech encoder converts a speech signal into a digital bit stream which is transmitted over a communication channel (or stored in a storage medium). The speech signal is digitized (sampled and quantized with usually 16-bits per sample) and the speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
One of the best prior art techniques capable of achieving a good quality/bit rate trade-off is the so-called CELP (Code Excited Linear Prediction) technique. According to this technique, the sampled speech signal is processed in successive blocks of L samples usually called frames where L is some predetermined number (corresponding to 10-30 ms of speech). In CELP, an LP (Linear Prediction) synthesis filter is computed and transmitted every frame. The L-sample frame is then divided into smaller blocks called subframes of N samples, where L=kN and k is the number of subframes in a frame (N usually corresponds to 4-10 ms of speech). An excitation signal is determined in each subframe, which usually consists of two components: one from the past excitation (also called pitch contribution or adaptive codebook) and the other from an innovative codebook (also called fixed codebook). This excitation signal is transmitted and used at the decoder as the input of the LP synthesis filter in order to obtain the synthesized speech.
To synthesize speech according to the CELP technique, each block of N samples is synthesized by filtering an appropriate codevector from the innovative codebook through time-varying filters modeling the spectral characteristics of the speech signal. These filters consist of a pitch synthesis filter (usually implemented as an adaptive codebook containing the past excitation signal) and an LP synthesis filter. At the encoder end, the synthesis output is computed for all, or a subset, of the codevectors from the innovative codebook (codebook search). The retained innovative codevector is the one producing the synthesis output closest to the original speech signal according to a perceptually weighted distortion measure. This perceptual weighting is performed using a so-called perceptual weighting filter, which is usually derived from the LP synthesis filter.
In the CELP context, an innovative codebook is an indexed set of N-sample-long sequences which will be referred to as N-dimensional codevectors. Each codebook sequence is indexed by an integer k ranging from 0 to Mc−1 where Mc represents the size of the innovative codebook often expressed as a number of bits b, where Mc=2b.
A codebook can be stored in a physical memory, e.g. a look-up table (stochastic codebook), or can refer to a mechanism for relating the index to a corresponding codevector, e.g. a formula (algebraic codebook).
A drawback of the first type of codebooks, the stochastic codebooks, is that they often involve substantial physical storage. They are stochastic, i.e. random in the sense that the path from the index to the associated codevector involves look-up tables which are the result of randomly generated numbers or statistical techniques applied to large speech training sets. The size of stochastic codebooks tends to be limited by storage and/or search complexity.
The second type of codebooks are the algebraic codebooks. By contrast with the stochastic codebooks, algebraic codebooks are not random and require no substantial storage. An algebraic codebook is a set of indexed codevectors of which the amplitudes and positions of the pulses of the kth codevector can be derived from a corresponding index k through a rule requiring no, or minimal, physical storage. Therefore, the size of algebraic codebooks is not limited by storage requirements. Algebraic codebooks can also be designed for efficient search.
The CELP model has been very successful in encoding telephone band sound signals, and several CELP-based standards exist in a wide range of applications, especially in digital cellular applications. In the telephone band, the sound signal is band-limited to 200-3400 Hz and sampled at 8000 samples/sec. In wideband speech/audio applications, the sound signal is band-limited to 50-7000 Hz and sampled at 16000 samples/sec.
An important issue that arises in coding wideband signals is the need to use very large excitation codebooks. Therefore, efficient codebook structures that require minimal storage and can be rapidly searched become very important. Algebraic codebooks have been known for their efficiency and are now widely used in various speech coding standards. Algebraic codebooks with larger number of bits can be searched efficiently using non-exhaustive search methods. Examples are the nested-loop search [4], the depth-first tree search [5] that searches pulses in subsets of pulses, and the global pulse replacement [6]. A simple search was used in ITU-T Recommendation G.723.1 [7] similar to the multipulse sequential search [3]. In Reference [7], the excitation consists of several signed pulses in a frame (no track structure as in ACELP) with a fixed gain for all pulses. The pulses are sequentially searched by updating the so-called backward filtered target signal d(n) and placing the new pulse at the absolute maximum of the signal d(n). The search is repeated for several gain values but the gain is assumed constant during each iteration.