In speech recognition or speech synthesis systems, digital speech is generally sampled at the Nyquist sampling rate, that is, 2 times the input signal bandwidth, or at an 8 kHz sampling rate, which results in 8,000 samples per second. At 16 bits/sample, an 8 kHz sampling rate therefore requires 128,000 bits/second. Just 10 seconds of input digital speech thus requires over a million bits of data (1,280,000 bits). Speech coding algorithms were therefore developed as a means to reduce the number of bits required to model the input speech while still maintaining a good match with the input speech.
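The arithmetic above can be checked with a short sketch. The function names are illustrative only and do not come from the system described here.

```python
def raw_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bits per second needed to store uncompressed samples."""
    return sample_rate_hz * bits_per_sample


def raw_bits(sample_rate_hz: int, bits_per_sample: int, seconds: int) -> int:
    """Total bits needed for `seconds` of uncompressed speech."""
    return raw_bit_rate(sample_rate_hz, bits_per_sample) * seconds


if __name__ == "__main__":
    # 8 kHz sampling at 16 bits/sample, as in the text.
    print(raw_bit_rate(8000, 16))
    # 10 seconds of uncompressed input speech.
    print(raw_bits(8000, 16, 10))
```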
Code-Excited Linear Prediction (CELP) is a well-known class of speech coding algorithms with good performance at low to medium bit rates (4 to 16 kb/s). For speech synthesis, CELP coders typically use a 10th-order LPC filter excited by the sum of adaptive and fixed excitation codevectors. The input speech is divided into fixed-length segments called frames for LPC analysis, and each frame is further divided into smaller fixed-length segments called subframes for the adaptive and fixed codebook excitation searches. Much of the complexity of a CELP coder can be attributed to the adaptive and fixed codebook excitation search mechanisms.
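The frame and subframe segmentation can be illustrated with a minimal sketch. The frame length of 240 samples and the subframe length of 60 samples are assumed values chosen for illustration; the text does not specify them, and the helper name `split_fixed` is hypothetical.

```python
def split_fixed(samples, length):
    """Split `samples` into consecutive, non-overlapping segments of exactly
    `length` samples. A trailing partial segment is dropped."""
    last_start = len(samples) - length
    return [samples[i:i + length] for i in range(0, last_start + 1, length)]


if __name__ == "__main__":
    FRAME_LEN = 240     # assumed: 30 ms at 8 kHz
    SUBFRAME_LEN = 60   # assumed: 4 subframes per frame

    speech = list(range(480))              # stand-in for input speech samples
    for frame in split_fixed(speech, FRAME_LEN):
        # LPC analysis would run once per frame here.
        for subframe in split_fixed(frame, SUBFRAME_LEN):
            # Adaptive and fixed codebook searches would run per subframe.
            pass
```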
As shown in FIG. 1, the CELP coder consists of an encoder/decoder pair. The encoder, as shown in FIG. 2, processes each frame of speech by computing a set of parameters which it codes and transmits to the decoder. The decoder, as shown in FIG. 3, receives the information and synthesizes an approximation to the input speech, called the coded speech. The parameters transmitted to the decoder consist of the Linear Prediction Coefficients (LPC), which specify a time-varying all-pole filter called the LPC synthesis filter, and excitation parameters specifying a time-domain waveform called the excitation signal. The excitation signal comprises the adaptive codebook excitation and the fixed (or pulsed) excitation, as shown in FIGS. 2 and 3. The decoder reconstructs the excitation signal and passes it through the LPC synthesis filter to obtain the coded speech.
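The decoder's reconstruction step can be sketched as follows, under stated assumptions. The sketch assumes the all-pole synthesis filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, so that s[n] = e[n] - sum_k a_k s[n-k]; the sign convention and the function names are assumptions, not taken from the figures.

```python
def build_excitation(adaptive, fixed):
    """Excitation signal as the sample-wise sum of the adaptive and fixed
    codebook contributions."""
    return [a + f for a, f in zip(adaptive, fixed)]


def lpc_synthesize(excitation, a, state=None):
    """Pass the excitation through the all-pole filter 1/A(z).

    a     : coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^-k (assumed sign)
    state : previous outputs, most recent first, carried between subframes
    Returns (coded_speech, new_state).
    """
    p = len(a)
    hist = list(state) if state is not None else [0.0] * p
    out = []
    for e in excitation:
        s = e - sum(a[k] * hist[k] for k in range(p))
        out.append(s)
        if p:
            hist = [s] + hist[:-1]   # shift the filter memory by one sample
    return out, hist


if __name__ == "__main__":
    # First-order example: the impulse response decays geometrically.
    excitation = build_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
    coded, memory = lpc_synthesize(excitation, [-0.5])
    print(coded)
```

Returning the filter memory lets a caller process one subframe at a time while keeping the output continuous across subframe boundaries.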
The LPC prediction parameters, obtained by LPC analysis, are converted to log-area-ratios (LARs), which the encoder can scalar quantize using, for example, 38 bits. An example of the bit allocation for the 10 LARs is as follows: 5,5,4,4,4,4,3,3,3,3.
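A minimal sketch of the LAR quantization follows. Only the bit allocation comes from the text. The LAR definition log((1+k)/(1-k)) applied to reflection coefficients is one common convention (the opposite sign is also used), and the uniform quantizer with a clipping limit of 4.0 is purely an assumption for illustration.

```python
import math

LAR_BITS = [5, 5, 4, 4, 4, 4, 3, 3, 3, 3]   # allocation from the text


def reflection_to_lar(k):
    """Log-area-ratio of a reflection coefficient k, |k| < 1 (assumed sign)."""
    return math.log((1.0 + k) / (1.0 - k))


def quantize_uniform(value, bits, limit):
    """Index of `value` in a uniform quantizer with 2**bits cells spanning
    [-limit, limit]; values outside the range are clipped."""
    levels = 1 << bits
    step = 2.0 * limit / levels
    clipped = max(-limit, min(limit, value))
    return min(levels - 1, int((clipped + limit) / step))


def dequantize_uniform(index, bits, limit):
    """Reconstruction value at the centre of quantizer cell `index`."""
    step = 2.0 * limit / (1 << bits)
    return -limit + (index + 0.5) * step


if __name__ == "__main__":
    print(sum(LAR_BITS))                    # total bits spent on the 10 LARs
    lar = reflection_to_lar(0.3)
    idx = quantize_uniform(lar, LAR_BITS[0], 4.0)   # 4.0 is an assumed limit
    print(idx, dequantize_uniform(idx, LAR_BITS[0], 4.0))
```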
The excitation signal is the sum of two components obtained from two different codebooks: a multitap adaptive codebook and a fixed excitation codebook. A multitap adaptive codebook, with 3 taps, is employed to encode the pseudo-periodic pitch component of the linear prediction residual. An open-loop pitch prediction scheme is used to provide a pitch cue, which restricts the closed-loop multitap adaptive codebook search range to 8 lag levels around it. The adaptive codebook contribution consists of a linear combination of 3 adjacent time-shifted versions of the past excitation. Generating these 3 adjacent time-shifted versions is generally extremely complex and requires thousands of computations. In addition, the fixed excitation codebook search is generally a very complex operation when performed optimally. Codebook entries can also be selected by one of several sub-optimal processes, but these distort the original speech signal, and the resulting trade-off between complexity and quality is not suitable for some applications.
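The 3-tap adaptive codebook contribution and the restricted lag range can be sketched as follows. This is an illustration under stated assumptions, not the search procedure itself: the placement of the 8 candidate lags around the open-loop cue is assumed, the function names are hypothetical, and the sketch only handles lags longer than the subframe so that every shifted sample lies in the past excitation buffer.

```python
def candidate_lags(open_loop_lag, count=8):
    """Lags examined by the closed-loop search: `count` consecutive values
    around the open-loop pitch cue (the exact placement is an assumption)."""
    start = open_loop_lag - count // 2
    return list(range(start, start + count))


def adaptive_codevector(past, lag, taps, n):
    """Linear combination of 3 adjacent time-shifted versions of the past
    excitation: v[i] = sum_j taps[j] * e[i - (lag + j)] for j in (-1, 0, 1).

    past : past excitation samples, past[-1] being the most recent
    taps : gains (b_-1, b_0, b_+1)
    n    : subframe length; this sketch requires lag - 1 >= n
    """
    if lag - 1 < n:
        raise ValueError("lag too short for this simplified sketch")
    if lag + 1 > len(past):
        raise ValueError("not enough past excitation for this lag")
    out = []
    for i in range(n):
        acc = 0.0
        for j, b in zip((-1, 0, 1), taps):
            acc += b * past[len(past) - (lag + j) + i]
        out.append(acc)
    return out


if __name__ == "__main__":
    past = [float(x) for x in range(100)]    # stand-in past excitation
    for lag in candidate_lags(40):
        v = adaptive_codevector(past, lag, (0.1, 0.8, 0.1), 4)
        # A closed-loop search would score each lag and gain set here by
        # comparing the synthesized result with the target speech.
```

Even in this reduced form, each candidate lag costs three multiply-accumulates per sample before any filtering or error computation, which hints at why the full search is computationally expensive.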