Code Excited Linear Predictive (CELP) coding is a well-known stochastic coding technique for speech communication. In CELP coding, the short-time spectral and long-time pitch are modeled by a set of time-varying linear filters. In a typical speech coder based communication system, speech is sampled by an A/D converter at approximately twice the highest frequency desired to be transmitted, e.g., an 8 KHz sampling frequency is typically used for a 4 KHz voice bandwidth. CELP coding synthesizes speech by utilizing encoded excitation information to excite a linear predictive (LPC) filter. The excitation, which is used as inputs to the filters, is modeled by a codebook of white Gaussian signals. The optimum excitation is found by searching through a codebook of candidate excitation vectors on a frame-by-frame basis.
LPC analysis is performed on the input speech frame to determine the LPC parameters. Then the analysis proceeds by comparing the output of the LPC filter with the digitized input speech, when the LPC filter is excited by various candidate vectors from the table, i.e., the code book. The best candidate vector is chosen based on how well speech synthesized using the candidate excitation vector matches the input speech. This is usually performed on several subframes of speech.
After the best match has been found, information specifying the best codebook entry, the LPC filter coefficients and the gain coefficients are transmitted to the synthesizer. The synthesizer has the same copy of the codebook and accesses the appropriate entry in that codebook, using it to excite the same LPC filter.
The codebook is made up of vectors whose components are consecutive excitation samples. Each vector contains the same number of excitation samples as there are speech samples in the subframe or frame. The excitation samples can come from a number of different sources. Long term pitch coding is determined by the proper selection of a code vector from an adaptive codebook. The adaptive codebook is a set of different pitch periods of the previously synthesized speech excitation waveform.
The optimum selection of a code vector, either from the stochastic or the adaptive codebooks, depends on minimizing the perceptually weighted error function. This error function is typically derived from a comparison between the synthesized speech and the target speech for each vector in the codebook. These exhaustive comparison procedures require a large amount of computation and are usually not practical for a single Digital Signal Processor (DSP) to implement in real time. The ability to reduce the computation complexity without sacrificing voice quality is important in the digital communications environment.
The error function, codebook vector search, calculations are performed using vector and matrix operations of the excitation information and the LPC filter. The problem is that a large number of calculations, for example, approximately 5.times.10.sup.8 multiply-add operations per second for a 4.8 Kbps vocoder, must be performed. Prior art arrangements have not been entirely successful in reducing the number of calculations that must be performed. Thus, a need continues to exist for improved CELP coding means and methods that reduce the computational burden without sacrificing voice quality.
A prior art 4.8k bit/second CELP coding system is described in Federal Standard FED-STD-1016 issued by the General Services Administration of the United States Government. Prior art CELP vocoder systems are described for example in U.S. Pat. Nos. 4,899,385 and 4,910,781 to Ketchum et al., U.S. Pat. No. 4,220,819 to Atal, U.S. Pat. No. 4,797,925 to Lin, and U.S. Pat. No. 4,817,157 to Gerson, which are incorporated herein by reference.
Typical prior art CELP vocoder systems use an 8 kHz sampling rate and a 30 millisecond frame duration divided into four 7.5 millisecond subframes. Prior art CELP coding consists of three basic functions: (1) short delay "spectrum" prediction, (2) long delay "pitch" search, and (3) residual "code book" search.
While the present invention is described for the case of analog signals representing human speech, this is merely for convenience of explanation and, as used herein, the word "speech" is intended to include any form of analog signal of bandwidth within the sampling capability of the system.