The present invention generally relates to digital speech coding at low bit rates, and more particularly, is directed to an improved method for coding the excitation information for code-excited linear predictive speech coders.
Code-excited linear prediction (CELP) is a speech coding technique which has the potential of producing high quality synthesized speech at low bit rates, i.e., 4.8 to 9.6 kilobits-per-second (kbps). This class of speech coding, also known as vector-excited linear prediction or stochastic coding, will most likely be used in numerous speech communications and speech synthesis applications. CELP may prove to be particularly applicable to digital speech encryption and digital radiotelephone communication systems wherein speech quality, data rate, size, and cost are significant issues.
In a CELP speech coder, the long term ("pitch") and short term ("formant") predictors which model the characteristics of the input speech signal are incorporated in a set of time-varying linear filters. An excitation signal for the filters is chosen from a codebook of stored innovation sequences, or code vectors. For each frame of speech, the speech coder applies each individual code vector to the filters to generate a reconstructed speech signal, and compares the original input speech signal to the reconstructed signal to create an error signal. The error signal is then weighted by passing it through a weighting filter having a response based on human auditory perception. The optimum excitation signal is determined by selecting the code vector which produces the weighted error signal with the minimum energy for the current frame.
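The exhaustive search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the coder of any cited reference: only a short-term all-pole filter is modeled (the long-term predictor is omitted), the perceptual weighting filter defaults to a pass-through, and the names `make_codebook`, `synthesize`, and `search_codebook` are hypothetical.

```python
import random


def make_codebook(num_vectors, vector_len, seed=0):
    """Build a codebook of independent random white Gaussian code vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(vector_len)]
            for _ in range(num_vectors)]


def synthesize(excitation, coeffs):
    """All-pole filter: s[n] = e[n] + sum_k coeffs[k-1] * s[n-k].

    Assumption: this stands in for the time-varying predictor filters;
    the long-term ("pitch") predictor is left out for brevity.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, coeffs, weight=lambda err: err):
    """Try every code vector; return (index, weighted error energy) of the best.

    `weight` is a placeholder for the perceptual weighting filter
    (assumption: identity by default).
    """
    best_index, best_energy = -1, float("inf")
    for index, vector in enumerate(codebook):
        reconstructed = synthesize(vector, coeffs)
        error = [t - r for t, r in zip(target, reconstructed)]
        energy = sum(x * x for x in weight(error))
        if energy < best_energy:
            best_index, best_energy = index, energy
    return best_index, best_energy
```

Every code vector is filtered once per frame in this loop, which is the source of the computational burden of an exhaustive search.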
The term "code-excited" or "vector-excited" is derived from the fact that the excitation sequence for the speech coder is vector quantized, i.e., a single codeword is used to represent a sequence, or vector, of excitation samples. In this way, data rates of less than one bit per sample are possible for coding the excitation sequence. The stored excitation code vectors generally consist of independent random white Gaussian sequences. One code vector from the codebook is used to represent each block of N excitation samples. Each stored code vector is represented by a codeword, i.e., the address of the code vector memory location. It is this codeword that is subsequently sent over a communications channel to the speech synthesizer to reconstruct the speech frame at the receiver. See M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 3, pp. 937-40, March 1985, for a detailed explanation of CELP.
The difficulty of the CELP speech coding technique lies in the extremely high computational complexity of performing an exhaustive search of all the excitation code vectors in the codebook. For example, at a sampling rate of 8 kilohertz (kHz), a 5 millisecond (msec) frame of speech would consist of 40 samples. If the excitation information were coded at a rate of 0.25 bits per sample (corresponding to 2 kbps), then 10 bits of information are used to code each frame. Hence, the random codebook would then contain 2^10, or 1024, random code vectors. The vector search procedure requires approximately 15 multiply-accumulate (MAC) computations (assuming a third order long-term predictor and a tenth order short-term predictor) for each of the 40 samples in each code vector. This corresponds to 600 MACs per code vector per 5 msec speech frame, or approximately 120,000,000 MACs per second (600 MACs per 5 msec frame × 1024 code vectors). One can now appreciate the extraordinary computational effort required to search the entire codebook of 1024 vectors for the best fit, an unreasonable task for real-time implementation with today's digital signal processing technology.
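The arithmetic of this example can be reproduced with a short calculation. The sketch below only restates the figures given above; the function name `exhaustive_search_load` and its parameter names are hypothetical.

```python
def exhaustive_search_load(sample_rate_hz=8000, frame_ms=5,
                           bits_per_sample=0.25, macs_per_sample=15):
    """Recompute the exhaustive-search workload from the example's figures."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000       # 40 samples
    bits_per_frame = int(samples_per_frame * bits_per_sample)   # 10 bits
    num_vectors = 2 ** bits_per_frame                           # 1024 vectors
    macs_per_vector = macs_per_sample * samples_per_frame       # 600 MACs
    frames_per_second = 1000 // frame_ms                        # 200 frames
    return {
        "samples_per_frame": samples_per_frame,
        "excitation_bps": int(bits_per_sample * sample_rate_hz),
        "num_vectors": num_vectors,
        "macs_per_vector": macs_per_vector,
        "macs_per_second": macs_per_vector * num_vectors * frames_per_second,
    }


load = exhaustive_search_load()
print(load["macs_per_second"])  # 122880000, i.e. approximately 120,000,000
```

The exact product is 122,880,000 MACs per second, which the text rounds to 120,000,000.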
Moreover, the memory allocation requirement to store the codebook of independent random vectors is also exorbitant. For the above example, a 640 kilobit read-only-memory (ROM) would be required to store all 1024 code vectors, each having 40 samples, each sample represented by a 16-bit word. This ROM size requirement is inconsistent with the size and cost goals of many speech coding applications. Hence, prior art code-excited linear prediction is presently not a practical approach to speech coding.
One alternative for reducing the computational complexity of this code vector search process is to implement the search calculations in a transform domain. Refer to I. M. Trancoso and B. S. Atal, "Efficient Procedures for Finding the Optimum Innovation in Stochastic Coders", Proc. ICASSP, Vol. 4, pp. 2375-8, April 1986, as an example of such a procedure. Using this approach, discrete Fourier transforms (DFTs) or other transforms may be used to express the filter response in the transform domain such that the filter computations are reduced to a single MAC operation per sample per code vector. However, an additional 2 MACs per sample per code vector are also required to evaluate the code vector, thus resulting in a substantial number of multiply-accumulate operations, i.e., 120 per code vector per 5 msec frame, or 24,000,000 MACs per second in the above example. Still further, the transform approach requires at least twice the amount of memory, since the transform of each code vector must also be stored. In the above example, a 1.3 Megabit ROM would be required for implementing CELP using transforms.
A second approach for reducing the computational complexity is to structure the excitation codebook such that the code vectors are no longer independent of each other. In this manner, the filtered version of a code vector can be computed from the filtered version of the previous code vector, again using only a single filter computation MAC per sample. This approach results in approximately the same computational requirements as transform techniques, i.e., 24,000,000 MACs per second, while significantly reducing the amount of ROM required (16 kilobits in the above example). Examples of these types of codebooks are given in the article entitled "Speech Coding Using Efficient Pseudo-Stochastic Block Codes", Proc. ICASSP, Vol. 3, pp. 1354-7, April 1987, by D. Lin. Nevertheless, 24,000,000 MACs per second is presently beyond the computational capability of a single DSP. Moreover, the ROM size is based on 2^M × (number of bits per word), where M is the number of bits in the codeword such that the codebook contains 2^M code vectors. Therefore, the memory requirements still increase exponentially with the number of bits used to encode the frame of excitation information. For example, the ROM requirements increase to 64 kilobits when using 12 bit codewords.
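The memory and computation figures quoted for the three approaches can be compared side by side. The sketch below is illustrative only: the function names are hypothetical, a kilobit is taken as 1024 bits (an assumption that matches the 640, 16, and 64 kilobit figures above), and the transform-domain ROM is modeled as exactly twice the independent codebook, which the text gives only as a lower bound.

```python
KILOBIT = 1024  # assumption: 1 kilobit = 1024 bits, consistent with the text


def rom_kilobits(codeword_bits, samples_per_vector=40, bits_per_word=16):
    """ROM size in kilobits for each codebook organization."""
    num_vectors = 2 ** codeword_bits
    independent = num_vectors * samples_per_vector * bits_per_word
    return {
        "independent": independent // KILOBIT,
        "transform": 2 * independent // KILOBIT,            # lower bound
        "dependent": num_vectors * bits_per_word // KILOBIT,  # 2^M x bits/word
    }


def macs_per_second(macs_per_sample, codeword_bits=10,
                    samples_per_vector=40, frames_per_second=200):
    """Search workload for a given per-sample, per-code-vector MAC count."""
    return (macs_per_sample * samples_per_vector
            * 2 ** codeword_bits * frames_per_second)


print(rom_kilobits(10))     # {'independent': 640, 'transform': 1280, 'dependent': 16}
print(rom_kilobits(12))     # 'dependent' grows to 64
print(macs_per_second(15))  # 122880000 (exhaustive search)
print(macs_per_second(3))   # 24576000 (transform or dependent codebook)
```

Each additional codeword bit doubles every ROM figure, which is the exponential growth noted above.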
A need, therefore, exists to provide an improved speech coding technique that addresses both the problems of extremely high computational complexity for exhaustive codebook searching, as well as the vast memory requirements for storing the excitation code vectors.