The present invention relates to improvements in a method for digital compression of speech and other audio signals, and, more particularly, to improvements in stochastic code excited linear predictive encoding.
Code Excited Linear Predictive encoding (CELP) is well known as a means of digitally compressing speech and other audio signals to improve the efficiency of communication. Using CELP, the speech to be transmitted, referred to hereinafter as the "target speech," is analyzed by an encoder to determine a set of parameters and indices in a codebook of excitation vectors which best characterize the actual target speech waveform. It is these parameters and codebook indices which are transmitted, rather than signals representing the waveform of the target speech itself. Doing so realizes substantial savings in transmission costs, since the parameters and codebook indices require far less bandwidth to transmit than unprocessed speech. At the other end of the transmission, a compatible decoder synthesizes waveforms according to the received parameters and codebook indices, and thereby reconstructs the target speech. The present application uses the term "speech" to denote any analog signal over a spectrum up to 4 kHz.
In order to perform the analysis by which the codebook indices and parameters are determined, the original analog target speech waveform is first digitally sampled according to the Nyquist criterion at a minimum of twice the maximum frequency of the desired spectrum. For example, to attain a commonly-found 4 kHz maximum frequency, the sampling rate must be at least 8 kHz. The speech samples are then divided into sequential time frames. A typical frame at an 8 kHz sampling rate would contain 160 samples, corresponding to a 20 msec segment of speech. The frames are next divided into subframes. The codebook excitation vectors represent Gaussian noise samples; their vector size corresponds to the number of samples in a subframe. Hereinafter, N denotes the number of excitation vectors in a codebook. Typically, N is on the order of 128. When the appropriate excitation vector is selected from such a codebook and input into a weighted synthesis filter which has been set with suitable linear predictive coefficients (LPCs), the output of the weighted synthesis filter is a waveform which can closely approximate a segment of the speech waveform. It is the index of this excitation vector in the codebook which is transmitted along with the LPCs and associated parameters to compress the speech of that segment. All of the filters used in such an encoder are linear filters, and therefore when reference is made to a filter in the present application, it will be understood that it is a linear filter.
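The framing arithmetic described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the choice of four subframes per frame is an assumption made only for illustration; the text specifies the 160-sample frame at 8 kHz but not the subframe count.

```python
def frame_duration_ms(frame_len, sample_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_len / sample_rate_hz


def split_into_frames(samples, frame_len=160, subframes_per_frame=4):
    """Split samples into frames, each a list of equal-length subframes.

    subframes_per_frame=4 is an illustrative assumption, not a value
    taken from the text. Trailing samples that do not fill a whole
    frame are dropped.
    """
    if frame_len % subframes_per_frame != 0:
        raise ValueError("frame_len must be a multiple of subframes_per_frame")
    sub_len = frame_len // subframes_per_frame
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Each excitation vector would have sub_len elements.
        frames.append([frame[k:k + sub_len] for k in range(0, frame_len, sub_len)])
    return frames
```

Under these assumptions, a 160-sample frame at 8 kHz lasts 20 msec, and each subframe (and hence each excitation vector) would hold 40 samples.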
A crucial portion of the analysis performed by the encoder, therefore, is a search through the codebook to find the optimum excitation vector to use. This requires testing all the excitation vectors one at a time, by sending each excitation vector to the input of the weighted synthesis filter, and then comparing the output of the weighted synthesis filter to the sampled target speech waveform. The excitation vector which yields the closest fit to the target speech segment is selected. This excitation vector is referenced simply by its index i in the codebook, and therefore specifying the index i is equivalent to specifying the excitation vector c.sub.i.
FIG. 1, to which reference is now briefly made, illustrates conceptually the prior art method for selecting the optimum excitation vector from a codebook. Each excitation vector in the codebook is referenced by an index i; c.sub.i is thus the excitation vector corresponding to the index i. The target speech sample 14 t(n) is processed by a weighting filter 16, which is a function of the LPCs, to yield the weighted target speech sample t.sub.w (n). Each excitation vector c.sub.i of the codebook 10 is processed by the weighted synthesis filter 12 to result in a weighted synthesized speech segment S.sub.i (n), which is compared against the weighted target speech sample by comparator 18, whose output is the difference t.sub.w (n)-S.sub.i (n), which is the error vector E(n). Error computation 20 computes the mean squared error over the error vector for each codebook index i. The index i whose c.sub.i yields the minimal mean squared error is the selected index.
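The conceptual search of FIG. 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the weighted synthesis filter is modeled by a truncated impulse response h applied with zero initial state, no gain factor is applied to the excitation, and all function names are hypothetical.

```python
def synthesize(excitation, h):
    """Zero-state filter output S_i(n): convolution of the excitation with
    the impulse response h, truncated to the excitation length.
    (Modeling the filter this way is an assumption of this sketch.)"""
    n = len(excitation)
    return [
        sum(h[k] * excitation[m - k] for k in range(min(m + 1, len(h))))
        for m in range(n)
    ]


def mean_squared_error(t_w, s):
    """Mean of the squared elements of the error vector E(n) = t_w(n) - S_i(n)."""
    return sum((a - b) ** 2 for a, b in zip(t_w, s)) / len(t_w)


def search_by_synthesis(codebook, h, t_w):
    """Test every excitation vector and return (index, mse) of the best fit."""
    best_index, best_mse = None, float("inf")
    for i, c in enumerate(codebook):
        mse = mean_squared_error(t_w, synthesize(c, h))
        if mse < best_mse:
            best_index, best_mse = i, mse
    return best_index, best_mse
```

Every excitation vector passes through the filter once per search, which is why this exhaustive procedure is computationally expensive.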
In practice, the computation for selecting the codebook index is different from the conceptual procedure illustrated in FIG. 1, although it is mathematically equivalent. The impulse response of the weighted synthesis filter is represented by a matrix denoted by H, which may be constructed, for example, from the truncated impulse response of the weighted synthesis filter. The matrix H will be changed from one adaptive codebook subframe to the next. As is known in the art, the optimum excitation vector c.sub.i selected by the process illustrated in FIG. 1 has the property that there is a selection function which attains its maximum over the set of excitation vectors in the codebook at c.sub.i. This selection function is usually given as the error function .epsilon..sub.i:

\[ \epsilon_i = \frac{\left( t_w^{T} H c_i \right)^2}{\left\| H c_i \right\|^2} \qquad (1) \]

where t.sub.w.sup.T is the transpose of t.sub.w. The numerator of Equation (1) is the square of the cross-correlation of t.sub.w with the convolution of the impulse response H with the excitation vector c.sub.i. In general, a selection function will be a function of the energy term .parallel.Hc.sub.i .parallel..sup.2, which is the self-correlation of the convolution of the impulse response H with the excitation vector c.sub.i. When the error function is used as the selection function, Equation (1) is evaluated for each excitation vector to determine the optimal c.sub.i, and hence the desired index i. The vector quantity Hc.sub.i is the convolution of the impulse response of the weighted synthesis filter with the excitation vector c.sub.i, and therefore represents the excited weighted synthesized speech segment S.sub.i as shown in FIG. 1, which is the output of the weighted synthesis filter. A measure of similarity of the excited weighted synthesized speech segment S.sub.i and the target speech sample t.sub.w is their cross-correlation, t.sub.w.sup.T .multidot.Hc.sub.i.
This is a scalar quantity, and the higher its value, the closer the excited weighted synthesized speech segment S.sub.i is to the target speech sample t.sub.w, and the better the excitation vector c.sub.i is for synthesizing the output speech sample. The numerator of the right-hand side expression in Equation (1) is the square of the cross-correlation of the excited weighted synthesized speech segment and the target speech sample. The denominator of the right-hand side expression in Equation (1) represents the energy term of the excited weighted synthesized speech segment S.sub.i. Note that the convolution of H and c.sub.i is an important operation which appears in several places in the calculation of .epsilon..sub.i.
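A minimal Python sketch of a search using the selection function of Equation (1) follows. It assumes that Hc.sub.i is computed as a truncated, zero-state convolution with an impulse response h, and that the synthesized segment is scaled by an optimal gain g before comparison with the target; the gain is an assumption of this sketch and is not spelled out in the passage above. Under that assumption the residual squared error equals .parallel.t.sub.w .parallel..sup.2 minus .epsilon..sub.i, so minimizing the error and maximizing .epsilon..sub.i select the same index. All function names are hypothetical.

```python
def convolve_truncated(h, c):
    """H c_i: convolution of impulse response h with excitation c,
    truncated to len(c) samples (zero initial filter state assumed)."""
    return [
        sum(h[k] * c[m - k] for k in range(min(m + 1, len(h))))
        for m in range(len(c))
    ]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def selection_value(t_w, h, c):
    """Equation (1): squared cross-correlation divided by the energy term."""
    s = convolve_truncated(h, c)
    energy = dot(s, s)
    if energy == 0.0:
        return 0.0  # an all-zero synthesized segment cannot match anything
    return dot(t_w, s) ** 2 / energy


def search_by_selection(codebook, h, t_w):
    """Index i that maximizes the selection function over the codebook."""
    return max(range(len(codebook)),
               key=lambda i: selection_value(t_w, h, codebook[i]))


def gain_optimized_error(t_w, h, c):
    """Squared error after scaling S_i by the least-squares gain g
    (the gain is an assumption of this sketch)."""
    s = convolve_truncated(h, c)
    energy = dot(s, s)
    g = dot(t_w, s) / energy if energy else 0.0
    return sum((t - g * x) ** 2 for t, x in zip(t_w, s))
```

Note that each evaluation still requires the convolution Hc.sub.i, which dominates the cost of the search.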
Usually, CELP encoders utilize a pair of codebooks: an adaptive codebook and a fixed stochastic codebook. The excitation vectors of the fixed stochastic codebook are constant, whereas those of the adaptive codebook are updated by the encoder to accommodate the particular characteristics of the current target speech waveform. In analyzing a target speech waveform segment, an excitation vector is selected from each codebook. The two excitation vectors are combined in a weighted linear fashion and then sent as an input to the weighted synthesis filter. The procedure for selecting the optimum excitation vector as discussed above and illustrated in FIG. 1, and equivalently manifest in Equation (1), must be carried out for each of the codebooks.
Unfortunately, intensive numerical computation is needed to evaluate Equation (1), and so the processing required for codebook searching presents a major obstacle to improved CELP performance. Therefore, this is an area of interest in the field. For example, "Real-Time Vector Excitation Coding of Speech at 4800 BPS" by Davidson et al. (in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April, 1987, pages 2189-2192) explores issues such as the use of small, optimized codebooks that are easier to search, and presents an approximation for the evaluation of the energy term as given in Equation (1) by an autocorrelation approach which requires reduced computation. U.S. Pat. No. 5,265,190 discloses a method of simplifying the convolution computation in the cross-correlation terms for adaptive codebook searching. While improvements such as these have been useful in reducing the complexity of codebook searching, the computation is still intensive, and moreover these improvements do not address some of the specific needs of fixed stochastic codebook searching. For example, U.S. Pat. No. 5,265,190 does not disclose methods for fixed stochastic codebook searches, and, moreover, the method disclosed therein applies only to the cross-correlation term but not to the energy term.
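One common form of an autocorrelation approach to the energy term can be sketched in Python as follows. This is a generic illustration, not necessarily the formulation used in the cited paper, and the function names are hypothetical. It relies on the identity that the energy of the full (untruncated) convolution of h and c equals R.sub.h (0)R.sub.c (0) plus twice the sum over positive lags k of R.sub.h (k)R.sub.c (k), where R.sub.h and R.sub.c are the autocorrelations of h and c. For the truncated convolution of a subframe-based search, the same expression is only an approximation.

```python
def autocorrelation(x, max_lag):
    """R_x(k) for k = 0..max_lag (lags beyond len(x) - 1 evaluate to zero)."""
    return [
        sum(x[n] * x[n + k] for n in range(len(x) - k))
        for k in range(max_lag + 1)
    ]


def energy_direct(h, c):
    """Energy of the full convolution of h and c, computed sample by sample."""
    out = [0.0] * (len(h) + len(c) - 1)
    for i, hv in enumerate(h):
        for j, cv in enumerate(c):
            out[i + j] += hv * cv
    return sum(v * v for v in out)


def energy_from_autocorrelations(r_h, r_c):
    """Energy term from precomputed autocorrelations:
    R_h(0) R_c(0) + 2 * sum_{k >= 1} R_h(k) R_c(k)."""
    lags = min(len(r_h), len(r_c))
    return r_h[0] * r_c[0] + 2.0 * sum(r_h[k] * r_c[k] for k in range(1, lags))
```

Because the excitation vectors of a fixed codebook are constant, their autocorrelations could be computed once and stored, and the autocorrelation of h computed once per subframe; the per-vector cost of the energy term then drops from a convolution to a short inner product.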
Thus there is a recognized need for, and it would be advantageous to have, methods of further reducing the amount of processing needed to select the optimum excitation vector from a codebook, in particular for a CELP encoder that has both a fixed stochastic codebook and an adaptive codebook. The innovation of the present invention attains this goal for a certain class of CELP encoders with both an adaptive codebook and a fixed stochastic codebook. In addition, CELP techniques currently attain a very high degree of perceptual fidelity, and it is desired to retain this fidelity while making improvements to the CELP process itself. Therefore, a further goal realized by the present invention is the improvement of processing efficiencies without the introduction of any perceptible distortion or other degradation in the quality of the reconstructed speech.