The present invention generally relates to digital speech coding at low bit rates, and more particularly, is directed to an improved method for determining long-term predictor output responses for code-excited linear prediction speech coders.
Code-excited linear prediction (CELP) is a speech coding technique which has the potential of producing high quality synthesized speech at low bit rates, i.e., 4.8 to 9.6 kilobits-per-second (kbps). This class of speech coding, also known as vector-excited linear prediction or stochastic coding, will most likely be used in numerous speech communications and speech synthesis applications. CELP may prove to be particularly applicable to digital speech encryption and digital radiotelephone communication systems wherein speech quality, data rate, size, and cost are significant issues.
The term "code-excited" or "vector-excited" is derived from the fact that the excitation sequence for the speech coder is vector quantized, i.e., a single codeword is used to represent a sequence, or vector, of excitation samples. In this way, data rates of less than one bit per sample are possible for coding the excitation sequence. The stored excitation code vectors generally consist of independent random white Gaussian sequences. One code vector from the codebook is chosen to represent each block of N excitation samples. Each stored code vector is represented by a codeword, i.e., the address of the code vector memory location. It is this codeword that is subsequently sent over a communications channel to the speech synthesizer to reconstruct the speech frame at the receiver. See M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 3, pp. 937-40, March 1985, for a more detailed explanation of CELP.
In a CELP speech coder, the excitation code vector from the codebook is applied to two time-varying linear filters which model the characteristics of the input speech signal. The first filter includes a long-term predictor in its feedback loop, which has a long delay, i.e., 2 to 15 milliseconds, used to introduce the pitch periodicity of voiced speech. The second filter includes a short-term predictor in its feedback loop, which has a short delay, i.e., less than 2 msec, used to introduce a spectral envelope or formant structure. For each frame of speech, the speech coder applies each individual code vector to the filters to generate a reconstructed speech signal, and compares the original input speech signal to the reconstructed signal to create an error signal. The error signal is then weighted by passing it through a weighting filter having a response based on human auditory perception. The optimum excitation signal is determined by selecting the code vector which produces the weighted error signal having the minimum energy for the current frame. The codeword for the optimum code vector is then transmitted over a communications channel.
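The exhaustive codebook search described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the gain factor is taken as unity, and the perceptual weighting filter is omitted; `synth_filter` stands in for the cascade of long-term and short-term filters.

```python
import numpy as np

def celp_search(codebook, target, synth_filter):
    """Return (index, error energy) of the code vector whose synthesized
    output best matches the target frame in the minimum-energy sense."""
    best_i, best_err = -1, float("inf")
    for i, cv in enumerate(codebook):
        y = synth_filter(cv)                    # long-term + short-term filtering
        err = float(np.sum((target - y) ** 2))  # energy of the error signal
        if err < best_err:
            best_i, best_err = i, err
    return best_i, best_err
```

Only the winning index (the codeword) need be transmitted; the synthesizer holds an identical codebook and looks the vector up by address.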
In a CELP speech synthesizer, the codeword received from the channel is used to address the codebook of excitation vectors. The single code vector is then multiplied by a gain factor, and filtered by the long-term and short-term filters to obtain a reconstructed speech vector. The gain factor and the predictor parameters are also obtained from the channel. It has been found that a better quality synthesized signal can be generated if the actual parameters used by the synthesizer are used in the analysis stage, thus minimizing the quantization errors. Hence, the use of these synthesis parameters in the CELP speech analysis stage to produce higher quality speech is referred to as analysis-by-synthesis speech coding.
The short-term predictor attempts to predict the current output sample s(n) by a linear combination of the immediately preceding output samples s(n-i), according to the equation:

s(n) = α₁s(n-1) + α₂s(n-2) + … + αₚs(n-p) + e(n)
where p is the order of the short-term predictor, and e(n) is the prediction residual, i.e., that part of s(n) that cannot be represented by the weighted sum of p previous samples. The predictor order p typically ranges from 8 to 12, assuming an 8 kilohertz (kHz) sampling rate. The weights α₁, α₂, …, αₚ in this equation are called the predictor coefficients. The short-term predictor coefficients are determined from the speech signal using conventional linear predictive coding (LPC) techniques. The output response of the short-term filter may be expressed in z-transform notation as:

1 / A(z) = 1 / (1 - α₁z^(-1) - α₂z^(-2) - … - αₚz^(-p))

Refer to the article entitled "Predictive Coding of Speech at Low Bit Rates", IEEE Trans. Commun., Vol. COM-30, pp. 600-614, April 1982, by B. S. Atal, for further discussion of the short-term filter parameters.
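The short-term prediction above can be sketched directly from its defining equation. The function name is illustrative; samples before n = 0 are taken as zero (an empty filter state), a simplifying assumption.

```python
import numpy as np

def short_term_residual(s, a):
    """Compute the prediction residual e(n) = s(n) - sum_i a[i]*s(n-1-i)
    for a p-th order short-term predictor with coefficients a[0..p-1]."""
    p = len(a)
    e = np.zeros(len(s))
    for n in range(len(s)):
        # weighted sum of the p immediately preceding samples
        pred = sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        e[n] = s[n] - pred
    return e
```

For a first-order predictor with α₁ = 0.5 and s = [1.0, 2.5], the residual is [1.0, 2.0], since 2.5 - 0.5·1.0 = 2.0.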
The long-term filter, on the other hand, must predict the next output sample from preceding samples that extend over a much longer time period. If only a single past sample is used in the predictor, then the predictor is a single-tap predictor. Typically, one to three taps are used. The output response for a long-term filter incorporating a single-tap, long-term predictor is given in z-transform notation as:

1 / (1 - βz^(-L))

Note that this output response is a function of only the delay or lag L of the filter and the filter coefficient β. For voiced speech, the lag L would typically be the pitch period of the speech, or a multiple of it. At a sampling rate of 8 kHz, a suitable range for the lag L would be between 16 and 143 samples, which corresponds to a pitch range between 500 Hz and 56 Hz, respectively.
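In the time domain, the single-tap long-term synthesis filter above is the recursion y(n) = x(n) + β·y(n-L). A minimal sketch, assuming zero filter memory before n = 0 and an illustrative function name:

```python
import numpy as np

def long_term_synthesis(x, beta, L):
    """Single-tap long-term (pitch) synthesis: y(n) = x(n) + beta*y(n-L),
    with y(n-L) taken as zero while n < L (empty filter memory)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (beta * y[n - L] if n >= L else 0.0)
    return y
```

At an 8 kHz sampling rate a lag L represents a pitch of 8000/L Hz, so L = 16 gives 500 Hz and L = 143 gives approximately 56 Hz, matching the range above.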
The long-term predictor lag L and long-term predictor coefficient β can be determined from either an open-loop or a closed-loop configuration. Using the open-loop configuration, the lag L and coefficient β are computed from the input signal (or its residual) directly. In the closed-loop configuration, the lag L and the coefficient β are computed at the frame rate from coded data representing the past output of the long-term filter and the input speech signal. In using the coded data, the long-term predictor lag determination is based on the actual long-term filter state that will exist at the synthesizer. Hence, the closed-loop configuration gives better performance than the open-loop method, since the pitch filter itself contributes to the minimization of the error signal. Moreover, a single-tap predictor works very well in the closed-loop configuration.
Using the closed-loop configuration, the long-term filter output response b(n) is determined from only past output samples from the long-term filter, and from the current input speech samples s(n), according to the equation:

b(n) = s(n) + βb(n-L)
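When the lag is at least a frame length (L ≥ N), every reference b(n-L) falls in a previous frame, so one frame of the closed-loop recursion can be computed from the stored past output alone. A sketch under that assumption (names illustrative):

```python
import numpy as np

def ltp_frame_output(s, past_b, beta, L):
    """One frame of b(n) = s(n) + beta*b(n-L), valid only when L >= N so
    that every b(n-L) is a stored past output sample in past_b."""
    N = len(s)
    assert L >= N and len(past_b) >= L
    # n - L is negative for all n in the frame; index back from the
    # end of the past-output buffer
    return np.array([s[n] + beta * past_b[len(past_b) - L + n] for n in range(N)])
```

The assertion marks exactly the condition whose failure, for L < N, motivates the problem discussed below.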
This technique is straightforward for pitch lags L which are greater than or equal to the frame length N, i.e., when L ≥ N, since the term b(n-L) will always represent a past sample for all sample numbers n, 0 ≤ n ≤ N-1. Furthermore, in the case of L ≥ N, the excitation gain factor γ and the long-term predictor coefficient β can be simultaneously optimized for given values of lag L and codeword i. It has been found that this joint optimization technique yields a noticeable improvement in speech quality.
If, however, long-term predictor lags L of less than the frame length N must be accommodated, the closed-loop approach fails. This problem can readily occur in the case of high-pitched female speech. For example, a female voice with a pitch frequency of 250 Hz requires a long-term predictor lag L equal to 4 milliseconds (msec); at an 8 kHz sampling rate, this corresponds to a lag L of 32 samples. It is not desirable, however, to employ a frame length N of less than 4 msec, since the CELP excitation vector can be coded more efficiently when longer frame lengths are used. Accordingly, utilizing a frame length time of 7.5 msec at a sampling rate of 8 kHz, the frame length N would be equal to 60 samples. This means only 32 past samples would be available to predict the next 60 samples of the frame. Hence, if the long-term predictor lag L is less than the frame length N, only L past samples of the required N samples are defined.
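The shortfall can be made concrete with a small counting sketch (function name illustrative): b(n-L) refers to a previous frame only when n < L, so the number of frame samples computable purely from past output is min(N, L).

```python
def samples_defined_from_past(N, L):
    """Count samples n in a frame of length N for which b(n - L) lies in
    a previous frame, i.e., n - L < 0; equals min(N, L)."""
    return sum(1 for n in range(N) if n - L < 0)
```

For the example above (N = 60, L = 32), only 32 of the 60 required past samples are defined, while any L ≥ 60 defines all 60.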
Several alternative approaches have been taken in the prior art to address the problem of pitch lags L being less than frame length N. In attempting to jointly optimize the long-term predictor lag L and coefficient β, the first approach would be to solve the equations directly, assuming no excitation signal is present. This approach is explained in the article entitled "Regular-Pulse Excitation - A Novel Approach to Effective and Efficient Multipulse Coding of Speech" by Kroon, et al., IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 5, October 1986, pp. 1054-1063. However, in following this approach, a nonlinear equation in the single parameter β must be solved, and the solution of the resulting quadratic or cubic in β is computationally impractical. Moreover, jointly optimizing the coefficient β with the gain factor γ is still not possible with this approach.
A second solution, that of limiting the long-term predictor delay L to be greater than the frame length N, is proposed by Singhal and Atal in the article "Improving Performance of MultiPulse LPC Coders at Low Bit Rates", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, March 19-21, 1984, pp. 1.3.1-1.3.4. This artificial constraint on the pitch lag L often does not accurately represent the pitch information. Accordingly, using this approach, the voice quality is degraded for high-pitched speech.
A third solution is to reduce the size of the frame length N. With a shorter frame length, the long-term predictor lag L can always be determined from past samples. This approach, however, suffers from a severe bit rate penalty. With a shorter frame length, a greater number of long-term predictor parameters and excitation vectors must be coded, and accordingly, the bit rate of the channel must be greater to accommodate the extra coding.
A second problem exists for high-pitched speakers. The sampling rate used in the coder places an upper limit on the performance of a single-tap pitch predictor. For example, if the pitch frequency is actually 485 Hz, the pitch period of 8000/485, or approximately 16.5 samples, rounds to the closest lag value of 16, which corresponds to 500 Hz. This results in an error of 15 Hz in the fundamental pitch frequency, which degrades voice quality. This error is multiplied for the harmonics of the pitch frequency, causing further degradation.
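The integer-lag quantization error above can be computed with a short sketch (names illustrative): the pitch period fs/f is rounded to the nearest integer lag L, and the represented frequency becomes fs/L.

```python
def pitch_lag_error(f_pitch, fs=8000):
    """Round the pitch period fs/f_pitch to the nearest integer lag and
    return (lag, represented frequency, absolute frequency error in Hz)."""
    L = round(fs / f_pitch)
    f_rep = fs / L
    return L, f_rep, abs(f_rep - f_pitch)
```

For 485 Hz this yields lag 16 and a represented pitch of 500 Hz, a 15 Hz error; at the second harmonic the same lag implies 1000 Hz against an actual 970 Hz, so the error grows with each harmonic.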
A need, therefore, exists to provide an improved method for determining the long-term predictor lag L. The optimum solution must address both the problems of computational complexity and voice quality for the coding of high-pitched speech.