The present invention relates to a method for encoding speech.
It is highly desirable to be able to store and transmit speech signals using a reduced bandwidth. For example, if a speech signal with a bandwidth of 8000 Hz is sampled at the Nyquist rate (16,000 samples per second) with 12-bit accuracy, the resulting data rate is 192 kilobits per second of speech. Since the actual information content of speech is far smaller than this, it is extremely desirable to reduce the data rate required to encode speech to something closer to the actual information content as received by a human listener. Such compressed speech coding has three principal areas of application, each of major importance: synthetic speech, transmission of spoken messages, and speech recognition.
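The data-rate figure in the example above follows from simple arithmetic, sketched here in Python. The function name `pcm_bit_rate` is hypothetical and is used only for illustration.

```python
def pcm_bit_rate(bandwidth_hz, bits_per_sample):
    """Bit rate of uncompressed samples taken at the Nyquist rate."""
    sample_rate = 2 * bandwidth_hz  # Nyquist rate: twice the signal bandwidth
    return sample_rate * bits_per_sample


rate = pcm_bit_rate(8000, 12)
print(rate)  # 192000 bits per second, i.e. almost 200 kilobits per second
```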
A principal line of effort toward this end has been linear predictive coding (LPC) of speech. In the general linear prediction model, a signal s.sub.n is considered to be the output of a system with an input u.sub.n such that the following relation holds: $$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G \sum_{m=0}^{q} b_m u_{n-m}$$ where b.sub.0 is defined as one, and a.sub.k (k ranging over integers between 1 and p inclusive), b.sub.m (m ranging over integers between 1 and q inclusive), and the gain G are the parameters of the hypothesized system. Since the signal s.sub.n is modeled as a linear function of past outputs and present and past inputs, linear prediction from these outputs and inputs specifies the value of s.sub.n.
A slightly simplified version of this model, which is much more tractable, is the autoregressive or all-pole model. In this model, the signal s.sub.n is assumed to be a linear combination of the p most recent past values and of a single input value u.sub.n : $$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G u_n$$ where G is a gain factor. By taking the z transform of both sides of this equation, the system transfer function H(z) is $$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}$$ Given a particular signal sequence s.sub.n, analysis according to this model produces predictor coefficients a.sub.k and the gain G as speech parameters, in addition to the (assumed) input signal u.sub.n.
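The all-pole model can be illustrated with a minimal Python sketch. The sketch assumes the common sign convention in which s.sub.n equals minus the weighted sum of the p past outputs plus G times u.sub.n. It uses the autocorrelation method with the Levinson-Durbin recursion as one well-known way to obtain the predictor coefficients. That choice, and all function names, are assumptions made for illustration and are not stated in the text above.

```python
def synthesize(a, gain, u):
    """All-pole synthesis: s[n] = -sum_k a[k-1] * s[n-k] + gain * u[n]."""
    s = []
    for n, u_n in enumerate(u):
        acc = gain * u_n
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * s[n - k]
        s.append(acc)
    return s


def autocorrelation(s, max_lag):
    """Autocorrelation values R(0) .. R(max_lag) of a finite sequence."""
    return [sum(s[n] * s[n - lag] for n in range(lag, len(s)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the autocorrelation normal equations for an order-p predictor.

    Returns (a, k, err): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p, and the final prediction error energy.
    """
    a, ks, err = [], [], r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j - 1] * r[i - j] for j in range(1, i))
        k_i = -acc / err
        a = [a[j - 1] + k_i * a[i - j - 1] for j in range(1, i)] + [k_i]
        ks.append(k_i)
        err *= 1.0 - k_i * k_i
    return a, ks, err


# Usage: excite a known second-order filter with a unit impulse, then
# recover its parameters from the output alone.
true_a, true_gain = [-0.9, 0.64], 2.0
impulse = [1.0] + [0.0] * 299
signal = synthesize(true_a, true_gain, impulse)
est_a, est_k, est_err = levinson_durbin(autocorrelation(signal, 2), 2)
print(est_a, est_err ** 0.5)  # close to true_a and true_gain
```

For impulse excitation of a stable filter whose response has decayed within the analysis window, the estimated coefficients agree with the true ones up to rounding, and the square root of the residual energy approximates the gain G.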
In a widely used model of human speech, the human voice is modeled as a combination of an excitation function (input signal) with a linear predictive filter. Once the system has been analyzed in this fashion, the excitation function can normally be transmitted at quite a low bit rate.
To represent speech in accordance with the LPC model, the predictor coefficients a.sub.k, or some equivalent set of parameters, must be transmitted to permit the correct linear predictor to be used in the resynthesized speech signal which is reconstructed at the receiver. In the prior art, reflection coefficients k.sub.i have often been used as the transmitted parameters. Another alternative set of parameters is the set of poles of the transfer function H(z). The desirable features to be selected for, in deciding which set of parameters is to represent the LPC model, include: 1. The stability of the LPC filter should be guaranteed. This is true with poles or reflection coefficients, but not with predictor coefficients. 2. The parameters transmitted should preferably correspond fairly closely to perceptual parameters, to permit perceptually efficient use of bandwidth. This is a particular advantage of poles. 3. A minimum computational load should be imposed, at both transmitting and receiving ends. 4. Preferably the parameters should have a natural ordering.
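The stability criterion above can be illustrated with a short Python sketch that converts between predictor coefficients and reflection coefficients. It assumes the sign convention in which the inverse filter is one plus the sum of a.sub.k times z to the minus k, and relies on the standard result that the filter is stable exactly when every reflection coefficient has magnitude less than one. The function names are hypothetical.

```python
def predictor_to_reflection(a):
    """Step-down recursion from a_1..a_p to k_1..k_p.

    Returns None when some |k_i| >= 1, that is, when the all-pole
    filter defined by the predictor coefficients is not stable.
    """
    cur = list(a)
    ks = []
    while cur:
        i = len(cur)
        k = cur[-1]  # k_i is the last coefficient of the order-i predictor
        if abs(k) >= 1.0:
            return None
        ks.append(k)
        cur = [(cur[j] - k * cur[i - 2 - j]) / (1.0 - k * k)
               for j in range(i - 1)]
    ks.reverse()
    return ks


def reflection_to_predictor(ks):
    """Step-up recursion from k_1..k_p back to a_1..a_p."""
    a = []
    for k in ks:
        i = len(a) + 1
        a = [a[j] + k * a[i - 2 - j] for j in range(i - 1)] + [k]
    return a


def is_stable(a):
    """True when all reflection coefficients lie strictly inside (-1, 1)."""
    return predictor_to_reflection(a) is not None


ks = predictor_to_reflection([-0.9, 0.64])
print(ks)                       # both magnitudes are below one
print(is_stable([-2.1, 1.1]))   # False: the last coefficient exceeds one
```

A decoder that receives reflection coefficients only needs the step-up recursion, which is a small computational load compared with root finding.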
An optimized system which satisfies the above requirements is of course very useful not only for transmitting speech, but also for storing synthetic speech. Such a system also has benefits in the areas of speech recognition and speaker identification.
A particular requirement of synthetic speech is a minimum number of bits per second of speech and a minimum computational load at the speech decoder. If these criteria can be met, quite a heavy computational load in encoding can be tolerated.
Thus, it is an object of the present invention to provide a method for storing synthetic speech at a very low bit rate, such that the stored synthetic speech can be decoded with only a small computational load.
Simultaneously-filed application No. 373,959, now U.S. Pat. No. 4,536,886, which is hereby incorporated by reference, teaches a method for encoding the roots of the LPC inverse filter. However, since the study of spectrograms shows that the formants of human speech vary slowly over time, repeated direct encoding of the poles (which show time-varying behavior generally corresponding to that of the formants) would miss the major data redundancy which is provided by the slow change of phase of the poles over time, and thus would consume unnecessary bandwidth.
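The correspondence between poles and formants can be illustrated with the standard mapping between a complex pole and a resonance: the phase of the pole gives the center frequency, and its radius gives the bandwidth. The following Python sketch is illustrative only. The function names and the 8000 Hz sampling rate are assumptions, and the sketch is not the encoding method of the referenced application.

```python
import cmath
import math


def pole_to_formant(pole, sample_rate_hz):
    """Map a pole r * exp(j * theta) to (center frequency, bandwidth) in Hz."""
    r = abs(pole)
    theta = abs(cmath.phase(pole))
    frequency_hz = theta * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz


def formant_to_pole(frequency_hz, bandwidth_hz, sample_rate_hz):
    """Inverse mapping: resonance parameters to the upper-half-plane pole."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * frequency_hz / sample_rate_hz
    return cmath.rect(r, theta)


# Usage: a 500 Hz resonance with 60 Hz bandwidth at an assumed 8000 Hz rate.
pole = formant_to_pole(500.0, 60.0, 8000.0)
print(abs(pole), cmath.phase(pole))
print(pole_to_formant(pole, 8000.0))
```

Because a formant that drifts by a few tens of hertz between frames changes the pole phase only slightly, successive values of the phase are highly correlated, which is the redundancy referred to above.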
It is an object of the present invention to provide a method for encoding speech with minimum bandwidth.
It is a further object of the present invention to provide a method for encoding speech by using the poles of the linear predictive coding model, without requiring unnecessary bandwidth.
It is a further object of the present invention to provide a method for encoding speech, according to the poles of the LPC model, which tracks the behavior of pole parameters over time.
It is a further object of the present invention to provide a method for encoding speech according to the poles of the LPC model, which tracks the behavior of pole parameters over time using a minimum number of bits.
Other speech parameters also vary relatively smoothly over time. In particular, the reflection coefficients are likely to be well behaved. A particular advantage of reflection coefficients or poles over predictor coefficients is that stability of the LPC filter, in the receiver, is guaranteed. By contrast, a relatively small error in the values of the predictor coefficients can unpredictably introduce instability.
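The stability argument can be illustrated for a second-order filter, where the poles are available in closed form. In the following Python sketch, the example coefficient values, the 16-level uniform quantizer, and the function names are all assumptions chosen for illustration. The sign convention assumed is an inverse filter of 1 + a1 z^-1 + a2 z^-2.

```python
import cmath


def poles_order2(a1, a2):
    """Poles of 1 / (1 + a1 z^-1 + a2 z^-2): the roots of z^2 + a1 z + a2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0


def is_stable2(a1, a2):
    """True when both poles lie strictly inside the unit circle."""
    return all(abs(z) < 1.0 for z in poles_order2(a1, a2))


def quantize_reflection(k, levels=16):
    """Uniform quantizer whose output levels all lie strictly inside (-1, 1)."""
    step = 2.0 / levels
    index = min(levels - 1, max(0, int((k + 1.0) / step)))
    return -1.0 + (index + 0.5) * step  # midpoint of the selected cell


def reflection_to_predictor2(k1, k2):
    """Second-order step-up: a1 = k1 * (1 + k2), a2 = k2."""
    return k1 * (1.0 + k2), k2


a1, a2 = -1.8, 0.96                      # stable, poles near the unit circle
print(is_stable2(a1, a2))                # True
print(is_stable2(a1, a2 + 0.05))         # False: a small error in a2

k2 = a2                                  # second-order step-down
k1 = a1 / (1.0 + a2)
q1, q2 = quantize_reflection(k1), quantize_reflection(k2)
print(is_stable2(*reflection_to_predictor2(q1, q2)))  # True despite coarse steps
```

Any quantizer whose output levels stay strictly inside the interval from minus one to one yields a stable filter, however coarse it is, whereas no comparably simple bound exists for the predictor coefficients.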
Thus, it is a further object of the present invention to provide a method for encoding the behavior of speech parameters over time, using a minimum number of bits.
Prior art has suggested time-tracking of speech parameters, specifically including LPC parameters, to reduce required bandwidth. See D. T. Magill, "Adaptive Speech Compression for Packet Communication Systems", Telecommunication Conference Record, IEEE publication 73 CHO 805-2, 29d 1-5, 1973; J. Makhoul et al, "Natural Communication with Computers", Final Report, Vol. 2, Speech Compression at BBN, Report No. 2976, December 1974; and R. Viswanathan et al, "Speech Compression and Evaluation", Final Report, BBN Report No. 3794, April 1978. The Magill method transmitted a new set of speech parameters only after the vocal tract filter was detected to have changed significantly. Change was measured as dissimilarity between adjacent frames, using a distance metric which is equivalent to Itakura's log-likelihood ratio. The Makhoul et al and Viswanathan et al approaches interpolated parameters between transmitted frames, introduced thresholds for the dissimilarity measure so that interpolation between very different data frames is avoided, and used dissimilarity measures other than the log-likelihood ratio.
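The general idea of these prior art approaches can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reproduction of any cited method: a plain Euclidean distance stands in for the log-likelihood ratio, each frame is compared with the most recently transmitted frame, reconstruction uses linear interpolation, and all names and values are hypothetical.

```python
import math


def euclidean(x, y):
    """Stand-in dissimilarity measure (assumption; not the Itakura measure)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def select_frames(frames, threshold, distance=euclidean):
    """Indices of frames to transmit: the first frame, then each frame that
    differs from the last transmitted frame by more than the threshold."""
    if not frames:
        return []
    sent = [0]
    for i in range(1, len(frames)):
        if distance(frames[i], frames[sent[-1]]) > threshold:
            sent.append(i)
    return sent


def reconstruct(frames, sent, total):
    """Rebuild `total` frames from the transmitted ones by linear
    interpolation, holding the last transmitted frame at the end."""
    out = []
    for n in range(total):
        prev = max(i for i in sent if i <= n)
        later = [i for i in sent if i >= n]
        if not later or later[0] == prev:
            out.append(list(frames[prev]))
            continue
        nxt = later[0]
        t = (n - prev) / (nxt - prev)
        out.append([(1.0 - t) * a + t * b
                    for a, b in zip(frames[prev], frames[nxt])])
    return out


# Usage with one parameter per frame (values are illustrative only).
frames = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.0]]
sent = select_frames(frames, threshold=0.5)
rebuilt = reconstruct(frames, sent, len(frames))
print(sent)  # [0, 3]
```

In this example only two of the six frames are transmitted, and the receiver fills in the remaining frames from them.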