This invention relates to linear predictive coding. In particular, it is a method and means of reducing the required channel bandwidth by minimizing the amount of information that is sent in transmitting an electrical signal representing speech or any other signal that has been subjected to linear predictive coding.
Linear predictive coding (LPC) is a method of digital coding that is particularly useful for voice signals. It makes use of the fact that in speech the pressure variations that constitute the sound follow patterns that stay the same for relatively long periods of time. LPC is normally applied to make use of four items of information about speech. The first of these is that speech may be either voiced or unvoiced. Voiced signals are signals that begin with a buzz from the vocal cords, while the vocal cords are inactive for unvoiced signals. Either voiced or unvoiced signals are further characterized by three sets of parameters. The first of these is energy or gain which is a measure of the loudness of a speaker. The second is pitch which is the fundamental frequency, if any, that is being generated by the vocal cords of a speaker. The third is some measure of the filtering effect of the vocal tract upon the vibrations generated by the vocal cords or other sound-generating mechanisms. Unvoiced signals are characterized only by energy and vocal-tract parameters; they have no pitch. The vocal tract is typically modeled as a stacked array of cylindrical tubes of varying lengths and diameters that has a series of damped mechanical resonances. The generation of a particular speech sound is carried out, both conceptually and actually, by passing the buzz of the vocal cords or the noise of moving air through the array of resonant structures. When a speaker changes the particular sound that he is making without changing his pitch, he is changing the dimensions of the resonant structures and hence the effect of the vocal tract upon the excitation signal generated by the vocal cords.
It would be possible to characterize the resonances of the vocal tract in a number of ways. These include characterizing the impulse response of the inverse filter, the coefficient values for the LPC direct-form filter, the autocorrelation function of the input speech, filter coefficients for the lattice filter (the so-called reflection coefficients k), the coefficients of the discrete Fourier transform of the autocorrelation function, and various transformations of the reflection coefficients. Speech can then be described in a digital system by developing digital characterizations of the voicing, the energy, the pitch, and of an equivalent filter. Because of the nature of speech, a particular set of filter coefficients will hold essentially the same values for tens or hundreds of milliseconds. This enables the characterization of speech to be made with sufficient fidelity by chopping that speech into frames of the order of 10 to 50 milliseconds in length and ignoring the possibility of any variation in the LPC parameters during that frame. It is also satisfactory in almost all instances to characterize the mechanical resonances by limiting the allowed number of reflection coefficients to ten.
Various transformations of the reflection coefficients have been used to describe the filter equivalent of the vocal tract. One that is of particular value is the logarithmic area ratio, abbreviated LAR, which is the logarithm of the ratio of the magnitude of (1+k) to the magnitude of (1-k). A typical frame of speech that has been reduced to linear predictive coding will comprise a header to indicate the start of the frame and a number of binary digits, sent during a period of time of the order of the frame period, that signal the voiced or unvoiced decision, the energy level, the pitch, and ten LARs. While the computation time necessary to determine the pitch and the LARs is of the order of the period of a frame or less, the systems of calculation that are available require information from a number of adjacent frames. For this reason, the information that is being sent to describe an LPC signal normally runs five, ten, or even more frames behind the speaker. Such a time delay is imperceptible to a listener, so that the coding is properly described as taking place in real time.
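The logarithmic area ratio defined above can be illustrated by a minimal sketch. The function names are hypothetical, the natural logarithm is an assumption (the text does not specify a base), and the restriction of k to values strictly between -1 and 1 is likewise assumed here so that the ratio is finite.

```python
import math


def log_area_ratio(k: float) -> float:
    """Return log(|1 + k| / |1 - k|) for a reflection coefficient k.

    Assumption: natural logarithm, and -1 < k < 1 so the ratio is finite.
    """
    if not -1.0 < k < 1.0:
        raise ValueError("reflection coefficient must lie strictly between -1 and 1")
    return math.log(abs(1.0 + k) / abs(1.0 - k))


def reflection_from_lar(lar: float) -> float:
    """Invert the natural-log LAR: since LAR = 2 * atanh(k), k = tanh(LAR / 2)."""
    return math.tanh(lar / 2.0)
```

Under these assumptions a reflection coefficient of zero maps to an LAR of zero, and the two functions are inverses of one another, so a decoder could recover k from a received LAR.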
A typical frame of speech that has been encoded in digital form using linear predictive coding will have a specified allocation of binary digits to describe the gain, the pitch and each of ten LARs. Each successive pair of LARs represents the effect upon the signal of adding an additional acoustic resonator to the filter. Limitation of the number of LARs to ten is in recognition of the fact that each additional reflection coefficient becomes progressively less significant than the preceding reflection coefficient and that ten LARs usually represent a satisfactory place to cut off the modeling without serious loss of response. The inclusion of more LARs would provide a marginal improvement in fidelity of response, but the difference between ten LARs and twelve seldom produces a detectable difference in the resulting speech. Furthermore, eight or even fewer LARs are often adequate for understanding. This makes it possible to use a system such as that of the present invention which uses redundancy to reduce the average bit rate and makes a temporary sacrifice of fidelity from time to time when it becomes necessary to reduce the bit rate below the average.
Systems for linear predictive coding that are presently in use have different frame periods and bit allocations. A typical such system is summarized in Table I which is a bit allocation for speech that was treated in frames 12.5 milliseconds in length. This corresponds to 80 frames per second. The voiced-unvoiced decision is encoded as one of the pitch levels so that a separate bit for voicing is not needed.
TABLE I
______________________________________
Bit Allocation for a Typical LPC System
LPC Parameter          Bits
______________________________________
Gain                   5
Pitch                  6
LAR 1                  6
LAR 2                  6
LAR 3                  5
LAR 4                  5
LAR 5                  5
LAR 6                  4
LAR 7                  4
LAR 8                  4
LAR 9                  4
LAR 10                 4
TOTAL                  58
______________________________________
Table I lists a total of 58 bits per frame which, for a frame length of 12.5 milliseconds (80 frames per second), represents a bit rate of 4640 bits per second. The addition of the two bits per frame that are necessary for synchronization raises the bit total to 60 and the bit rate to 4800 bits per second. The use of ten LARs in a frame length of 12.5 milliseconds gives excellent speaker recognition. It is desirable to retain that speaker recognition by retaining the same frame length and the same number of LARs and by keeping the same quantization of the LPC coefficients, all at a reduced bit rate.
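The bit-rate arithmetic for Table I can be checked with a short sketch. The constant and function names are illustrative only; the bit counts, the frame length, and the two synchronization bits are taken from the text.

```python
# Bit allocation of Table I (bits per frame for each LPC parameter).
GAIN_BITS = 5
PITCH_BITS = 6
LAR_BITS = [6, 6, 5, 5, 5, 4, 4, 4, 4, 4]  # LAR 1 through LAR 10
SYNC_BITS = 2                              # synchronization bits per frame
FRAME_MS = 12.5                            # frame length in milliseconds


def bits_per_frame(include_sync: bool = False) -> int:
    """Total bits in one frame, optionally counting the synchronization bits."""
    total = GAIN_BITS + PITCH_BITS + sum(LAR_BITS)
    return total + SYNC_BITS if include_sync else total


def bit_rate(include_sync: bool = False) -> float:
    """Bits per second: bits per frame times frames per second."""
    frames_per_second = 1000.0 / FRAME_MS
    return bits_per_frame(include_sync) * frames_per_second


print(bits_per_frame(), bit_rate())          # parameters only
print(bits_per_frame(True), bit_rate(True))  # with synchronization bits
```

The first line printed corresponds to the 58 bits and 4640 bits per second of the table alone, and the second to the 60 bits and 4800 bits per second obtained once synchronization is included.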
One method of reducing the data rate for speech is a technique called Variable-Frame-Rate (VFR) coding. This technique has been described by E. Blackman, R. Viswanathan, and J. Makhoul in "Variable-to-Fixed Rate Conversion of Narrowband LPC Speech," published in the Proceedings of the 1977 IEEE International Conference on Acoustics, Speech, and Signal Processing. In VFR coding the first four LARs are examined every frame time. If none of the four LARs differs in quantized value from the last transmitted frame, no information is transmitted for that frame. If any one of the four LARs has changed in quantized value from the last transmitted frame, then all of the parameters are transmitted for that frame. Hence, in this technique, either all or none of the LPC coefficients are transmitted at each frame time. Since in some frames no data are transmitted, the resulting data rate is reduced. This has the disadvantage that if one LPC coefficient changes, all are sent, even those that have not changed.
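The all-or-none decision of VFR coding described above may be sketched as follows. This is not the implementation of the cited paper; the frame representation (a dictionary holding quantized "gain", "pitch", and a list of ten "lars"), the function name, and the choice to always transmit the first frame are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence

Frame = Dict[str, object]  # hypothetical layout: {"gain": int, "pitch": int, "lars": [int] * 10}


def vfr_select(frames: Sequence[Frame], watched: int = 4) -> List[Optional[Frame]]:
    """Return, for each frame time, the full frame if it is sent or None if nothing is sent.

    A frame is sent when any of its first `watched` quantized LARs differs
    from those of the last transmitted frame. Assumption: the first frame
    is always sent, since there is nothing to compare it with.
    """
    last_sent: Optional[Frame] = None
    decisions: List[Optional[Frame]] = []
    for frame in frames:
        changed = (
            last_sent is None
            or frame["lars"][:watched] != last_sent["lars"][:watched]
        )
        if changed:
            decisions.append(frame)  # all parameters are sent
            last_sent = frame
        else:
            decisions.append(None)   # no information is sent for this frame
    return decisions
```

In this sketch a frame whose gain or fifth LAR changes while its first four LARs are unchanged is not sent at all, whereas a change in a single one of the first four LARs causes every parameter to be sent, which is the disadvantage noted above.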
It is an object of the present invention to reduce the bandwidth necessary to send digital data.
It is a further object of the present invention to use redundancies of speech to reduce the bandwidth necessary to send a digital characterization of speech.
It is a further object of the present invention to transmit speech subjected to linear predictive coding in an average bandwidth that is less than the maximum bandwidth needed to transmit the encoded speech.
Other objects will become apparent in the course of a detailed description of the invention.