The ability to code speech at low bit rates without sacrificing voice quality is becoming increasingly important in the new digital communications environment. Efficient speech coding methods will determine the success of numerous new applications such as digital encryption, mobile telephony, voice mail, and speech transmission over packet networks. Speech coding technology for voice quality is now well developed for bit rates as low as 16 kilobits/sec. (This means that 16 kilobits of data are required to code 1 sec. of speech.) Research is now focusing on achieving substantially lower rates, i.e., rates below 9.6 kilobits/sec. It is a major challenge in present applied speech research to achieve low bit rates without degrading speech quality.
One method for coding speech at relatively low bit rates is known as stochastic coding (see, for example, Schroeder et al., "Stochastic Coding Of Speech At Very Low Bit Rates, The Importance Of Speech Perception", Speech Communication 4, (1985), 155-162, and Schroeder et al., "Code Excited Linear Prediction (CELP): High Quality Speech At Very Low Bit Rates", IEEE, 1985).
In the stochastic coding method, an analog speech signal to be coded is first sampled at the Nyquist rate (e.g., about 8 kilohertz). The resulting train of samples is then broken up into short blocks which are stored, each block representing, for example, 5 milliseconds of speech. Illustratively, each block of speech contains 40 samples. The actual speech signal is then coded block by block.
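The block structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_into_blocks` is hypothetical, and dropping a trailing partial block is a choice made for this sketch, not something the text specifies.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate given in the text (about 8 kilohertz)
BLOCK_MS = 5           # block duration given in the text


def split_into_blocks(samples, sample_rate_hz=SAMPLE_RATE_HZ, block_ms=BLOCK_MS):
    """Split a list of samples into consecutive fixed-length blocks.

    Assumption of this sketch: a trailing partial block is dropped.
    """
    block_len = sample_rate_hz * block_ms // 1000  # 8000 * 5 / 1000 = 40
    n_blocks = len(samples) // block_len
    return [samples[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of (silent) speech samples
blocks = split_into_blocks(one_second)
print(len(blocks), len(blocks[0]))  # 200 40
```

One second of speech therefore yields 200 blocks of 40 samples each, which is the figure used in the later operation count.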
To use stochastic coding, for each block of speech to be coded, 1024 random code sequences are generated. Each random code sequence is multiplied by an amplitude factor and processed by two linear digital filters with time varying filter coefficients. After being processed in the foregoing manner, each code sequence is compared to the block of speech to be coded, and the code sequence which is closest to the actual block of speech is identified. An identification number for the chosen code sequence and information about the amplitude factor and filter coefficients are transmitted from the coder to the receiver.
More particularly, it is well known that a reasonable model for the production of human speech sounds may be obtained by representing human speech as the output of a time varying linear digital filter which is excited by a quasi-periodic pulse train (see for example Atal et al "Adaptive Predictive Coding of Speech Signals", Bell System Technical Journal, vol. 49, pp 1973-1986, Oct. 1970). The output of the digital filter at any sampling instant is a linear combination of the past p output samples and the present input sample.
A digital filter may be represented as a feedback loop which includes a tapped delay line. This delay line comprises a plurality of discrete delays of fixed duration related to the sampling interval mentioned above. Taps are located at uniform intervals along the delay line. The output of each tap is multiplied by a filter coefficient. After multiplication by the filter coefficients, the resulting tap outputs and the present input sample are added to form the filter output. In mathematical terms, the input to the filter is a sequence of weighted impulses. The output of the filter is also a sequence of weighted impulses, each output impulse being formed by adding the delayed outputs from the taps and the present input impulse as described above. The filter may be made time varying by utilizing time dependent filter coefficients.
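The feedback structure just described can be sketched as a direct recursion. The function name `recursive_filter`, the additive sign convention, and the zero initial state of the delay line are assumptions of this sketch; the text only states that each output is a linear combination of the past p outputs and the present input.

```python
def recursive_filter(x, coeffs):
    """Feedback filter: y[n] = x[n] + sum_{k=1..p} coeffs[k-1] * y[n-k].

    coeffs[k-1] is the multiplier on the tap that delays the output by k
    sampling intervals. The delay line is assumed to start empty (zeros).
    """
    p = len(coeffs)
    y = []
    for n, x_n in enumerate(x):
        acc = x_n  # present input sample
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += coeffs[k - 1] * y[n - k]  # weighted tap output
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0]
print(recursive_filter(impulse, [0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

Making the filter time varying amounts to calling it with a different `coeffs` list for each block of speech.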
In the stochastic coding method, a block of speech which illustratively comprises 40 samples may be coded as follows: First, 1024 random code sequences are generated by a code generator. Each sequence contains, for example, 40 elements or samples. After generation, each code sequence is multiplied by an amplitude factor which depends on the amplitudes in the actual block of speech to be coded. Thus, the amplitude factor is adjusted for each block of speech to be coded. After multiplication by the amplitude factor, each code sequence is passed through two time varying linear digital filters of the type described above.
As set forth in the references mentioned above, the first filter includes a long delay predictor in its feedback loop and the second filter includes a short delay predictor in its feedback loop. Physically, the first filter generates the pitch periodicity of the human vocal cords and the second filter generates the filtering action of the human vocal tract (e.g., mouth, tongue and lips).
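The cascade of the two filters can be sketched as follows. This is a simplified illustration, not the configuration used in the cited references: the long delay predictor is reduced to a single tap with hypothetical parameters `gain` and `delay`, the function names are invented for the example, and both filters start from a zero state.

```python
def long_delay_filter(x, gain, delay):
    """Pitch filter sketch: y[n] = x[n] + gain * y[n - delay].

    Assumption: a single feedback tap placed `delay` samples back.
    """
    y = []
    for n, x_n in enumerate(x):
        past = y[n - delay] if n >= delay else 0.0
        y.append(x_n + gain * past)
    return y


def short_delay_filter(x, coeffs):
    """Vocal tract filter sketch: y[n] = x[n] + sum_k coeffs[k-1] * y[n-k]."""
    y = []
    for n, x_n in enumerate(x):
        acc = x_n
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a_k * y[n - k]
        y.append(acc)
    return y


def synthesize(excitation, pitch_gain, pitch_delay, short_coeffs):
    """Pass the excitation through the long delay filter, then the short one."""
    pitched = long_delay_filter(excitation, pitch_gain, pitch_delay)
    return short_delay_filter(pitched, short_coeffs)


impulse = [1.0] + [0.0] * 6
print(long_delay_filter(impulse, 0.5, 3))
# [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.25]
```

A single input pulse reappears every `delay` samples with decaying amplitude, which is the periodic structure the first filter contributes; the second filter then shapes each of those pulses.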
The filter coefficients are changed for each block of actual speech to be coded (but not for each code sequence), in accordance with an algorithm known as adaptive predictive coding. This algorithm is discussed in the above-mentioned references and in B. S. Atal "Predictive Coding of Speech at Low Bit Rates", IEEE Trans. Commun. Vol. COM-30, 1982, pp 600-614, and S. Singhal et al "Improving Performance of Multi-pulse LPC Coders at Low Bit Rates", Proc. Int. Conf. on Acoustics, Speech, and Signal Proc., Vol. 1, paper No. 1.3, March 1984.
After multiplication by the amplitude factor and processing by the two digital filters, each of the 1024 random code sequences is successively compared with the actual block of speech to be coded. The processed code sequence which is closest to the actual block of speech is identified. A 10-bit identification number identifying the chosen code sequence and information relating to the amplitude factor and the filter coefficients are then transmitted from the coding device to the receiver. Upon receipt of this information, the receiver retrieves the chosen code sequence from its memory, multiplies the chosen sequence by the transmitted amplitude factor and processes the chosen code sequence through two digital filters using the transmitted filter coefficients to reproduce the actual speech signal.
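The exhaustive search and the receiver's reconstruction can be sketched as below. Several things here are assumptions made for illustration: the Gaussian code sequences, the plain squared-error measure of "closest", the one-tap stand-in `synth` in place of the two time varying filters, and all function names. The sketch also assumes the coder and the receiver hold identical copies of the codebook, here obtained from a shared seed.

```python
import random


def make_codebook(size=1024, length=40, seed=0):
    """Generate `size` random code sequences of `length` elements each."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def synth(excitation, a=0.9):
    """Stand-in for the two filters: y[n] = e[n] + a * y[n-1]."""
    y, prev = [], 0.0
    for e in excitation:
        prev = e + a * prev
        y.append(prev)
    return y


def search_codebook(target, codebook, amplitude):
    """Try every code sequence and return (index, error) of the closest one."""
    best_index, best_err = -1, float("inf")
    for index, code in enumerate(codebook):
        candidate = synth([amplitude * c for c in code])
        err = sum((t - s) ** 2 for t, s in zip(target, candidate))
        if err < best_err:
            best_index, best_err = index, err
    return best_index, best_err


def decode(index, codebook, amplitude):
    """Receiver side: look up the sequence, scale it, and filter it."""
    return synth([amplitude * c for c in codebook[index]])


codebook = make_codebook()
target = decode(123, codebook, 2.0)  # a block the codebook can match exactly
index, err = search_codebook(target, codebook, 2.0)
print(index, err)  # 123 0.0
```

Because the codebook holds 1024 = 2^10 sequences, the chosen index fits in the 10-bit identification number mentioned above. Note that every one of the 1024 candidates is filtered for every block, which is where the computational cost of the method comes from.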
Using the above-described stochastic coding method, high quality synthetic speech has been produced at bit rates as low as 4.8 kilobits/sec. However, the stochastic coding method is computationally very expensive. According to the foregoing references, it takes 125 sec. of Cray-1 CPU time to process 1 sec. of speech signal. To look at this another way, if one second of actual speech signal is divided up into 200 five-millisecond blocks of 40 samples each, and each of the 1024 random code sequences comprises 40 elements, and the two filters have a total of 19 taps, then the filtering operations required to code 1 sec. of actual speech involve
$$19 \times 40 \times 1024 \times 200 = 155{,}648{,}000$$
separate computational steps (i.e., multiplies and adds).
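The operation count above can be checked with a few lines of arithmetic. The constant names are introduced here for illustration only; the values are those given in the text.

```python
TAPS = 19                # total taps in the two filters
SAMPLES_PER_BLOCK = 40   # elements per code sequence
CODEBOOK_SIZE = 1024     # random code sequences tried per block
BLOCKS_PER_SECOND = 200  # five-millisecond blocks in one second

# One multiply-add per tap, per sample, per candidate sequence.
ops_per_block = TAPS * SAMPLES_PER_BLOCK * CODEBOOK_SIZE
ops_per_second = ops_per_block * BLOCKS_PER_SECOND

print(f"{ops_per_block:,} per block")    # 778,240 per block
print(f"{ops_per_second:,} per second")  # 155,648,000 per second
```

The count scales linearly with the number of code sequences, so the exhaustive search over the codebook accounts for the factor of 1024 in the total.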
Thus, the stochastic coding technique is not particularly suitable for commercial applications. Accordingly, it is an object of the present invention to provide a method for coding speech which, like stochastic coding, achieves bit rates in the 4.8 kilobits/sec range, but which requires significantly fewer computational resources.