The present invention generally relates to a method and system of encoding digital speech information so as to achieve an economical representation of speech with the least possible loss of quality, thereby providing speech transmission in a vocoder-type system or simply a speech synthesis system with a reduced bit rate while retaining speech quality in the audible reproduction of the encoded digital speech information. More particularly, the present invention is directed to a method and system employing Markov modeling and Huffman coding on quantized speech parameter values, wherein the speech parameter values may be indicative of linear predictive coding pitch, energy and reflection coefficients, to improve the coding efficiency by providing an optimal reduction in the speech data rate while the speech quality in the audible reproduction of the speech data remains unaffected.
Linear predictive coding (LPC) is a well-known method of digitally coding speech information, in widespread use in vocoder and speech synthesis systems from which audible synthesized speech can be reproduced. LPC is based upon the recognition that the sound patterns constituting speech tend to remain relatively consistent for long periods of time. A typical frame of speech that has been encoded in digital form using linear predictive coding will have a specified allocation of binary digits to describe the gain, the pitch and each of ten reflection coefficients characterizing the lattice filter equivalent of the vocal tract in a speech synthesis system. The use of ten reflection coefficients as speech parameters in the analysis and synthesis of speech is arbitrary. Adding more reflection coefficients increases the memory storage requirements of the system, and each additional reflection coefficient is of progressively less significance in contributing to audible speech quality than the preceding reflection coefficient. Thus, ten reflection coefficients may generally be regarded as a satisfactory number to achieve high quality speech via linear predictive coding without unnecessarily adding to the memory storage requirements. Although the inclusion of more reflection coefficients as speech parameters would provide a marginal improvement in the quality of the audible speech derived therefrom, the difference in the resulting audible speech is for practical purposes unnoticeable. Furthermore, it is possible to achieve adequate speech quality using a linear predictive coding technique in which the number of reflection coefficients defining speech parameters is less than ten, e.g. eight or even fewer.
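The relationship between a frame's bit allocation and the resulting data rate can be illustrated with a minimal sketch. The bit counts and the 20 ms frame period below are hypothetical values chosen only for illustration; they are not taken from this specification or from U.S. Pat. No. 4,209,836. The names `ALLOCATION`, `bits_per_frame` and `bit_rate_bps` are likewise illustrative.

```python
# Hypothetical per-parameter bit allocation for one LPC frame.
# Higher-order reflection coefficients get fewer bits here, on the
# assumption that they contribute progressively less to speech quality.
REFLECTION_BITS = [5, 5, 4, 4, 4, 4, 4, 3, 3, 3]  # k1..k10 (illustrative)

ALLOCATION = {"gain": 4, "pitch": 5}  # illustrative values
ALLOCATION.update(
    {f"k{i}": bits for i, bits in enumerate(REFLECTION_BITS, start=1)}
)


def bits_per_frame(allocation):
    """Total number of bits needed to encode one frame."""
    return sum(allocation.values())


def bit_rate_bps(allocation, frame_ms):
    """Data rate in bits per second for a fixed frame period in milliseconds."""
    return bits_per_frame(allocation) * 1000 / frame_ms


if __name__ == "__main__":
    print(bits_per_frame(ALLOCATION))
    print(bit_rate_bps(ALLOCATION, frame_ms=20))
```

Under these assumed numbers, removing entries such as `k9` and `k10` from the allocation lowers the bits per frame and hence the data rate, which is the storage-versus-quality trade-off described above.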
Systems for linear predictive coding as heretofore contemplated have included different frame lengths and bit allocations, such as that described in U.S. Pat. No. 4,209,836 to Wiggins, Jr. et al, issued June 24, 1980, which assigns differing bit lengths to the respective speech parameters including gain, pitch and the ten reflection coefficients described therein. The use of ten reflection coefficients as speech parameters in a speech analysis and/or speech synthesis system relying upon linear predictive coding produces audible speech of excellent quality. It would be desirable to retain the same degree of speech quality in such a speech analysis and/or speech synthesis system by retaining the same number of reflection coefficients as speech parameters with the same quantization levels, but at a reduced bit rate.
Heretofore, efforts to reduce the data rate for speech in a vocoder system without a proportional deterioration of the speech quality have concentrated on the choice of the appropriate speech parameters. In this connection, attempts have been made to select for coding the speech parameters which are most closely associated with human perception, with the less relevant speech parameter information being discarded so as to achieve effective low bit rate coding. Where a vocoder is involved, such attempts are directed to adequately representing the speech spectral envelope, which represents the vocal tract filter, and to representing the filter excitation and energy with the least speech parameter information necessary to provide audible speech of reasonable quality. This approach results in a static representation of the speech production model which ignores the dynamic evolution of the speech waveform and causes deterioration in the speech quality to be achieved therefrom.
Some attempts have been made to capitalize upon the dynamic behavior of speech. One such technique of reducing the data rate for speech is referred to as variable-frame-rate (VFR) coding, as described in "Variable-to-Fixed Rate Conversion of Narrowband LPC Speech", Blackman et al, Proceedings of the 1977 IEEE International Conference on Acoustics, Speech, and Signal Processing. In so-called VFR coding, a first number of reflection coefficients, e.g. four reflection coefficients, is examined every frame time. If none of this first number of reflection coefficients has a quantized value different from that in the last transmitted frame, no information is transmitted for that frame. If any one of this first number of reflection coefficients has a change in quantized value from the last transmitted frame, then all of the speech parameters are transmitted for that frame. Thus, all or none of the LPC coefficients are transmitted at each frame time. Since no data is transmitted in some frames, the result is a reduction in the data rate. While techniques such as this have achieved some positive results in an effective reduction in the data rate without unduly penalizing the quality of the speech to be obtained, further reductions in speech data rate without accompanying degradation of speech quality have not been forthcoming from this approach, which may be described as a deterministic modeling of the time behavior of speech.
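The all-or-none transmission rule of VFR coding described above can be sketched as follows. This is a minimal illustration rather than the implementation of the cited reference. It assumes that each frame is a tuple of quantized parameter values with the examined reflection coefficients placed first, and that the first frame is always transmitted. The name `vfr_select` and the frame layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def vfr_select(
    frames: Sequence[Sequence[int]], n_checked: int = 4
) -> List[Optional[Tuple[int, ...]]]:
    """Apply an all-or-none variable-frame-rate rule.

    Each frame holds quantized parameter values, with the reflection
    coefficients to be examined in the first n_checked positions
    (an assumed layout). Returns, for each frame, either the full
    tuple of parameters (frame transmitted) or None (nothing sent).
    """
    decisions: List[Optional[Tuple[int, ...]]] = []
    last_sent: Optional[Tuple[int, ...]] = None
    for frame in frames:
        current = tuple(frame)
        unchanged = (
            last_sent is not None
            and current[:n_checked] == last_sent[:n_checked]
        )
        if unchanged:
            # No examined coefficient differs from the last transmitted frame.
            decisions.append(None)
        else:
            # At least one examined coefficient changed (or this is the
            # first frame): transmit every parameter of the frame.
            decisions.append(current)
            last_sent = current
    return decisions


if __name__ == "__main__":
    frames = [(1, 2, 3, 4, 9), (1, 2, 3, 4, 8), (1, 2, 3, 5, 8), (1, 2, 3, 5, 8)]
    for decision in vfr_select(frames):
        print("skip" if decision is None else decision)
```

In this sketch the comparison is made against the last transmitted frame rather than the immediately preceding frame, matching the description above. A change confined to parameters outside the examined coefficients, such as the fifth value in the second example frame, does not trigger transmission.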