Speech coding approaches which are known in the art include:
Taguchi (U.S. Pat. No. 4,301,329)
Itakura et al. (U.S. Pat. No. 4,393,272)
Ozawa et al. (U.S. Pat. No. 4,716,592)
Copperi et al. (U.S. Pat. No. 4,791,670)
Bronson et al. (U.S. Pat. No. 4,797,926)
Atal et al. (U.S. Pat. No. Re. 32,590)
C. G. Bell et al., "Reduction of Speech Spectra by Analysis-by-Synthesis Techniques", J. Acoust. Soc. Am., Vol. 33, Dec. 1961, pp. 1725-1736
F. Itakura, "Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals", J. Acoust. Soc. Am., Vol. 57, Supplement No. 1, 1975, p. 535
G. S. Kang and L. J. Fransen, "Low-Bit-Rate Speech Encoders Based on Line Spectrum Frequencies (LSFs)", Naval Research Laboratory Report No. 8857, Nov. 1984
S. Maitra and C. R. Davis, "Improvements on the Classical Model for Better Speech Quality", IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 23-27, 1980
M. Yong, G. Davidson and A. Gersho, "Encoding of LPC Spectral Parameters Using Switched-Adaptive Interframe Vector Prediction", pp. 402-405, Dept. of Electrical and Computer Engineering, Univ. of California, Santa Barbara, 1988
M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates", pp. 937-940, 1985
B. S. Atal and J. R. Remde, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates", pp. 614-617, 1982
L. R. Rabiner, M. J. Cheng, A. E. Rosenberg and C. A. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms", IEEE Trans. Acoust., Speech, and Signal Process., Vol. ASSP-24, pp. 399-417, Oct. 1976
J. P. Campbell, Jr. and T. E. Tremain, "Voiced/Unvoiced Classification of Speech With Applications to the U.S. Government LPC-10E Algorithm", ICASSP 86, Tokyo, pp. 473-476 (undated)
P. Kroon and B. S. Atal, "Pitch Predictors with High Temporal Resolution", Proc. IEEE ICASSP, 1990, pp. 661-664
F. F. Tzeng, "Near-Toll-Quality Real-Time Speech Coding at 4.8 KBIT/s for Mobile Satellite Communications", pp. 1-6, 8th International Conference on Digital Satellite Communications, April 1989
The teachings of the above and any other references mentioned throughout the specification are incorporated herein by reference for the purpose of indicating the background of the invention and/or illustrating the state of the art.
A 2.4 kbps linear predictive speech coder, with an excitation model as shown in FIG. 1 (indicated as 100), has found widespread military and commercial application. A spectrum synthesizer 102 (e.g., a 10th-order all-pole filter), used to mimic a subject's speech generation (i.e., vocal) system, is driven by a signal from a gain amplifier 104 (gain G) to produce reconstructed speech. The gain amplifier 104 receives and amplifies a signal from a voiced/unvoiced (V/UV) determination means 106. In operation, the voiced/unvoiced determination means 106 decides, for each individual speech frame, whether that frame is a voiced or an unvoiced frame.
The voiced/unvoiced determination means 106 makes a "voiced" determination, and correspondingly sets a switch 107 to a "voiced" terminal, when the sounds of the speech frame of interest are generated by the vocal cords, e.g., the phonetic sounds of the letters "b", "d", "g", etc. In contrast, the voiced/unvoiced determination means 106 makes an "unvoiced" determination, and correspondingly sets the switch 107 to an "unvoiced" terminal, when the sounds of the speech frame of interest are not generated by the vocal cords, e.g., the phonetic sounds of the letters "p", "t", "k", "s", etc. For a voiced frame, a pulse train generator 108 estimates a pitch value of the speech frame of interest, and outputs a pulse train, with a period equal to the pitch value, to the voiced/unvoiced determination means for use as an excitation signal. For an unvoiced frame, a Gaussian noise generator 110 generates and outputs a white Gaussian sequence for use as an excitation signal.
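The excitation model of FIG. 1 may be illustrated by the following minimal sketch, which is not taken from any of the cited references. The function name `synthesize_frame`, its parameters, the unit-amplitude pulses, and the sign convention s[n] = G*e[n] + sum_k a_k*s[n-k] for the all-pole filter are assumptions made for illustration only.

```python
import random


def synthesize_frame(lpc, gain, voiced, pitch_period, n_samples=180,
                     state=None, rng=None):
    """Drive an all-pole filter with a pulse train or white Gaussian noise.

    lpc: predictor coefficients a_1..a_p (hypothetical sign convention:
         s[n] = gain * e[n] + sum_k a_k * s[n - k]).
    Returns the synthesized samples and the filter memory for the next frame.
    """
    rng = rng or random.Random(0)
    order = len(lpc)
    # Filter memory: previous output samples, most recent first.
    history = list(state) if state is not None else [0.0] * order

    # The V/UV switch selects one of the two excitation sources.
    if voiced:
        # Pulse train with a period equal to the pitch value (in samples).
        excitation = [1.0 if n % pitch_period == 0 else 0.0
                      for n in range(n_samples)]
    else:
        # White Gaussian sequence.
        excitation = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    output = []
    for e in excitation:
        s = gain * e + sum(a * h for a, h in zip(lpc, history))
        output.append(s)
        history = ([s] + history)[:order]
    return output, history


# Toy 10th-order filter (only a_1 is non-zero) used purely for demonstration.
toy_lpc = [0.5] + [0.0] * 9
voiced_out, memory = synthesize_frame(toy_lpc, gain=2.0, voiced=True,
                                      pitch_period=40)
unvoiced_out, _ = synthesize_frame(toy_lpc, gain=2.0, voiced=False,
                                   pitch_period=0, state=memory)
```

In this sketch the filter memory is carried from one frame to the next so that consecutive frames join without a discontinuity; a practical coder would also decode the quantized coefficients, gain, and pitch before synthesis.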
A typical bit allocation scheme for the above-described model is as follows: For a speech signal sampled at 8 kHz, and with a frame size of 180 samples (i.e., 22.5 ms per frame), 54 data bits are available per frame at 2.4 kbps. Out of the 54 bits, 41 bits are allocated for the scalar quantization of the ten spectrum synthesizer coefficients (5, 5, 5, 5, 4, 4, 4, 4, 3 and 2 bits for the ten coefficients, respectively), 5 bits are used for gain coding, 1 bit is used to specify a voiced or an unvoiced frame, and 7 bits are used for pitch coding.
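The arithmetic of the above bit allocation can be checked with a short sketch. The constant and function names are hypothetical; the numerical values are those stated above.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 180
COEFF_BITS = [5, 5, 5, 5, 4, 4, 4, 4, 3, 2]  # ten spectrum coefficients
GAIN_BITS = 5
VOICING_BITS = 1
PITCH_BITS = 7


def frame_budget():
    """Return (spectrum bits, bits per frame, frame length in ms, bit rate in bps)."""
    spectrum_bits = sum(COEFF_BITS)
    bits_per_frame = spectrum_bits + GAIN_BITS + VOICING_BITS + PITCH_BITS
    frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE_HZ
    bit_rate_bps = bits_per_frame * SAMPLE_RATE_HZ / FRAME_SAMPLES
    return spectrum_bits, bits_per_frame, frame_ms, bit_rate_bps


print(frame_budget())  # (41, 54, 22.5, 2400.0)
```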
The above-described approach is generally referred to in the art as LPC-10. An LPC-10 coder is able to produce intelligible speech at a low data rate, which is very useful. However, the reconstructed speech is not natural enough for many other applications.
The major reason for the LPC-10's limited success is the rigid binary excitation model which it adopts. At 2.4 kbps, however, use of an over-simplified excitation model is a necessity. As a result of the arrangement of the LPC-10, performance depends critically on a correct V/UV decision and on accurate pitch estimation and tracking. Many complicated schemes have been proposed for the V/UV decision and for pitch estimation/tracking; however, no completely satisfactory solution has been found. This is especially true when the desired speech signal is corrupted by background acoustic noise, or when a multi-talker situation occurs.
Another drawback of the LPC-10 approach is that when a frame is determined to be unvoiced, the seven bits allocated for the pitch value are wasted. Also, since open-loop methods are used for the V/UV decision and pitch estimation/tracking, the synthesized speech is not perceptually reconstructed to mimic the original speech, regardless of the complexity of the V/UV decision rule and the pitch estimation/tracking strategy. Accordingly, the above-described scheme provides no guarantee of how close the synthesized speech will be to the original speech in terms of any pre-defined distortion measure.
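By way of illustration only, an open-loop V/UV decision and pitch estimate might be sketched as follows. The normalized-autocorrelation criterion, the lag range of 20 to 160 samples, the threshold of 0.5, and the name `open_loop_pitch` are all assumptions chosen for this example and are not asserted to be the method of the LPC-10 or of any cited reference. The point of the sketch is that the decision is made from the input frame alone; the synthesized speech is never compared with the original.

```python
import random


def open_loop_pitch(frame, min_lag=20, max_lag=160, voicing_threshold=0.5):
    """Return (voiced, pitch_lag) from the input frame only (open loop)."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return False, 0
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        corr = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        score = corr / energy
        if score > best_score:
            best_lag, best_score = lag, score
    voiced = best_score >= voicing_threshold
    # When the frame is declared unvoiced, no pitch value is meaningful.
    return voiced, (best_lag if voiced else 0)


# Synthetic test signals: a pulse train with a 50-sample period, and noise.
pulse_frame = [1.0 if n % 50 == 0 else 0.0 for n in range(180)]
noise_rng = random.Random(1)
noise_frame = [noise_rng.gauss(0.0, 1.0) for _ in range(180)]

print(open_loop_pitch(pulse_frame))
print(open_loop_pitch(noise_frame))
```

On clean synthetic input such a rule behaves well; the difficulty noted above arises with real speech in background noise or with multiple talkers, where a single threshold on the input signal may misclassify frames and nothing in the coder detects the resulting distortion.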