Digital speech communication systems, including voice storage and voice response systems, use speech coding and data compression techniques to reduce the bit rate needed for storage and transmission. Voiced speech is produced by a periodic excitation of the vocal tract by the vocal cords. As a consequence, a corresponding signal for voiced speech contains a succession of similar but evolving waveforms having a substantially common period which is referred to as the pitch period. Typical speech coding systems take advantage of short-term redundancies within a pitch period interval to achieve data compression in a coded speech signal.
In a typical voice coder (vocoder) system, such as that described in U.S. Pat. No. 3,624,302, which is incorporated by reference herein, the speech signal is partitioned into successive fixed duration intervals of 10 msec. to 30 msec. and a set of coefficients is generated approximating the short-term frequency spectrum resulting from the short-term redundancies or correlation in each interval. These coefficients are generated by linear predictive analysis and are referred to as linear predictive coefficients (LPC's). The LPC's represent a time-varying all-pole filter that models the vocal tract. The LPC's are usable for reproducing the original speech signal by employing an excitation signal referred to as a prediction residual. The prediction residual represents a component of the original speech signal that remains after removal of the short-term redundancy by linear predictive analysis.
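By way of illustration only, linear predictive analysis of one interval can be sketched as follows. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain LPC's and is not necessarily the method of the cited patent. The function names and the predictor order are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor s_hat[n] = sum_k a[k] * s[n - k].

    Returns (a, err): a[1..order] are the coefficients (a[0] is unused and
    left at 0.0) and err is the remaining prediction-error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err  # reflection coefficient for this stage
        prev = a[:]
        a[m] = k_m
        for k in range(1, m):
            a[k] = prev[k] - k_m * prev[m - k]
        err *= 1.0 - k_m * k_m
    return a, err


def prediction_residual(frame, a):
    """Remove the short-term redundancy: e[n] = s[n] - sum_k a[k] * s[n - k].

    Samples before the start of the frame are treated as zero (an assumption).
    """
    order = len(a) - 1
    residual = []
    for n in range(len(frame)):
        predicted = sum(a[k] * frame[n - k]
                        for k in range(1, order + 1) if n - k >= 0)
        residual.append(frame[n] - predicted)
    return residual
```

For a frame held in a list `frame`, calling `levinson_durbin(autocorrelation(frame, 10), 10)` would give a tenth-order coefficient set, and `prediction_residual(frame, a)` the corresponding residual.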
In vocoders, the prediction residual is typically modeled as white noise for unvoiced sounds and as a periodic sequence of impulses for voiced speech. A synthesized speech signal can be generated by a vocoder synthesizer based on the modeled residual and the LPC's of the linear predictive filter modeling the vocal tract. Vocoders approximate the spectral information of an original speech signal and not the time-domain waveform of such a signal. Moreover, a speech signal synthesized by such a vocoder often has a perceptibly synthetic quality and is, at times, difficult to understand.
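The vocoder synthesis described above can be sketched, purely as an illustration, by driving an all-pole filter with a modeled excitation. The function names, the default pitch period of 80 samples, and the unit-variance Gaussian noise source are assumptions made for this sketch and are not taken from any cited system.

```python
import random


def make_excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Model the prediction residual as a vocoder would.

    Voiced speech: unit impulses spaced one pitch period apart.
    Unvoiced speech: white Gaussian noise.
    """
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def all_pole_synthesis(excitation, a, gain=1.0):
    """Run the excitation through the all-pole filter defined by the LPC's.

    s[n] = gain * e[n] + sum_k a[k] * s[n - k], with a[0] unused and
    zero initial filter state (an assumption of this sketch).
    """
    order = len(a) - 1
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(a[k] * out[n - k]
                       for k in range(1, order + 1) if n - k >= 0)
        out.append(gain * e + feedback)
    return out
```

Because only the spectral envelope and a pitch value are retained, the output follows the spectrum of the original speech but not its time-domain waveform.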
Alternative known speech coding techniques having improved perceptual speech quality approximate the waveform of a speech signal. Conventional analysis-by-synthesis systems employ such a coding technique. Typical analysis-by-synthesis systems are able to achieve synthesized speech having acceptable perceptual quality. Such systems employ both linear predictive analysis for coding the short-term redundant characteristics of the pitch period and a long-term predictor (LTP) for coding long-term pitch correlation in the prediction residual. In LTP's, characteristics of past pitch periods are used to provide an approximation of characteristics of a present pitch period. Typical LTP's have included an all-pole filter providing delayed feedback of past pitch-period characteristics, or a codebook of overlapping vectors of past pitch-period characteristics.
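As a hedged illustration of long-term prediction, the following sketch estimates a single-tap LTP by an open-loop search over candidate lags. The single-tap form, the normalized-correlation criterion, and all names and parameters are assumptions of this sketch; practical coders may use multi-tap predictors or closed-loop searches.

```python
def long_term_predictor(residual, start, length, min_lag, max_lag):
    """Find (lag, gain) so that residual[n] ~ gain * residual[n - lag]
    for n in [start, start + length).

    The lag that maximizes cross**2 / energy minimizes the squared
    prediction error when the gain is set to cross / energy.
    """
    current = residual[start:start + length]
    best_lag, best_gain, best_score = min_lag, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        if start - lag < 0:
            break  # not enough past samples for this lag
        past = residual[start - lag:start - lag + length]
        cross = sum(c * p for c, p in zip(current, past))
        energy = sum(p * p for p in past)
        if energy <= 0.0 or cross <= 0.0:
            continue  # no usable (positive) correlation at this lag
        score = cross * cross / energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, cross / energy, score
    return best_lag, best_gain
```

The returned lag plays the role of the pitch period estimate, and the gain indicates how strongly the present pitch period resembles the past one.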
In particular analysis-by-synthesis systems, the prediction residual is modeled by an adaptive or stochastic codebook of noise signals. The optimum excitation is found by searching through the codebook of candidate excitation vectors for successive speech intervals referred to as frames. A code specifying the particular codebook entry of the found optimum excitation is then transmitted on a channel along with coded LPC's and the LTP parameters. These particular analysis-by-synthesis systems are referred to as code-excited linear prediction (CELP) systems. Exemplary CELP coders are described in greater detail in B. Atal and M. Schroeder, “Stochastic Coding of Speech Signals at Very Low Bit Rates”, Proceedings IEEE Int. Conf. Comm., p. 48.1 (May 1984); M. Schroeder and B. Atal, “Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates”, Proc. IEEE Int. Conf. ASSP., pp. 937-940 (1985); and P. Kroon and E. Deprettere, “A Class of Analysis-by-Synthesis Predictive Coders for High-Quality Speech Coding at Rates Between 4.8 and 16 kbits/s”, IEEE J. on Sel. Areas in Comm., SAC-6(2), pp. 353-363 (Feb. 1988), which are all incorporated by reference herein.
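The codebook search described above can be sketched as follows, as an illustration only. Each candidate excitation vector is passed through the synthesis filter, scaled by its best gain, and compared with the target signal. The sketch makes several simplifying assumptions that real CELP coders typically do not: it omits perceptual weighting, LTP contributions, and filter memory carried over from earlier frames. All names are hypothetical.

```python
import random


def synthesize(excitation, a):
    """Zero-state all-pole synthesis: s[n] = e[n] + sum_k a[k] * s[n - k]."""
    order = len(a) - 1
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(a[k] * out[n - k]
                       for k in range(1, order + 1) if n - k >= 0)
        out.append(e + feedback)
    return out


def make_stochastic_codebook(size, length, seed=0):
    """Build a codebook of Gaussian noise vectors (an assumed construction)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry whose synthesized, gain-scaled
    output is closest to the target in the squared-error sense."""
    target_energy = sum(t * t for t in target)
    best_index, best_gain, best_error = -1, 0.0, float("inf")
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, a)
        energy = sum(y * y for y in synth)
        if energy == 0.0:
            continue  # an all-zero candidate cannot match anything
        cross = sum(t * y for t, y in zip(target, synth))
        gain = cross / energy  # optimal gain for this candidate
        error = target_energy - cross * cross / energy
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain, best_error
```

Only the winning index and a quantized gain would need to be sent for each frame, together with the coded LPC's and LTP parameters, which is what allows such coders to operate at low bit rates.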
However, in vocoder and analysis-by-synthesis systems as well as other types of speech coding systems, there is a recognized need for methods of coding characteristics of the short-term frequency spectrum with enhanced perceptual accuracy.