Speech processing systems digitally encode an input speech signal before further processing it. Speech encoders may be broadly classified as either waveform coders or voice coders (also called vocoders). Waveform coders can produce natural-sounding speech, but require relatively high bit rates. Voice coders have the advantage of operating at lower bit rates with higher compression ratios, but are perceived as sounding more synthetic than waveform coders. Lower bit rates are desirable in order to use a finite transmission channel bandwidth more efficiently. Speech signals are known to contain significant redundant information, and the effort to lower coding bit rates is in part directed toward identifying and removing such redundancy.
Speech signals are intrinsically non-stationary, but they can be considered as quasi-stationary signals over short periods such as 5 to 30 msec, generally known as a frame. Some particular speech features may be obtained from the spectral information present in a speech signal during such a speech frame. Voice coders extract such spectral features in encoding speech frames.
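The framing described above can be sketched in a few lines. The 20 msec frame length and 8 kHz sampling rate below are illustrative choices (the frame length falls within the stated 5 to 30 msec range); the function name is likewise an assumption for illustration only.

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=20):
    """Split a 1-D speech signal into non-overlapping frames.

    A 20 msec frame length is an illustrative choice within the
    5-30 msec quasi-stationary range; trailing samples that do not
    fill a whole frame are discarded in this sketch.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return np.asarray(signal[:n_frames * frame_len]).reshape(n_frames, frame_len)

# Example: 1 second of 8 kHz audio yields 50 frames of 160 samples each.
x = np.zeros(8000)
frames = split_into_frames(x, 8000)
```

Production coders typically use overlapping, windowed frames, but the non-overlapping split above is sufficient to illustrate the quasi-stationary segmentation.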
It is also well known that speech signals contain significant correlation between nearby samples. This redundant short-term correlation can be removed from a speech signal by the technique of linear prediction. For the past 30 years, such linear predictive coding (LPC) has been used in speech coding: a linear predictive filter representing the short-term spectral information is computed for each presumed quasi-stationary segment. A general discussion of this subject matter appears in Chapter 7 of Deller, Proakis & Hansen, Discrete-Time Processing of Speech Signals (Prentice Hall, 1987), which is incorporated herein by reference.
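A minimal sketch of computing such a linear predictive filter for one frame, using the standard autocorrelation method with the Levinson-Durbin recursion, might look as follows. The function name and the 10th-order default are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC filter coefficients for one quasi-stationary frame
    by the autocorrelation method (Levinson-Durbin recursion).

    Returns a with a[0] == 1, so the analysis filter is
    A(z) = 1 + a[1]z^-1 + ... + a[order]z^-order.
    """
    n = len(frame)
    # Autocorrelation lags 0..order of the frame.
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a
```

At an assumed 8 kHz sampling rate, a 10th-order predictor spans 1.25 msec of signal history, consistent with the short span of correlation that LPC coefficients cover.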
A residual signal, representing all the information not captured by the LPC coefficients, is obtained by passing the original speech signal through the linear predictive filter. This residual signal is normally very complex. In early LPC coders, this complex residual signal was grossly approximated by making a binary choice between a white noise signal for unvoiced sounds, and a regularly spaced pulse signal for voiced sounds. Such approximation resulted in a highly degraded voice quality. Accordingly, linear predictive coders using more sophisticated encoding of the residual signal have been the focus of further development efforts.
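Obtaining the residual by passing the frame through the (inverse) linear predictive analysis filter can be sketched as follows; the function name is illustrative, and a zero initial filter state is assumed.

```python
import numpy as np

def lpc_residual(frame, a):
    """Pass the frame through the LPC analysis (inverse) filter
    A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p, with a[0] == 1, to obtain
    the prediction residual. Zero initial filter state is assumed,
    so the FIR filtering reduces to a truncated convolution.
    """
    return np.convolve(frame, a)[: len(frame)]
```

Because the analysis filter is FIR, feeding this residual back through the corresponding all-pole synthesis filter 1/A(z) reconstructs the frame exactly; it is the cost of transmitting the residual that the coders discussed below seek to reduce.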
All such coders could be classified under the broad term of residual excited linear predictive (RELP) coders. The earliest RELP coders used a baseband filter to process the residual signal in order to obtain a series of equally spaced non-zero pulses which could be coded at significantly lower bit rates than the original signal, while preserving high signal quality. Even this signal can still contain a significant amount of redundancy, however, especially during periods of voiced speech. This type of redundancy is due to the regularity of the vibration of the vocal cords and lasts for a significantly longer time span, typically 2.5-20 msec., than the correlation covered by the LPC coefficients, typically <2 msec.
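The longer-span pitch redundancy described above is what a long-term (pitch) predictor removes. A minimal one-tap sketch follows; the lag search range of 20-160 samples corresponds to the stated 2.5-20 msec span at an assumed 8 kHz sampling rate, and the single-tap form and function names are illustrative rather than drawn from any particular coder.

```python
import numpy as np

def long_term_predict(residual, min_lag=20, max_lag=160):
    """One-tap long-term (pitch) predictor sketch.

    Searches for the lag maximizing the normalized correlation between
    the residual and its delayed copy, then subtracts the gain-scaled
    delayed residual, removing the pitch-periodic redundancy.
    """
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        past = residual[:-lag]
        cur = residual[lag:]
        denom = np.dot(past, past)
        if denom <= 0.0:
            continue
        score = np.dot(cur, past) ** 2 / denom
        if score > best_score:
            best_lag, best_score = lag, score
    past = residual[:-best_lag]
    denom = np.dot(past, past)
    if denom == 0.0:  # silent input: nothing to predict
        return best_lag, 0.0, residual.copy()
    gain = np.dot(residual[best_lag:], past) / denom
    e = residual.copy()
    e[best_lag:] -= gain * past
    return best_lag, gain, e
```

For a perfectly periodic residual, the predictor finds the pitch period as the lag, a gain near one, and a second-stage residual that is nearly zero after the first period, illustrating how much of the voiced-speech redundancy the long-term stage captures.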
To avoid both the low speech quality of the original LPC coders and the sub-optimal bit efficiency of the simple baseband RELP coders, which results from the limited flexibility of their residual modeling, many of the more recent speech coding approaches may be considered more flexible applications of the RELP principle, with a long-term predictor also included. Examples include the Multi-Pulse LPC arrangement of Atal, U.S. Pat. No. 4,701,954, the Algebraic Code Excited Linear Prediction arrangement of Adoul, U.S. Pat. No. 5,444,816, and the Regular-Pulse Excited LPC coder of the GSM standard.