The compression of speech, in view of the possible economic gains, has attracted considerable attention. H. W. Dudley's dedicated efforts in this area during his 40 years at Bell Telephone Laboratories, and his contributions, form the basis for most subsequent work regarding conventional vocoders. See H. W. Dudley, "The Vocoder", Bell Laboratories Record, Vol. 17, 1939.
The conventional band-compression speech system based on the analysis-synthesis experiments of Dudley was called the vocoder (voice coder) and is now known as the spectrum channel vocoder.
Other vocoder systems have been built wherein the pitch and excitation information is either extracted, coded, transmitted, and synthesized, or transmitted in part and expanded as in the voice-excited methods. The amplitude spectrum may be transmitted by circuits that track the formants, that determine which of a number of preset channels contain power and to what extent, or that determine the amplitude spectrum by some suitable transform such as the correlation function, with the spectrum information transmitted and synthesized by such means. These approaches give rise to such systems as the autocorrelation vocoder, the formant vocoder, and the voice-excited formant vocoder.
Many other methods of speech compression have been tried, such as frequency division multiplication and time compression and expansion procedures, but these systems generally become more sensitive to transmission noise. See "Reference Data for Radio Engineers", 6th Ed., H. Sams, 1982, pp. 37-33 to 37-36. For examples of these conventional vocoder techniques and their attendant problems, see "Digital Coding of Speech Waveforms" by N. S. Jayant, Proceedings of the IEEE, Vol. 62, pp. 611-632, May 1974.
The advantages of coding a signal digitally are well-known and are widely discussed in the literature. Briefly, digital representation offers ruggedness, efficient signal regeneration, easy encryption, the possibility of combining transmission and switching functions, and the advantage of a uniform format for different types of signals. The price paid for these benefits is the need for increased bandwidths.
More recent research has produced linear predictive analysis. Given a sampled (discrete-time) signal s(n), a powerful and general parametric model for time series analysis, which in this case is a signal prediction or reconstruction model, is given by:

$$ s(n) = -\sum_{k=1}^{p} a(k)\, s(n-k) + G \sum_{l=0}^{q} b(l)\, u(n-l) $$

where s(n) is the output and u(n) is the input (perhaps unknown). The model parameters are a(k) for k = 1, ..., p, b(l) for l = 1, ..., q, and G; b(0) is assumed to be unity. This model, described as an autoregressive moving average (ARMA) or pole-zero model, forms the foundation for the analysis method termed linear prediction. An autoregressive (AR) or all-pole model, for which all of the "b" coefficients except b(0) are zero, is frequently used for speech analysis. In the case of stochastic signals, such as speech, u(n) can be shown to be equivalent to inaccessible white noise without loss of generality, as is the usage in this description. (See Chapter 4 of Graupe, "Time Series Analysis Identification and Adaptive Filtering", Krieger Publishing Co., Malabar, Fla., 1989 (2nd edition).)
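As an illustration of the pole-zero model described above, the following minimal Python sketch evaluates the difference equation sample by sample. It assumes the common sign convention s(n) = -sum a(k) s(n-k) + G sum b(l) u(n-l), with samples before n = 0 taken as zero. The function name `arma_filter` and its argument names are hypothetical and are used for illustration only.

```python
def arma_filter(a, b, gain, u):
    """Evaluate s(n) = -sum_{k=1..p} a(k) s(n-k) + gain * sum_{l=0..q} b(l) u(n-l).

    a: [a(1), ..., a(p)]  feedback (pole) coefficients
    b: [b(0), ..., b(q)]  feedforward (zero) coefficients, with b(0) = 1
    u: input samples; samples before n = 0 are assumed to be zero
    The sign convention on the a(k) term is an assumption of this sketch.
    """
    s = []
    for n in range(len(u)):
        acc = 0.0
        # Autoregressive part: weighted sum of past outputs.
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * s[n - k]
        # Moving-average part: weighted sum of present and past inputs.
        for l in range(len(b)):
            if n - l >= 0:
                acc += gain * b[l] * u[n - l]
        s.append(acc)
    return s


# All-pole (AR) example: b = [1.0] only, so s(n) = 0.5 * s(n-1) + u(n).
impulse = [1.0, 0.0, 0.0, 0.0]
response = arma_filter(a=[-0.5], b=[1.0], gain=1.0, u=impulse)
```

With b reduced to the single coefficient b(0) = 1, the same routine realizes the all-pole model used for speech; the impulse response in the example decays geometrically by a factor of 0.5 per sample.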
In the standard AR formulation of linear prediction, the model parameters are selected to minimize the mean-squared error between the model and the speech data. In one of the variants of linear prediction, the autocorrelation method, the minimization is carried out for a windowed segment of data. In the autocorrelation method, minimizing the mean-squared error of the time domain samples is equivalent to minimizing the integrated ratio of the signal spectrum to the spectrum of the all-pole model. Thus, linear predictive analysis is a good method for spectral analysis whenever the signal is produced by an all-pole system. Most speech sounds fit this model well. One key consideration for linear predictive analysis is the order of the model, p. For speech, if the order is too small, the formant structure is not well represented. If the order is too large, pitch pulses as well as formants begin to be represented. Tenth- or twelfth-order analysis is typical for speech. See "The Electrical Engineering Handbook", pp. 302-314, CRC Press, 1993.
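The autocorrelation method described above can be sketched in Python as follows. This is a minimal illustration, not the method of any particular reference: it assumes the predictor convention s_hat(n) = -sum a(k) s(n-k), uses a Hamming window as one possible choice of window, and solves the resulting normal equations with the Levinson-Durbin recursion. All function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n (one possible window; an assumption here)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r(0..max_lag) of a finite frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a(k) r(|i-k|) = -r(i), i = 1..order.

    Returns ([a(1), ..., a(order)], residual_energy).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # prediction error energy shrinks
    return a[1:], err


def lpc_autocorrelation(frame, order, window=True):
    """Estimate AR coefficients of one frame by the autocorrelation method."""
    if window:
        frame = [x * w for x, w in zip(frame, hamming(len(frame)))]
    return levinson_durbin(autocorrelation(frame, order), order)


# Synthetic check: s(n) = 0.9**n is the impulse response of a one-pole model
# with a(1) = -0.9, so a second-order fit should give a(1) near -0.9 and
# a(2) near 0. The window is disabled to keep the check exact.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc_autocorrelation(frame, order=2, window=False)
```

For speech, the same routine would be called with an order of ten or twelve on successive windowed frames; the frame length and window type are design choices not specified in the text above.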
Telephone-quality speech is normally sampled at 8 kHz and quantized at 8 bits/sample (a rate of 64 kbits/s) for uncompressed speech. Simple compression algorithms like adaptive differential pulse code modulation (ADPCM) use the correlation between adjacent samples to reduce the number of bits used by a factor of two to four or more with almost imperceptible distortion. Much higher compression ratios can be obtained with linear predictive coding (LPC), which models speech as an autoregressive process and sends the parameters of the process as opposed to sending the speech itself. One reference for LPC is "Neural Networks for Speech Processing", by D. P. Morgan and C. L. Scofield, Chapter 4, Kluwer Publishing, Boston, Mass., 1991. With conventional LPC-based methods, it is possible to code speech at less than 4 kbits/s. At very low rates, however, the reproduced speech sounds synthetic and the speaker's identifiability is totally lost. The present invention successfully overcomes these obstacles, allowing heretofore unknown bit rates and speech sound quality.
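The bit-rate figures quoted above follow from simple arithmetic, sketched here in Python. The helper name `bit_rate_kbps` is hypothetical, and the ADPCM rates shown are only the ones implied by the stated reduction factors of two and four.

```python
def bit_rate_kbps(sample_rate_hz, bits_per_sample):
    """Bit rate in kbits/s for uniformly sampled, fixed-width samples."""
    return sample_rate_hz * bits_per_sample / 1000.0


# Uncompressed telephone speech: 8 kHz sampling at 8 bits/sample.
pcm_rate = bit_rate_kbps(8000, 8)

# ADPCM at the stated reduction factors of two and four.
adpcm_rates = [pcm_rate / factor for factor in (2, 4)]

# Compression ratio relative to uncompressed speech for a 4 kbits/s LPC coder.
lpc_ratio = pcm_rate / 4.0
```

Here `pcm_rate` is 64 kbits/s, the ADPCM rates are 32 and 16 kbits/s, and a coder running at 4 kbits/s corresponds to a compression ratio of 16 to 1.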