The present invention generally relates to a method of encoding an analog speech signal via speech analysis in which formant candidates of the speech signal are extracted in real time. More particularly, it relates to the real-time root factoring of the linear predictive coding (LPC) polynomial describing the spectrum of the speech signal, wherein the roots are candidates in determining the formants of the vocal tract, and to the implementation of the method in a formant-based speech recognition system. Alternatively, the method may be implemented in narrow band speech encoding and in interactive data preparation for a speech synthesis system.
Speech analysis, in which a frame of sampled speech in digital form is analyzed to extract its information content, has been accomplished by various techniques as a means of reducing the data rate required to encode an analog speech signal, so that the rate more nearly approximates the actual information content of the speech as heard by a human or as received by an electronic pick-up or receiver device. Such analysis enables analog speech signals to be placed in a compressed digitized form for storage and for transmission over a reduced bandwidth. The resulting compression relative to the original analog speech signal can be used to advantage in speech synthesis, in speech recognition and in the transmission of spoken speech.
A technique known as linear predictive coding is commonly employed in the analysis of speech. This technique is based upon the following relation:

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G \sum_{l=0}^{q} b_l u_{n-l}, \qquad b_0 = 1 \tag{1}$$

where $s_n$ is a signal considered to be the output of some system with some unknown input $u_n$, and where $a_k$, $1 \le k \le p$, $b_l$, $1 \le l \le q$, and the gain $G$ are the parameters of the hypothesized system. In equation (1), the "output" $s_n$ is a linear function of past outputs and of present and past inputs. Thus, the signal $s_n$ is predictable from linear combinations of past outputs and inputs, whereby the technique is referred to as linear prediction.
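The recursion described above can be sketched in a few lines of Python. This is an illustration only, assuming the sign convention $s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G \sum_{l=0}^{q} b_l u_{n-l}$ with $b_0 = 1$ and zero initial conditions; the function name `pole_zero_output` and the example coefficients are hypothetical and are not taken from the specification.

```python
def pole_zero_output(u, a, b, gain=1.0):
    """Evaluate s_n = -sum_k a_k*s_{n-k} + gain * sum_l b_l*u_{n-l}.

    a holds a_1..a_p and b holds b_1..b_q; b_0 is taken to be 1
    (an assumed convention). Samples before n = 0 are treated as zero.
    """
    b_full = [1.0] + list(b)
    s = []
    for n in range(len(u)):
        # Contribution of past outputs s_{n-1} .. s_{n-p}.
        past_outputs = sum(a[k - 1] * s[n - k]
                           for k in range(1, len(a) + 1) if n - k >= 0)
        # Contribution of the present and past inputs u_n .. u_{n-q}.
        inputs = sum(b_full[l] * u[n - l]
                     for l in range(len(b_full)) if n - l >= 0)
        s.append(-past_outputs + gain * inputs)
    return s


# Unit impulse through a one-pole, one-zero system (hypothetical values).
impulse = [1.0, 0.0, 0.0, 0.0]
response = pole_zero_output(impulse, a=[-0.5], b=[0.25], gain=2.0)
```

With these values the recursion reduces to $s_n = 0.5\,s_{n-1} + 2(u_n + 0.25\,u_{n-1})$, so each output depends on one past output and on the present and one past input.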
By taking the z transform of both sides of equation (1), the transfer function $H(z)$ of the system is obtained:

$$H(z) = \frac{S(z)}{U(z)} = G \, \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{2}$$

where

$$S(z) = \sum_{n=-\infty}^{\infty} s_n z^{-n} \tag{3}$$

is the z transform of $s_n$, and $U(z)$ is the z transform of $u_n$. In equation (2), $H(z)$ is the general pole-zero model, with the roots of the numerator and denominator polynomials being the zeros and poles of the model, respectively. Linear predictive modeling generally has been accomplished by using a special form of the general pole-zero model of equation (2), namely the autoregressive or all-pole model, in which it is assumed that the signal $s_n$ is a linear combination of past values and some input $u_n$, as in the following relationship:

$$s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G u_n \tag{4}$$

where $G$ is a gain factor. The transfer function $H(z)$ in equation (2) now reduces to an all-pole transfer function

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{5}$$

Given a particular signal sequence $s_n$, speech analysis according to the all-pole transfer function of equation (5) produces the predictor coefficients $a_k$ and the gain $G$ as speech parameters.
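One common way to obtain the predictor coefficients and gain of an all-pole model from a frame of samples is the autocorrelation method solved with the Levinson-Durbin recursion. The Python sketch below illustrates that general technique; it is not asserted to be the procedure used by the invention. It assumes the convention $s_n = -\sum_k a_k s_{n-k} + G u_n$, applies no analysis window, and uses hypothetical function names and test coefficients.

```python
import math


def autocorrelation(frame, max_lag):
    """Return R(0)..R(max_lag) for one frame (no window applied here)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for a_1..a_p and G.

    Assumes the model s_n = -sum_k a_k*s_{n-k} + G*u_n.
    """
    a = [1.0]        # a[0] = 1 by convention
    error = r[0]     # prediction error energy at order 0
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                  # reflection coefficient of order i
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        error *= (1.0 - k * k)
    return a[1:], math.sqrt(error)


# Synthetic check: impulse response of a known two-pole system
# (hypothetical coefficients chosen so that the poles are stable).
true_a, true_gain = [-0.9, 0.5], 2.0
frame = []
for n in range(200):
    s = true_gain if n == 0 else 0.0
    for k, ak in enumerate(true_a, start=1):
        if n - k >= 0:
            s -= ak * frame[n - k]
    frame.append(s)

est_a, est_gain = levinson_durbin(autocorrelation(frame, 2), 2)
```

Because the test frame is the (effectively complete) impulse response of an all-pole system, the estimated coefficients and gain agree with the values used to generate it up to rounding. On real speech frames the estimates depend on the window, the frame length and the chosen order.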
It has long been known that certain speech sounds, most notably the vowels, may be identified and synthesized from a knowledge of the formant frequencies, or speech formants, used in the analysis and perception of speech. See, for example, "Automatic Extraction of Formant Frequencies from Continuous Speech", Flanagan, appearing in Journal of the Acoustical Society of America, Vol. 28, pp. 110-118 (Jan. 1956) and "System for Automatic Formant Analysis of Voiced Speech", Schafer and Rabiner, appearing in Journal of the Acoustical Society of America, Vol. 47, pp. 634-648 (Feb. 1970), each of which is hereby incorporated by reference. In this respect, formant frequency data contains more inherent speech intelligence than reflection coefficient data, which is the usual form of the speech parameters employed in the linear predictive coding of speech. To this end, efforts have been continuously directed toward the extraction of formant frequencies from continuous speech signals as a basis of speech analysis, so that the extracted formant frequencies carry a high degree of speech intelligence for use in subsequent speech synthesis, speech recognition or speech data transmission. Heretofore, the extraction of formant frequency data from sampled digital speech data has been recognized as a desirable goal, but efforts to achieve real-time determination of speech formants have not been generally regarded as satisfactory.
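To make the relation between predictor coefficients and formant data concrete, the Python sketch below converts one second-order factor $1 + a_1 z^{-1} + a_2 z^{-2}$ of an all-pole denominator into a formant candidate. It relies on the standard mapping of a complex-conjugate root pair $r e^{\pm j\theta}$ to a frequency $\theta f_s / 2\pi$ and a bandwidth $-(f_s/\pi)\ln r$. The function name, the 8000 Hz sampling rate and the example resonance are assumptions made for illustration. A full predictor polynomial of order $p$ would first have to be factored by a root-finding procedure, which this sketch does not attempt.

```python
import cmath
import math


def formant_from_quadratic(a1, a2, sample_rate_hz):
    """Return (frequency_hz, bandwidth_hz) for 1 + a1*z^-1 + a2*z^-2.

    The roots solve z^2 + a1*z + a2 = 0. A complex-conjugate pair
    r*exp(+/-j*theta) maps to frequency theta*fs/(2*pi) and bandwidth
    -ln(r)*fs/pi. Returns None when the roots are real (no resonance).
    """
    disc = a1 * a1 - 4.0 * a2
    if disc >= 0.0:
        return None
    root = (-a1 + cmath.sqrt(disc)) / 2.0   # root in the upper half plane
    theta = cmath.phase(root)
    radius = abs(root)
    frequency = theta * sample_rate_hz / (2.0 * math.pi)
    bandwidth = -math.log(radius) * sample_rate_hz / math.pi
    return frequency, bandwidth


# Build a factor from a hypothetical 500 Hz resonance with 100 Hz bandwidth
# at an assumed 8000 Hz sampling rate, then recover it from the coefficients.
fs = 8000.0
r = math.exp(-math.pi * 100.0 / fs)
theta = 2.0 * math.pi * 500.0 / fs
a1, a2 = -2.0 * r * math.cos(theta), r * r
freq_hz, bw_hz = formant_from_quadratic(a1, a2, fs)
```

Roots with a small bandwidth lie close to the unit circle and are the more plausible formant candidates. Real roots and very broad pairs usually shape the overall spectral tilt rather than mark a vocal tract resonance.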