Speech recognition systems compare templates of digital representations of incoming speech with templates of digital representations of reference speech. In one form of speech recognition system, words are represented using the linear predictive coding (LPC) technique.
The LPC technique is based on the recognition that speech production involves an excitation and a filtering process. The excitation is produced by vocal cord vibration for voiced speech and by turbulence for unvoiced speech. The excitation is then modified by the filtering action of the resonance chambers of the vocal tract, including the mouth and nasal passages, and by the effects of radiation from the lips. The vocal tract resonates at the formant frequencies. The vocal cords and lip radiation cause the overall energy of the sound to roll off at higher frequencies. For a frame of samples of speech, a digital filter can be defined which simulates the formant effects of the vocal tract and the slope function of the vocal cords and the radiation from the lips. The frame of speech can then be defined by that filter and a residual signal which approximates the excitation.
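The excitation-and-filter view can be illustrated with a minimal sketch. It assumes an idealized voiced excitation (a unit impulse train) and a single two-pole resonator standing in for one formant; the slope function is not modelled. The function names, the 8 kHz sample rate, and the formant and pitch values in the usage note are illustrative assumptions, not values from the text.

```python
import math

def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Feedback coefficients of a two-pole resonator (one stand-in formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2.0 * math.pi * freq_hz / sample_rate         # pole angle
    return 2.0 * r * math.cos(theta), -r * r

def impulse_train(n_samples, period):
    """Idealized voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if t % period == 0 else 0.0 for t in range(n_samples)]

def synthesize(excitation, b1, b2):
    """Filter the excitation: y[t] = e[t] + b1*y[t-1] + b2*y[t-2]."""
    out = []
    for t, e in enumerate(excitation):
        y1 = out[t - 1] if t >= 1 else 0.0
        y2 = out[t - 2] if t >= 2 else 0.0
        out.append(e + b1 * y1 + b2 * y2)
    return out

def inverse_filter(signal, b1, b2):
    """Undo the resonator; the result is the residual (the excitation)."""
    res = []
    for t, y in enumerate(signal):
        y1 = signal[t - 1] if t >= 1 else 0.0
        y2 = signal[t - 2] if t >= 2 else 0.0
        res.append(y - b1 * y1 - b2 * y2)
    return res
```

For example, `synthesize(impulse_train(240, 80), *resonator_coefficients(500.0, 100.0, 8000.0))` produces a frame whose residual, recovered with `inverse_filter` and the same coefficients, is the original impulse train up to rounding error. In practice the filter is not known in advance and has to be estimated from the frame itself.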
In the LPC technique, speech sound is modelled as an all-pole filter excited by an impulse train. The all-pole filter is $H(z) = 1/A(z)$, where $H(z)$ is the approximation of the formant and slope function filter and $A(z)$ is the inverse LPC filter of the system. The filter is defined by the prediction coefficients $a_i$ in a polynomial function of z. A frame of speech samples is approximated by an excitation signal and the vector of coefficients $a_i$, the LPC vector. A series of LPC vectors and the excitation function can be derived from sequential frames of speech samples to define a unit of speech such as a word. By comparing the template of LPC vectors generated from an unknown unit of speech with a set of reference templates of known units of speech, the unknown unit of speech can be identified.
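One common way to obtain an LPC vector from a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below is one such approach, not necessarily the one used by the system described here. It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, which is chosen for the example and not taken from the text; the function names and the order parameter are likewise hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]

def lpc_vector(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.

    Returns (a, err): a = [1, a_1, ..., a_p] are the coefficients of
    A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (assumed sign convention),
    err is the energy of the prediction residual.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame: stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def residual(frame, a):
    """Inverse-filter the frame with A(z) to approximate the excitation."""
    p = len(a) - 1
    return [sum(a[j] * frame[t - j] for j in range(p + 1) if t - j >= 0)
            for t in range(len(frame))]
```

Applying `lpc_vector` to each successive frame of a word yields the sequence of LPC vectors that forms its template. Practical systems usually window each frame first; that step is omitted here for brevity.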
Because of differences in words spoken by different individuals and by a particular individual at different times, there will not be an exact match between the generated template and a reference template. To minimize the effects of the speed at which words are spoken, a dynamic programming technique has been developed which provides for nonlinear time alignment, or time warping, of individual LPC vectors to bring each vector into closer correspondence with the vector of the template to which it is being compared. See Sakoe and Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Trans. ASSP, Vol. 26, pp. 43-49, 1978.
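The nonlinear time alignment can be sketched as a basic dynamic time warping recursion. This is a minimal illustration under simplifying assumptions: it omits the slope constraints and adjustment window discussed by Sakoe and Chiba, and it uses plain Euclidean distance between frame vectors purely for brevity. The names `dtw_distance` and `recognize` are hypothetical.

```python
def euclidean(u, v):
    """Distance between two frame vectors (assumed measure, for brevity)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def dtw_distance(test, ref, dist=euclidean):
    """Accumulated distance along the best nonlinear time alignment."""
    inf = float("inf")
    n, m = len(test), len(ref)
    # d[i][j]: best accumulated cost aligning test[:i] with ref[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(test[i - 1], ref[j - 1])
            # a frame may repeat in either sequence, or both may advance
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(test, references):
    """Return the label of the reference template closest to `test`."""
    return min(references,
               key=lambda label: dtw_distance(test, references[label]))
```

Here `references` is assumed to be a dictionary mapping each word label to its template, a list of frame vectors. A word spoken more slowly than its reference still aligns at low cost, because repeated frames are absorbed by the warping path.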
Another approach to speech recognition uses a direct spectral domain representation, either a discrete filter-bank or the discrete power spectrum generated in a Fourier transform of a speech frame. The template of the transform coefficients of successive frames of speech can be compared to like reference templates to identify a word. In one application of this approach, the slope function of the frequency response is removed and the comparison of templates is based on the fine harmonics of the speech and on the formant frequencies. To allow for shifts in frequency resulting from different speakers, a dynamic programming technique incorporating frequency warping algorithms has been developed to provide spectral warping of each frame of test speech against a reference template. In spectral warping, a nonlinear spectral shift in which, for example, lower frequencies are expanded and higher frequencies are compressed along the frequency axis has been found to provide better results. See Matsumoto and Wakita, "Speaker Normalization by Frequency Warping", Speech Research Semi., S79-25, Japan, July 1979.
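A nonlinear frequency warp of this kind can be illustrated with a simple resampling of one frame's power spectrum. The sketch assumes a one-parameter power-law warp and a brute-force search over candidate warp factors; both are stand-ins chosen for brevity and are not the dynamic programming technique cited above. The names `warp_spectrum`, `best_warp`, and `alpha` are hypothetical.

```python
def warp_spectrum(spectrum, alpha):
    """Resample a power spectrum along a warped frequency axis.

    The output bin at normalized frequency g takes its value from input
    frequency f = g ** (1 / alpha). With alpha < 1 the low-frequency
    region is stretched and the high-frequency region is compressed;
    alpha == 1 leaves the spectrum unchanged.
    """
    n = len(spectrum)
    if n < 2:
        return list(spectrum)
    out = []
    for k in range(n):
        g = k / (n - 1)
        pos = (g ** (1.0 / alpha)) * (n - 1)   # fractional source bin
        lo = min(int(pos), n - 2)
        frac = pos - lo
        # linear interpolation between the two neighbouring source bins
        out.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return out

def best_warp(test, ref, alphas):
    """Pick the candidate warp factor minimizing squared error to `ref`."""
    def err(alpha):
        warped = warp_spectrum(test, alpha)
        return sum((x - y) ** 2 for x, y in zip(warped, ref))
    return min(alphas, key=err)
```

With a 17-bin spectrum whose only peak is at bin 4, `warp_spectrum(spectrum, 0.5)` moves the peak to bin 8 while leaving the first and last bins in place. A dynamic programming formulation would instead choose the warp bin by bin, which allows a more flexible mapping than a single factor.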