Most existing speech recognition systems pre-process input speech, prior to the processing actually needed for speech recognition, without using any knowledge of the speaker. The prior art systems create a spectrum by any of a number of techniques, such as linear predictive coding, bandpass filtering, transforms (particularly Fast Fourier Transforms), and time domain analysis such as zero crossing counts. These techniques have varying disadvantages, but all are applied without any information about the speech characteristics of the speaker; that is, they use no speaker-specific parameters estimated from an independent body of speech.
Bandpass filters use fixed frequency bands. For example, Lokerson (U.S. Pat. No. 4,039,754) uses three bandpass filters with ranges of 336-742 Hz, 574-2226 Hz, and 1750-3710 Hz, corresponding to typical ranges of the first, second, and third formants of speech. Since the formants are energy peaks of the speech and depend upon the physical makeup of the speaker, the locations of these peaks vary from speaker to speaker and will appear in different bands from one speaker to another. Thus, for example, the second filter in a set of bandpass filters will have a different meaning for a speaker who has a high first formant than for a speaker who has a lower first formant.
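By way of illustration only, the following minimal sketch shows how a fixed filter bank assigns the same kind of peak to different bands for different speakers. The band edges are those quoted above for Lokerson; the function name `bands_containing` and the two first-formant values are hypothetical and chosen purely for the example.

```python
# Band edges in Hz, as quoted in the text for Lokerson's three filters.
FIXED_BANDS_HZ = [(336, 742), (574, 2226), (1750, 3710)]


def bands_containing(freq_hz):
    """Return the 1-based indices of the fixed bands that contain freq_hz."""
    return [
        index
        for index, (low_hz, high_hz) in enumerate(FIXED_BANDS_HZ, start=1)
        if low_hz <= freq_hz <= high_hz
    ]


# Hypothetical first-formant frequencies for two speakers (illustrative only).
for label, first_formant_hz in [("speaker A", 450.0), ("speaker B", 650.0)]:
    print(label, first_formant_hz, bands_containing(first_formant_hz))
```

Under these assumed values, the first formant of one speaker falls only in the first band, while that of the other also falls in the second band, so the output of the second filter does not carry the same meaning for both.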
Further, a set of fixed bandpass filters must have a fixed range of coverage. Therefore, the set must have a minimum band which covers the lowest frequency range that it expects to be able to treat and a maximum band which covers the highest frequency range which it expects to treat. Because this range of values is determined without reference to a specific speaker, some bands will be of minimal, if any, value for any single speaker. Such bands add noise to the analysis, since they are not meaningful for the particular speaker, and they waste system resources.
Linear Predictive Coding (LPC) is a method of approximating the spectrum of a signal by fitting that spectrum with a representation characterized by a fixed number of parameters. For example, a tenth-order LPC implementation might be used in a typical speech processing application, allowing ten parameters to be fit to the spectrum in every time interval. A difficulty in utilizing LPC when the recognition technique is based upon typical pattern recognition technology is that a given LPC coefficient does not have the same meaning from speaker to speaker, or even from speech frame to speech frame of the same speaker. For example, the second LPC coefficient may at one time fit one portion of the spectrum and at another time another portion of the spectrum. Thus, it is very difficult to interpret an LPC coefficient as having a specific meaning even when utilized with a single speaker. The variation in LPC coefficients from speaker to speaker is even greater.
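By way of illustration only, the following sketch computes tenth-order LPC coefficients for two synthetic frames using the autocorrelation method with the Levinson-Durbin recursion. This is one common textbook formulation and is not attributed here to any particular prior art system. The function names, the 8000 Hz sample rate, the 240-sample frame length, and the component frequencies are all assumptions made for the example.

```python
import math
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation of frame for lags 0..max_lag (no normalization)."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame, order):
    """Return predictor coefficients a[1..order], where x[n] is approximated
    by sum(a[k] * x[n - k]), using the Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:  # degenerate frame (e.g. all zeros): stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def synthetic_frame(freqs_hz, sample_rate_hz=8000, length=240, seed=0):
    """Sum of sinusoids plus a little seeded noise (a stand-in for speech)."""
    rng = random.Random(seed)
    return [
        sum(math.sin(2 * math.pi * f * n / sample_rate_hz) for f in freqs_hz)
        + 0.01 * rng.gauss(0.0, 1.0)
        for n in range(length)
    ]


# Two frames whose spectral peaks lie at different (assumed) frequencies.
coeffs_a = lpc(synthetic_frame([500, 1500]), order=10)
coeffs_b = lpc(synthetic_frame([700, 1100]), order=10)
print("second coefficient, frame A:", coeffs_a[1])
print("second coefficient, frame B:", coeffs_b[1])
```

Each frame yields ten numbers, but a given position in the coefficient vector is not tied to any fixed region of the spectrum, which is the difficulty of interpretation described above.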
Transforms such as Fast Fourier Transforms or Hadamard transforms can be viewed as a series of equally spaced and narrow bandpass filters. The disadvantages of using such transforms are similar to those of bandpass filters, but to some degree magnified because there are more such filters.
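By way of illustration only, the following sketch lists the center frequencies of the bins of a Fourier transform of real input, viewed as a bank of equally spaced filters. The function name, the 8000 Hz sample rate, and the 256-point transform size are assumptions made for the example.

```python
def fft_bin_centers_hz(sample_rate_hz, fft_size):
    """Center frequencies of the non-negative-frequency bins of an
    fft_size-point transform of real input sampled at sample_rate_hz."""
    spacing_hz = sample_rate_hz / fft_size
    return [k * spacing_hz for k in range(fft_size // 2 + 1)]


centers = fft_bin_centers_hz(8000, 256)
print("number of bins:", len(centers))
print("bin spacing in Hz:", centers[1] - centers[0])
```

Under these assumptions the transform behaves like well over a hundred fixed, narrow bands, compared with the three bands of the bandpass example, and the position of each is fixed without reference to the speaker.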
Pitch tracking is used in some speech processing systems. Pitch tracking detects pitch period information, which can be used in speech recognition as proposed by Lea, Trends in Speech Recognition, Prentice Hall, 1980, pp. 166-205. Pitch information can also be used to smooth some of the data by removing the modulation of those parameters by the pitch frequency. Pitch tracking can further be used to "pitch-synchronize" the data, so that the data utilized in a speech recognition system is a set of parameters for each pitch period rather than for an arbitrary time period.
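By way of illustration only, the following sketch estimates a pitch period by locating the lag of maximum autocorrelation within a search range. This is one simple textbook approach and is not attributed here to Lea or to any particular system. The function names, the 50-400 Hz search range, the 8000 Hz sample rate, and the impulse-train test signal (a crude stand-in for glottal pulses) are assumptions made for the example.

```python
def estimate_pitch_period(frame, sample_rate_hz, min_hz=50.0, max_hz=400.0):
    """Return the lag, in samples, with the highest autocorrelation among
    lags corresponding to pitch frequencies between min_hz and max_hz."""
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = min(int(sample_rate_hz / min_hz), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:  # ties keep the shorter lag
            best_lag, best_score = lag, score
    return best_lag


def impulse_train(period_samples, length):
    """One unit impulse every period_samples samples."""
    return [1.0 if n % period_samples == 0 else 0.0 for n in range(length)]


SAMPLE_RATE_HZ = 8000
for pitch_hz in (100, 200):
    frame = impulse_train(SAMPLE_RATE_HZ // pitch_hz, 400)
    period = estimate_pitch_period(frame, SAMPLE_RATE_HZ)
    print(pitch_hz, "Hz ->", period, "samples")
```

The search must span the whole assumed range of 50-400 Hz because nothing is known about the speaker in advance.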
Pitch tracking for creating pitch-synchronous data is motivated in part by the following logic. The pitch period of a speaker is determined by the characteristics of the speaker's vocal cords. For a given speaker, the pitch period can vary by a factor of four from the lowest to highest pitch period depending upon the sound being spoken, the stress placed upon the word, and the position in the sentence of the word. From speaker to speaker, the average pitch also varies greatly. For example, it is well known that females on the average have a shorter pitch period than males. This variability in pitch makes it impossible to pick a single time period for analyzing the spectrum of the data which always includes exactly one pitch period. Spectral analysis in equal time intervals creates distortion in the spectrum and in some cases averages out information that is important. Further, the amount of data created by a fixed sampling period will be unrelated to the information in the signal. For a low pitch, the spectrum can be calculated less frequently and yet contain all the relevant information in the signal. For a high pitch, the information must typically be sampled more frequently to contain all the relevant information in the signal. This accounts in part for some recognition systems having more difficulty with female voices than with male voices.
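By way of illustration only, the arithmetic behind this argument can be sketched as follows. The function name, the 10 ms analysis window, and the 80-320 Hz pitch range (chosen to span a factor of four) are assumptions made for the example.

```python
def periods_per_window(pitch_hz, window_ms):
    """Number of pitch periods that fit in a fixed analysis window."""
    return pitch_hz * window_ms / 1000.0


WINDOW_MS = 10.0  # assumed fixed analysis interval
for pitch_hz in (80.0, 100.0, 200.0, 320.0):
    print(pitch_hz, "Hz:", periods_per_window(pitch_hz, WINDOW_MS), "periods")
```

Under these assumptions the same fixed window covers less than one pitch period at the low end of the range and more than three at the high end, so no single window length contains exactly one period throughout.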
Approaches to pitch tracking have varied greatly, but they all suffer from one major defect that seriously reduces their effectiveness. Because they assume no knowledge of the speaker, they must be adaptive or highly general in order to cover the wide range of pitch that can and will be encountered. In attempting to maintain such generality, they are typically either less accurate or more computationally intensive, and hence slower, than is acceptable.