A common task in automatic speech recognition is to recognize a set of words from any speaker without training the system for each new speaker. This is done by storing the reference word templates in a form that will match a variety of speakers. U.S. Pat. No. 5,822,728, entitled “Multistage Word Recognizer Based On Reliably Detected Phoneme Similarity Regions” and assigned to the Assignee of the present invention, describes word templates composed of phoneme similarities. In that work, the phoneme similarities were computed using the Mahalanobis distance, which was expanded with an exponential function and normalized globally over the entire phoneme set. The assumption of U.S. Pat. No. 5,822,728 is that if the speech process can be modeled as a Gaussian distribution, then the likelihood of the phoneme being spoken can be computed.
In the Mahalanobis distance algorithm, only relative phonetic unit similarities are computed, because the values are normalized over the entire phoneme set. This means that even in non-speech segments, some phonetic units will receive high similarity values. Because of this, the Mahalanobis algorithm generally needs to be coupled with a speech detection algorithm so that the similarities are only computed on speech segments.
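The effect described above can be illustrated with a minimal sketch. It is not the computation of U.S. Pat. No. 5,822,728; it assumes a diagonal covariance shared by all classes, an exponential of the form exp(-d²/2), and normalization by the sum over classes. The names `mahalanobis_sq`, `relative_similarities`, and the toy data are hypothetical.

```python
import math

def mahalanobis_sq(x, mean, inv_var):
    # Squared Mahalanobis distance, assuming a diagonal covariance
    # (inv_var holds the reciprocal variance of each dimension).
    return sum(iv * (a - m) ** 2 for a, m, iv in zip(x, mean, inv_var))

def relative_similarities(frame, means, inv_var):
    # Exponential of the distance, normalized over all classes.
    # Subtracting the smallest distance first avoids underflow to 0/0
    # and does not change the normalized result.
    d2 = [mahalanobis_sq(frame, m, inv_var) for m in means]
    best = min(d2)
    scores = [math.exp(-0.5 * (d - best)) for d in d2]
    total = sum(scores)
    return [s / total for s in scores]

# Three toy class means and a frame that is far from all of them,
# standing in for a non-speech segment (illustrative values only).
means = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
inv_var = [1.0, 1.0]
noise_frame = [50.0, 50.0]
sims = relative_similarities(noise_frame, means, inv_var)
```

Because the values are forced to sum to one, at least one class always scores 1/K or higher, however poorly the frame fits every class. In this toy case the two nearest classes share almost all of the similarity.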
Accordingly, it is desirable in the art of speech recognition to provide an automatic speech recognition system in which an assumption of Gaussian distribution is not required. Also, it is desirable to provide an automatic speech recognition system in which the subword units to be modeled are not required to be phonemes, but can be of any sound class such as monophones, diphones, vowel groups, consonant groups, or statistically clustered units.
The present invention utilizes a linear discriminant vector which is computed independently for each sound class. At recognition time, the time-spectral pattern for the current time interval and those in the immediate temporal neighborhood are collected together and treated as one large parameter vector. The dot product (also called “inner product”) of this vector and each discriminant vector is computed. Each product is then provided as a measure of the confidence that the corresponding sound class is present. Since the discriminant vectors are computed separately, a numeric value for one sound class might not have the same meaning as for another sound class. To normalize the values between sound classes, a normalization function is used. According to an embodiment of the present invention, a look-up table is utilized for the normalization function. The look-up table can be computed from histograms of training utterances. The normalization function is computed such that a large negative value (minus A) indicates high confidence that the utterance does not contain the sound class, a large positive value (plus A) indicates high confidence that the utterance does contain the sound class, and a value of “0” indicates no confidence either way.
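The stacking and dot-product steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the helper names `stack_frames` and `raw_scores`, the clamping of neighbors at the utterance edges, and the toy discriminant vectors are all hypothetical. In practice the discriminant vectors would come from training.

```python
def stack_frames(frames, t, context):
    # Concatenate the spectral frame at time t with `context` neighbors
    # on each side into one large parameter vector.
    # Assumption: indices past the utterance edges are clamped.
    last = len(frames) - 1
    stacked = []
    for i in range(t - context, t + context + 1):
        stacked.extend(frames[min(max(i, 0), last)])
    return stacked

def raw_scores(vector, discriminants):
    # One dot product per sound class; each class has its own
    # independently computed discriminant vector.
    return {
        name: sum(w * x for w, x in zip(weights, vector))
        for name, weights in discriminants.items()
    }

# Toy data: three 2-dimensional frames, one neighbor on each side.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stacked = stack_frames(frames, t=1, context=1)

# Hypothetical discriminant vectors, one per sound class.
discriminants = {
    "vowel": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "fricative": [0.0, 0.0, 0.0, -1.0, 1.0, 0.0],
}
scores = raw_scores(stacked, discriminants)
```

The raw scores are on class-specific scales, which is why a separate normalization step is needed before they can be compared across sound classes.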
The normalized similarity values for all sound classes are collected to form a normalized similarity vector.
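One way a histogram-based look-up table could be built is sketched below. The text does not specify the mapping, so the formula used here, A·(p − n)/(p + n) per histogram bin, where p and n count in-class and out-of-class training scores, is an assumption chosen only because it yields plus A, minus A, and 0 in the cases described. The name `build_normalizer`, the bin layout, and the training values are hypothetical.

```python
def build_normalizer(in_class, out_class, lo, hi, n_bins, a=1.0):
    # Histogram the raw discriminant scores of training frames that do
    # (in_class) and do not (out_class) contain the sound class, then
    # map each bin to a confidence in [-a, +a].
    width = (hi - lo) / n_bins

    def bin_of(value):
        # Scores outside [lo, hi) fall into the first or last bin.
        return min(max(int((value - lo) / width), 0), n_bins - 1)

    pos = [0] * n_bins
    neg = [0] * n_bins
    for v in in_class:
        pos[bin_of(v)] += 1
    for v in out_class:
        neg[bin_of(v)] += 1

    # Assumed mapping: empty or evenly split bins give 0 (no confidence).
    table = [a * (p - n) / (p + n) if p + n else 0.0
             for p, n in zip(pos, neg)]

    def normalize(value):
        return table[bin_of(value)]

    return table, normalize

# Toy training scores for a single sound class.
table, normalize = build_normalizer(
    in_class=[1.5, 1.2, 0.4],
    out_class=[-1.5, -1.1, 0.6],
    lo=-2.0, hi=2.0, n_bins=4,
)

# With one normalizer per class, the normalized similarity vector is
# the list of normalized scores taken over all sound classes.
normalized_vector = [normalize(s) for s in (1.8, -1.9, 0.5)]
```

A real table would likely use many more bins and some smoothing of sparsely populated bins; neither is shown here.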
The normalized similarity vector is then used by a word matcher for comparison with prestored reference vectors in order to determine the words of the input speech utterance.
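The comparison performed by the word matcher is not detailed at this point, so the following is only an illustrative sketch. It assumes each word is stored as a sequence of reference vectors with the same length as the observed sequence, and it picks the word with the smallest summed squared Euclidean distance. The names `frame_distance` and `match_word`, the distance measure, and the data are hypothetical; an actual matcher would also need some form of time alignment.

```python
def frame_distance(u, v):
    # Squared Euclidean distance between two normalized similarity vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def match_word(observed, references):
    # Return the word whose reference sequence is closest to the observed
    # sequence. Assumption: all sequences have equal length, so frames
    # can be compared position by position.
    def total(reference):
        return sum(frame_distance(o, r) for o, r in zip(observed, reference))
    return min(references, key=lambda word: total(references[word]))

# Toy references: two words, two frames each, two sound classes.
references = {
    "yes": [[1.0, -1.0], [1.0, -1.0]],
    "no": [[-1.0, 1.0], [-1.0, 1.0]],
}
observed = [[0.8, -0.6], [0.9, -1.0]]
best_word = match_word(observed, references)
```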
Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.