Many automatic speech recognition systems use a pronunciation dictionary to identify particular words contained in received utterances. The term “utterance” is used herein to refer to one or more sounds generated either by humans or by machines. Examples of an utterance include, but are not limited to, a single sound, any two or more sounds, a single word or two or more words. In general, a pronunciation dictionary contains data that defines expected pronunciations of utterances. Each pronunciation comprises a set of phonemes. Each phoneme is defined using a plurality of acoustic models, each of which comprises values for various audio and speech characteristics that are associated with a phoneme.
When an utterance is received, the received utterance, or at least a portion of the received utterance, is compared to the expected pronunciations contained in the pronunciation dictionary. An utterance is recognized when the received utterance, or portion thereof, matches the expected pronunciation contained in the pronunciation dictionary. Recognition involves determining that phonemes identified in the utterance match acoustic models of corresponding phonemes of a particular vocabulary word, within predefined bounds of tolerance.
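The dictionary lookup described above can be illustrated with a minimal sketch. It is not the implementation of any particular recognizer. It assumes, for illustration only, that the received utterance has already been reduced to one feature vector per phoneme, that each acoustic model is a single two-dimensional reference point, and that the "bounds of tolerance" are a Euclidean distance threshold. The names PHONEME_MODELS, DICTIONARY, and recognize, and all numeric values, are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Feature = Tuple[float, float]

# Hypothetical data: each phoneme is defined by one or more acoustic models,
# here simplified to 2-D reference points (real models hold many more values).
PHONEME_MODELS: Dict[str, List[Feature]] = {
    "AO": [(1.0, 0.0), (0.9, 0.1)],
    "S": [(0.0, 1.0)],
    "T": [(0.5, 0.5)],
    "AH": [(0.2, 0.8)],
    "N": [(0.8, 0.2)],
    "B": [(0.0, 0.0)],
}

# Hypothetical pronunciation dictionary: word -> expected phoneme sequence.
DICTIONARY: Dict[str, List[str]] = {
    "AUSTIN": ["AO", "S", "T", "AH", "N"],
    "BOSTON": ["B", "AO", "S", "T", "AH", "N"],
}


def phoneme_distance(phoneme: str, feature: Feature) -> float:
    """Distance from an observed feature to the closest model of a phoneme."""
    return min(math.dist(model, feature) for model in PHONEME_MODELS[phoneme])


def recognize(observed: Sequence[Feature], tolerance: float) -> Optional[str]:
    """Return the best-matching word, or None if no word is within tolerance.

    Assumption: one observed feature vector per phoneme, so only
    pronunciations of the same length as the observation are compared.
    """
    best_word, best_score = None, math.inf
    for word, phonemes in DICTIONARY.items():
        if len(phonemes) != len(observed):
            continue
        dists = [phoneme_distance(p, f) for p, f in zip(phonemes, observed)]
        if max(dists) > tolerance:
            continue  # at least one phoneme falls outside the tolerance bound
        if sum(dists) < best_score:
            best_word, best_score = word, sum(dists)
    return best_word


# Hypothetical observation lying close to the expected pronunciation of AUSTIN.
OBSERVED = [(0.95, 0.05), (0.0, 0.9), (0.5, 0.5), (0.2, 0.8), (0.8, 0.25)]
```

In this sketch, recognize(OBSERVED, tolerance=0.15) selects AUSTIN, while a tighter tolerance of 0.05 rejects every entry, which shows how the tolerance bound governs whether an utterance is recognized at all.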
Often acoustic models are modified or “trained” based on actual received utterances in order to improve the ability of the speech recognizer to discriminate among different phonetic units. Although each acoustic model is associated with a particular phoneme, a dictionary based on such acoustic models may have several entries that sound similar or comprise similar sets of phonemes. These vocabulary words may be difficult for the speech recognizer to distinguish. Confusion among such words can cause errors in an application with which the speech recognizer is used.
One reason that such confusion can occur is that acoustic models are normally trained using generic training information, without reference to the context in which the speech recognizer or a related application is used. As a result, the speech recognizer lacks information that can be used to discriminate between phonemes or other phonetic units that may be particularly relevant to the specific task with which the speech recognizer is used.
For example, the English words AUSTIN and BOSTON sound similar and may be difficult for a speech recognizer to distinguish. If the speech recognizer is used in an airline ticket reservation system, and both AUSTIN and BOSTON are in the vocabulary, confusion of AUSTIN and BOSTON may lead to ticketing errors or user frustration.
As another example, consider the spoken numbers FIFTY and FIFTEEN. If the speech recognizer is used in a stock trading system, confusion of FIFTY and FIFTEEN may lead to erroneous orders or user frustration.
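The similarity of the word pairs in the two examples above can be made concrete with a short sketch that counts how many phoneme insertions, deletions, or substitutions separate two pronunciations. The phoneme transcriptions below are illustrative ARPAbet-style approximations chosen for this example, not entries taken from any particular pronunciation dictionary, and the function names and the threshold of one edit are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# Illustrative ARPAbet-style transcriptions (assumed for this example).
VOCABULARY: Dict[str, List[str]] = {
    "AUSTIN": ["AO", "S", "T", "AH", "N"],
    "BOSTON": ["B", "AO", "S", "T", "AH", "N"],
    "DENVER": ["D", "EH", "N", "V", "ER"],
    "FIFTY": ["F", "IH", "F", "T", "IY"],
    "FIFTEEN": ["F", "IH", "F", "T", "IY", "N"],
}


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance computed over phoneme symbols, not characters."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete pa
                cur[j - 1] + 1,            # insert pb
                prev[j - 1] + (pa != pb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def confusable_pairs(vocab: Dict[str, List[str]],
                     max_edits: int = 1) -> List[Tuple[str, str]]:
    """List word pairs whose pronunciations differ by at most max_edits."""
    return [
        (w1, w2)
        for w1, w2 in combinations(sorted(vocab), 2)
        if phoneme_edit_distance(vocab[w1], vocab[w2]) <= max_edits
    ]
```

Under these assumed transcriptions, AUSTIN and BOSTON differ by a single phoneme, as do FIFTY and FIFTEEN, whereas DENVER is several edits away from every other entry. A symbolic count of this kind only flags candidate pairs; whether a recognizer actually confuses them depends on its acoustic models.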
Prior approaches that use generic modeling include, for example:
B. Juang et al., “Discriminative Learning for Minimum Error Classification,” IEEE Trans. on Signal Processing, vol. 40, no. 12, at 3043 (December 1992);
Q. Huo et al., “A Study of On-Line Quasi-Bayes Adaptation for CDHMM-Based Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 2, at 705 (1996);
A. Sankar et al., “An Experimental Study of Acoustic Adaptation Algorithms,” IEEE Trans. on Speech and Audio Processing, vol. 2, at 713 (1996);
L. Bahl et al., “Discriminative Training of Gaussian Mixture Models for Large Vocabulary Speech Recognition Systems,” IEEE Trans. on Speech and Audio Processing, vol. 2, at 613 (1996).
The approaches outlined in these references, and other prior approaches, have significant drawbacks and disadvantages. For example, the prior approaches are applied only in the context of frame-based speech recognition systems that use hidden Markov models. None of the prior approaches will work with a segment-based speech recognition system. A fundamental assumption of those methods is that the same acoustic features are used to match every phrase in the recognizer's lexicon. In a segment-based system, the segmentation process produces a segment network in which each hypothesis independently chooses an optimal path through that network. As a result, different hypotheses are scored against differing sequences of segment features, rather than all of them being scored relative to a common sequence of frame features.
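The difference noted above between frame-based and segment-based scoring can be illustrated with a toy segment network. The sketch below is offered only as an illustration under simplifying assumptions: each segment carries a single scalar feature, each phonetic unit has a single expected value, and the cost of a match is the absolute difference between the two. The names SEGMENTS, UNIT_MODELS, and best_path, the unit labels, and all values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Segment = Tuple[int, int]  # (start boundary, end boundary)

# Hypothetical segment network over time boundaries 0..3.
# Each candidate segment has its own feature value.
SEGMENTS: Dict[Segment, float] = {
    (0, 1): 1.0,
    (1, 2): 2.0,
    (2, 3): 3.0,
    (0, 2): 1.5,
    (1, 3): 2.5,
}

# Hypothetical expected feature value for each phonetic unit.
UNIT_MODELS: Dict[str, float] = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 1.5}


def best_path(units: Sequence[str], start: int,
              end: int) -> Optional[Tuple[float, List[Segment]]]:
    """Find the lowest-cost path from start to end that uses exactly one
    segment per unit. Returns (cost, segments), or None if no path exists."""
    if not units:
        return (0.0, []) if start == end else None
    best = None
    for (seg_start, seg_end), feature in SEGMENTS.items():
        if seg_start != start:
            continue
        rest = best_path(units[1:], seg_end, end)
        if rest is None:
            continue
        cost = abs(feature - UNIT_MODELS[units[0]]) + rest[0]
        if best is None or cost < best[0]:
            best = (cost, [(seg_start, seg_end)] + rest[1])
    return best


# Two competing hypotheses for the same utterance (boundaries 0 through 3).
HYP_THREE_UNITS = best_path(["A", "B", "C"], 0, 3)
HYP_TWO_UNITS = best_path(["D", "C"], 0, 3)
```

In this toy network the three-unit hypothesis is scored against segments (0, 1), (1, 2), and (2, 3), while the two-unit hypothesis is scored against segments (0, 2) and (2, 3). The two hypotheses therefore have no common feature sequence against which both are scored, which is the property that a method assuming shared frame features does not accommodate.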
In segment-based systems, alternate acoustic features are sometimes used. These secondary features are different from the primary features described above, and they also differ from one hypothesis to the next. There is a need for an approach that enables discrimination between acoustic units on the basis of such secondary features, for which the prior approaches likewise fail.
In addition, the prior approaches generally rely on manual human intervention to accomplish training or tuning. The prior approaches do not carry out discriminative training automatically based on utterances actually experienced by the system.
Based on the foregoing, there is a need for an automated approach for training an acoustic model based on information that relates to the specific application with which a speech recognition system is used.
There is a particular need for an approach for training an acoustic model in which a speech recognizer is trained to discriminate among phonetic units based on information about the particular application with which the speech recognizer is being used.