This invention relates to a method and apparatus for updating the speech models used in speech recognition systems and, more particularly, to a technique for creating new speech models, or retraining existing ones, to recognize speech from a class of users whose speech differs so much from the language modeled by the system that speaker adaptation is not feasible. In this document, the term “speech models” refers collectively to those components of a speech recognition system that reflect the language which the system models, including, for example, acoustic speech models, pronunciation entries, and grammar models. In a preferred implementation of the present invention, speech models stored in a central repository, or database, are updated and then redistributed to participating client sites where speech recognition is carried out. In another embodiment, the repository may be co-located with a speech server, which in turn recognizes speech sent to it as either processed speech (i.e., derived speech feature vectors) or unprocessed speech.
Speech recognition systems convert the utterance of a user into digital form and then process the digitized speech in accordance with known algorithms to recognize the words and/or phrases spoken by the user. For example, speech recognition systems have been disclosed wherein the digitized speech is processed to extract sequences of feature sets that describe corresponding speech passages. The speech passage is then recognized by matching the corresponding feature set sequence with the optimal model sequence.
The process of selecting the models to compare to the features derived from the utterance is constrained by prior rules which limit the sets of models and their corresponding configurations (patterns) that are used. It is these selected models and patterns that are used to recognize the user's speech. These rules include grammar rules and lexica by which the statistically most probable speech model is selected to identify the words spoken by the user. A conventional modeling approach that has been used successfully is known as the hidden Markov model (HMM). While HMM-based modeling has proven successful, it is difficult to employ a universal set of speech models successfully. Speech characteristics vary between speaker groups, within speaker groups and even for single individuals (due to health, stress, time, etc.), to the extent that a preset speech model may need adaptation or retraining to best match the utterances of a particular user.
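By way of illustration only, selecting the statistically most probable model sequence for a sequence of features can be sketched with Viterbi decoding over a discrete-observation HMM. The sketch assumes features already quantized to symbols; the state names, symbols and probabilities are hypothetical and are not taken from any particular recognition system.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most probable path.

    Assumes a non-empty observation sequence and strictly positive probabilities.
    """
    # Score of the best path ending in each state after the first observation.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev, best, step = best, {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            score, source = max((prev[r] + log_trans[r][s], r) for r in states)
            best[s] = score + log_emit[s][obs]
            step[s] = source
        backpointers.append(step)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return best[last], path

def logs(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Hypothetical toy model: two subword states, two quantized feature symbols.
states = ["a", "b"]
log_start = logs({"a": 0.6, "b": 0.4})
log_trans = logs({"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}})
log_emit = logs({"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}})

score, path = viterbi(["x", "x", "y"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

For this toy model the decoded path is a, a, b. In a full recognizer the grammar rules and lexicon would restrict which transitions are permitted; here that constraint is represented only by the transition table.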
Such correction and retraining are relatively simple when adapting speech models to a user whose speech matches the data used to train the speech recognition system, because the speech of the user and that of the training group have certain common characteristics. Hence, relatively small modifications to a preset speech model to adapt from those common characteristics are readily achievable, but large deviations are not. Various accents, inflections, pathological speech or other speech features contained in the utterances of an individual outside the training group are sufficiently different from the preset speech models as to inhibit successful adaptation or retraining of those models. For example, the acoustic subwords pronounced by users whose primary language is not the system target language are quite different from the target language acoustic subwords to which the speech models of typical speech recognition systems are trained. In general, subwords pronounced by “non-native” speakers exhibit a transitional hybrid between the primary language subwords of those users and the target language subwords. In another example, brain injury, or injury or malformation of the physical speech production mechanism, can significantly impair a speaker's ability to pronounce certain acoustic subwords in conformance with the speaking population at large. A significant subgroup of this speech-impaired population would require the speech models such a system would create.
Ideally, speech recognition systems should be trained with data that closely models the speech of the targeted speaker group, rather than adapted from a more general speech model. This is because it is a simpler, more efficient task, having a higher probability of success, to train uniquely designed speech models to recognize the utterances of such users than to correct and retrain preset system target-language models. However, the creation of uniquely designed speech models is time-consuming in and of itself and requires a large library of speech data, and subsequently of models, that are particularly representative of, and adapted to, several different speaker classes. Such data-dependency poses a problem, because for HMM's to be speaker independent, each HMM must be trained with a broad representation of users from that class. But, if an HMM is trained with overly broad data, as would be the case for adapting to the speech of the two groups of speakers exemplified above, the system will tend to misclassify the features derived from that speech.
This problem can be overcome by training HMM's less broadly, and then adapting those HMM's to the utterances of a specific speaker. While this approach would reduce the error rate (i.e. the rate of misclassification) for some speakers, it is of limited utility for certain classes of speakers, such as users whose language is not a good match with the system target language.
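The limited reach of speaker adaptation can be illustrated with a minimal sketch. It assumes maximum a posteriori (MAP) adaptation of a single Gaussian mean over a scalar feature, which is one common adaptation technique and is not named in the text above. The function name, the prior weight `tau` and the sample values are all hypothetical.

```python
def map_adapt_mean(prior_mean, samples, tau=10.0):
    """MAP estimate of a Gaussian mean (illustrative, scalar feature only).

    tau is a hypothetical prior weight: the larger it is, the more adaptation
    data is needed to pull the mean away from the preset model.
    """
    n = len(samples)
    if n == 0:
        return prior_mean  # no adaptation data: keep the preset model
    sample_mean = sum(samples) / n
    # Weighted blend of the preset mean and the speaker's observed mean.
    return (tau * prior_mean + n * sample_mean) / (tau + n)

preset = 0.0
near_speaker = [1.0] * 5   # speech close to the training population
far_speaker = [6.0] * 5    # speech far from the training population

near_adapted = map_adapt_mean(preset, near_speaker)
far_adapted = map_adapt_mean(preset, far_speaker)

# Remaining mismatch between the adapted model and the speaker's own mean.
print(1.0 - near_adapted, 6.0 - far_adapted)
```

With the same five adaptation samples, the adapted mean covers the same fraction of the distance in both cases, so the speaker who starts farther from the preset model is left with a proportionally larger residual mismatch. Under these assumptions, that residual is what models trained directly on the speaker's class would avoid.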
Another approach is to train many narrow versions of HMM's, each for a particular class of users. These versions may be delineated in accordance with various factors, such as the primary language of the user, a particular geographic region of the user's dialect, the gender of the user, his or her age, and so on. When combined with speaker adaptation, that is, the process of adapting an HMM to best match the utterances of a particular speaker, this approach has the potential to produce speech models of the highest accuracy. However, since there are so many actual and potential classes of users, a very large database of training data (and subsequently of speech models) would be needed. In addition, since spoken language itself is a dynamic phenomenon, the system target language speech models (lexicon, acoustic models, grammar rules, etc.) and sound system change over time to reflect a dynamic speaking population. Consequently, the library of narrow HMM versions would have to be corrected and retrained continually in order to reflect those dynamic aspects of the speaking population at large. This would suggest a repository to serve as a centralized point for the collection of speech data, the training of new models and the redistribution of these improved models to participating users, i.e. a kind of “language mirror”.
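As a rough sketch of such a centralized repository, the following keeps one versioned model set per speaker class and falls back to a broader class when no narrow one exists. Every name here (`SpeakerClass`, `ModelRepository`, `publish`, `fetch`) is hypothetical, the model payloads are placeholder strings, and only two of the delineating factors mentioned above are modeled.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SpeakerClass:
    """Hypothetical key for a class of users (frozen, so it is hashable)."""
    primary_language: str
    region: str = "any"

@dataclass
class ModelRepository:
    """Hypothetical central store: SpeakerClass -> (version, model payload)."""
    models: dict = field(default_factory=dict)

    def publish(self, speaker_class, payload):
        # Retraining replaces the stored models and bumps the version number.
        version = self.models.get(speaker_class, (0, None))[0] + 1
        self.models[speaker_class] = (version, payload)
        return version

    def fetch(self, speaker_class, fallback):
        # Try the exact class, then the language-wide class, then the fallback.
        language_wide = SpeakerClass(speaker_class.primary_language)
        for candidate in (speaker_class, language_wide, fallback):
            if candidate in self.models:
                version, payload = self.models[candidate]
                return candidate, version, payload
        raise KeyError("no models available for this speaker class")

repo = ModelRepository()
general = SpeakerClass("en")
repo.publish(general, "general-models")
repo.publish(SpeakerClass("es"), "es-speaker-models")
latest = repo.publish(SpeakerClass("es"), "es-speaker-models-retrained")

# A client site asks for the narrowest models available for its user.
matched, version, payload = repo.fetch(SpeakerClass("es", "mexico"), general)
print(matched, version, payload)
```

A client that caches the returned version number could compare it against the repository on a later request to decide whether redistribution is needed; that comparison is left out of the sketch.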