The invention relates generally to computer-based or software-based speech recognition systems, and relates more specifically to an approach for improving accuracy of a computer-based speech recognizer by modifying its pronunciation dictionary based on pattern definitions of alternate word pronunciations.
Many computer-based or software-based speech recognition systems use a pronunciation dictionary to identify particular words contained in received utterances. The term “utterance” is used herein to refer to one or more sounds generated either by humans or by machines. Examples of an utterance include, but are not limited to, a single sound, any two or more sounds, a single word or two or more words.
The pronunciation dictionary in a phoneme-based recognizer defines words in terms of sets of phonemes, in which each phoneme is one of a small set of speech sounds that are distinguished by the speakers of a particular language. For example, the word “pan” may be characterized by phonemes “p” (the hard “p” sound), “ah” (the short “a” sound), “n”. Phonemes are roughly equivalent to the pronunciation symbols that are used in textual dictionaries to aid the reader in determining how to pronounce a word.
The speech recognition system has a set of numeric data (“model”) for each of the phonemes. A word is modeled by assembling the models for each phoneme that makes up the word.
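By way of illustration only, the relationship between a dictionary entry and a word model may be sketched as follows. The names `PHONEME_MODELS`, `PRONUNCIATION_DICT`, and `word_model`, and the numeric values, are hypothetical placeholders and are not taken from any particular recognizer; an actual phoneme model would typically hold statistical parameters rather than two numbers.

```python
from typing import Dict, List, Tuple

# Hypothetical per-phoneme numeric data ("models"); values are placeholders.
PHONEME_MODELS: Dict[str, List[float]] = {
    "p": [0.1, 0.2],
    "ah": [0.3, 0.4],
    "n": [0.5, 0.6],
}

# Each word maps to one or more expected pronunciations (phoneme sequences).
PRONUNCIATION_DICT: Dict[str, List[Tuple[str, ...]]] = {
    "pan": [("p", "ah", "n")],
}


def word_model(word: str) -> List[List[float]]:
    """Assemble a word model from the models of the word's phonemes.

    Uses the first listed pronunciation of the word.
    """
    phonemes = PRONUNCIATION_DICT[word][0]
    return [PHONEME_MODELS[p] for p in phonemes]
```

In this sketch the word model is simply the ordered list of phoneme models, which mirrors the assembling step described above.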
In general, a pronunciation dictionary contains data that defines expected pronunciations of utterances. When an utterance is received, the received utterance, or at least a portion of the received utterance, is compared to the expected pronunciations contained in the pronunciation dictionary. An utterance is recognized when the received utterance, or portion thereof, matches the expected pronunciation contained in the pronunciation dictionary.
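The comparison described above may be sketched, under simplifying assumptions, as an exact lookup. The function `recognize` and its arguments are hypothetical; the sketch assumes the received utterance has already been reduced to a phoneme sequence, whereas a practical recognizer would score acoustic data against phoneme models rather than test for exact equality.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def recognize(
    received: Sequence[str],
    dictionary: Dict[str, List[Tuple[str, ...]]],
) -> Optional[str]:
    """Return the first word whose expected pronunciation matches `received`.

    Returns None when no expected pronunciation matches, i.e. the
    utterance is not recognized.
    """
    key = tuple(received)
    for word, pronunciations in dictionary.items():
        if key in pronunciations:
            return word
    return None
```

With a dictionary containing only the pronunciation ("p", "ah", "n") for the word "pan", a speaker who produces a different vowel is not recognized, which is the kind of mismatch discussed in the paragraphs that follow.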
One of the most important concerns with pronunciation dictionaries is to ensure that expected pronunciations of utterances defined by the pronunciation dictionary accurately reflect actual pronunciations of the utterances. If an actual pronunciation of a particular utterance does not match the expected pronunciation, the speech recognition system may fail to recognize the words spoken, and its pronunciation dictionary may be considered flawed.
Actual pronunciations of utterances can be misrepresented for a variety of reasons. For example, in fluent speech, some sounds may be systematically deleted or adjusted. An application program (“application”) that uses the speech recognition system may be installed across diverse geographic areas where users have different regional accents. The nature of the application may inherently cause repeated errors in recognition because words are used with unexpected pronunciations. Further, expected pronunciations tend to be somewhat user-dependent. Consequently, a change in the users of a particular application can adversely affect the accuracy of a speech recognition system. This is attributable to different speech characteristics of users, such as different intonations and stresses in pronunciation.
Conventionally, pronunciation dictionaries are updated manually to reflect changes in actual pronunciations of utterances in response to reported problems. When a change in an application or user prevents a speech recognition system from recognizing utterances, the problem is reported to the administrator of the speech recognition system. The administrator then identifies the problem utterances and manually updates the pronunciation dictionary to reflect the changes to the application or users.
Manually updating a pronunciation dictionary to reflect changes to an application or users has several significant drawbacks. These problems, and an approach that addresses them using automatic dictionary updating, are described in detail in co-pending application Ser. No. 09/344,164, filed on Jun. 24, 1999, entitled AUTOMATICALLY DETERMINING THE ACCURACY OF A PRONUNCIATION DICTIONARY IN A SPEECH RECOGNITION SYSTEM, in the name of inventor Etienne Barnard.
Although manual dictionary updating and automatic dictionary updating are useful, these approaches still have drawbacks that leave room for improvement. For example, these approaches do not include a mechanism whereby modifications to the dictionary can be generalized or characterized in terms of sound or word patterns. To that extent, they respond to the problem without recognizing the root cause of recognition errors, namely that alternate pronunciations are being used.
In addition, the prior approaches do not effectively improve recognition in the context of a particular or specific application that uses the speech recognition system. The context or vocabulary of a particular application may require speakers to use words that are not adequately or specifically corrected in the prior approaches.
Based on the foregoing, there is a clear need in this field for an improved speech recognition system that can adjust a pronunciation dictionary to account for recognition errors that occur in a particular application.
There is also a need for an improved speech recognition system in which alternate pronunciations are addressed based on generalized sound patterns rather than specific sound differences.
Other needs will become apparent from the following description.
The foregoing needs, and other needs and objects that will become apparent from the following description, are achieved by the present invention, which comprises, in one aspect, an approach for automatically modifying a pronunciation dictionary in a speech recognition system based on patterns of alternate pronunciations. A representation of the pronunciation dictionary, such as a plurality of dynamically linked phoneme values, is obtained. One or more pattern definitions are obtained. The pattern definitions specify zero or more phonemes to be substituted for zero or more phonemes of all words in the pronunciation dictionary. The linked phoneme values are modified by adding, for each path of each word, alternate paths that use each of the substitute phoneme strings according to the pattern definitions, thereby creating an expanded set of dynamically linked phoneme values.
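The expansion step may be illustrated by the following minimal sketch. It is not the linked representation described above: for brevity it enumerates alternate paths as flat phoneme tuples. The names `Pattern` and `expand`, and the example phoneme "ae", are hypothetical. As a further simplifying assumption, patterns with an empty target (pure insertions) are skipped, although a substitute may be empty (a deletion).

```python
from typing import List, Set, Tuple

Phonemes = Tuple[str, ...]
Pattern = Tuple[Phonemes, Phonemes]  # (target phonemes, substitute phonemes)


def expand(pron: Phonemes, patterns: List[Pattern]) -> Set[Phonemes]:
    """Return `pron` plus every alternate path produced by the patterns.

    At each position the original phoneme may be kept, or any pattern
    whose target matches at that position may be applied instead.
    """
    results: Set[Phonemes] = set()

    def walk(i: int, prefix: Phonemes) -> None:
        if i == len(pron):
            results.add(prefix)
            return
        # Path that keeps the original phoneme at position i.
        walk(i + 1, prefix + (pron[i],))
        # Alternate paths that substitute a matching target.
        for target, substitute in patterns:
            n = len(target)
            if n > 0 and pron[i:i + n] == target:
                walk(i + n, prefix + substitute)

    walk(0, ())
    return results
```

For the pronunciation ("p", "ah", "n") and two patterns, one replacing "ah" with "ae" and one deleting "n", this sketch yields four paths. Enumeration of this kind grows quickly with the number of matching sites, which is one reason a linked representation that shares common sub-paths may be preferable.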
One or more example pronunciations of a particular word are then obtained. One or more best paths through the expanded set of phoneme values are determined for each of the example pronunciations. For each of the best paths, an alternate word pronunciation is constructed by converting each path into a pronunciation using the format of the pronunciation dictionary. The pronunciation dictionary is modified by adding each alternate word pronunciation. As a result, a modified pronunciation dictionary is created that accounts for alternate pronunciations as actually spoken by users of a particular speech recognition application.
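The selection of best paths and the resulting dictionary update may be sketched as follows. This is an illustration under stated assumptions, not the scoring used by any actual recognizer: each example pronunciation is assumed to be available as a phoneme sequence, and phoneme edit distance stands in for acoustic scoring. The names `edit_distance`, `best_path`, and `add_alternates` are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Phonemes = Tuple[str, ...]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def best_path(example: Sequence[str], candidates: Iterable[Phonemes]) -> Phonemes:
    """Pick the candidate path closest to the example (ties broken by sort order)."""
    return min(sorted(candidates), key=lambda c: edit_distance(example, c))


def add_alternates(
    dictionary: Dict[str, List[Phonemes]],
    word: str,
    examples: Iterable[Sequence[str]],
    candidates: Iterable[Phonemes],
) -> None:
    """Add the best path for each example as an alternate pronunciation of `word`."""
    pool = list(candidates)
    for example in examples:
        best = best_path(example, pool)
        if best not in dictionary[word]:
            dictionary[word].append(best)
```

In this sketch the candidate paths are passed in directly, for instance as the output of a pattern expansion step, and a best path that is already present in the dictionary is not added a second time.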