Speech recognition refers to various techniques for converting spoken words to text. Such techniques generally employ three models: an acoustic model, a lexicon model, and a language model. Acoustic models operate at the level of the individual sounds made by speakers (known as phones) and attempt to model speech based on those sounds. Lexicon models operate at the level of words, and generally employ a pronunciation lexicon (i.e., a dictionary) to model how phones are combined to form words. Language models operate at the level of grammar and are based on the relationships between words. These models function more or less adequately when the speaker's accent is similar to the accent represented in the data used to train them.
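By way of illustration only, the role of a pronunciation lexicon can be sketched as a mapping from words to phone sequences. The words, the ARPAbet-style phone symbols, and the function name words_for_phones below are assumptions chosen for the example and are not taken from any particular recognizer.

```python
# Toy pronunciation lexicon: word -> phone sequence.
# The entries and phone symbols are illustrative assumptions.
LEXICON = {
    "cat": ("K", "AE", "T"),
    "cab": ("K", "AE", "B"),
    "bat": ("B", "AE", "T"),
}


def words_for_phones(phones, lexicon=LEXICON):
    """Return every word whose lexicon pronunciation exactly matches `phones`."""
    target = tuple(phones)
    return sorted(word for word, pron in lexicon.items() if pron == target)


# A pronunciation that matches the lexicon is found.
print(words_for_phones(["K", "AE", "T"]))
# A pronunciation with one substituted vowel matches no entry.
print(words_for_phones(["K", "EH", "T"]))
```

In this sketch the lookup requires an exact match, so a single substituted phone is enough for the word to go unrecognized.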
However, traditional speech recognition models generally fail when attempting to recognize speech from an accented speaker, because pronunciations that are inconsistent with the native pronunciation of a word are especially frequent in such speech. In particular, each of the models described above tends to produce high error rates when applied to accented speech. Acoustic models fail because the speaker produces one or more phones incorrectly, lexicon models fail because the speaker learned one or more words incorrectly, and language models fail because the speaker's grammar is incorrect. Past efforts to correct these problems have attempted to use pronunciation modeling to adapt the pronunciation dictionary. Generally, these efforts involve collecting large amounts of data from an accented speaker, labeling this data with the particular accent of the speaker, and changing the dictionary pronunciations according to the data collected. This process is referred to as pronunciation rewriting. Unfortunately, pronunciation rewriting relies on large amounts of input data to create a viable solution. Further, a significant shortcoming of traditional lexicon adaptations is their limited modeling context. Rewriting rules are often limited to operating at the phone level, and fail to take advantage of contextual information at higher levels (e.g., at the level of modeling units larger than a phone).
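A minimal sketch of phone-level pronunciation rewriting, under stated assumptions, is given below. The rule table, the example words, and the function names rewrite_pronunciation and adapt_lexicon are hypothetical; in the efforts described above, such rules would be derived from large amounts of accent-labeled data rather than written by hand.

```python
def rewrite_pronunciation(phones, rules):
    """Apply context-independent phone substitutions to one pronunciation."""
    return tuple(rules.get(phone, phone) for phone in phones)


def adapt_lexicon(lexicon, rules):
    """Return a new lexicon with every pronunciation rewritten by `rules`."""
    return {word: rewrite_pronunciation(pron, rules) for word, pron in lexicon.items()}


# Hypothetical base lexicon and hand-written accent rules (assumptions).
lexicon = {
    "this": ("DH", "IH", "S"),
    "sit": ("S", "IH", "T"),
}
rules = {"DH": "D", "IH": "IY"}

adapted = adapt_lexicon(lexicon, rules)
for word, pron in adapted.items():
    print(word, " ".join(pron))
```

Because each rule in this sketch looks at one phone in isolation, the substitution of IH is applied to every word that contains that phone, regardless of the neighboring phones or the word in which it occurs. This is the limited modeling context noted above.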