Widespread advances in technology are driving significant improvements in computing systems. Faster processing and larger memory capacities facilitate more robust software implementations. For example, speech recognition for client-based programs, operating systems, and other related products strives to offer a more user-friendly interface so that a user can correct recognition results or add new words and phrases (e.g., for customized interaction). However, technology still lags in taking advantage of user feedback to make the system more adaptive and to provide a better fit to user input.
In automatic speech recognition, and more specifically large-vocabulary speech recognition, language models play a very important role in predicting or verifying the words voiced by the user. Conventionally, a widely adopted model for speech recognition is the statistical language model (SLM), which is a maximum likelihood (ML) estimate of the conditional probability of a word given its context history, computed over a sufficiently large corpus. However, such a model has several limitations. Firstly, the model is trained separately as a single component of the speech recognition (SR) system and does not use any information or feedback from the acoustic models, the lexicon, or the recognizer (speech recognition engine). In other words, the potential to minimize recognition errors might not be fully explored when using this type of model.
Secondly, the model is not easy to adapt to a new context because it needs enough data to support the parameter estimation. Several revised language model (LM) adaptation algorithms based on ML estimation have been proposed to address these shortcomings, but they do not use a criterion that is directly linked to recognition accuracy. Thus, again, the potential for the LM algorithm to learn from the target scenario is not fully realized.
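By way of illustration only, the ML estimation referred to above can be sketched for the simplest case of a bigram model, where the context history is a single preceding word and the estimate is count(history, word) / count(history). The function name `train_bigram_ml`, the sentence markers, and the toy corpus are assumptions made for this sketch and do not come from any particular recognizer.

```python
from collections import Counter


def train_bigram_ml(sentences):
    """Return prob(word, prev): the ML estimate count(prev, word) / count(prev)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # "<s>" and "</s>" are assumed sentence boundary markers.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        if history_counts[prev] == 0:
            return 0.0  # history never observed in the corpus
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


# Toy corpus (hypothetical data for illustration).
corpus = ["call john smith", "call john doe", "call mary smith"]
p = train_bigram_ml(corpus)

print(p("john", "call"))  # seen twice out of three "call" histories
print(p("jon", "call"))   # never seen, so the ML estimate is zero
```

In this sketch the estimate depends only on corpus counts: a word sequence absent from the training text receives zero probability, and nothing from the acoustic models or from recognition errors enters the computation, which reflects the two limitations noted above.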
Context-free grammar (CFG) is also widely used in dialog systems based on speech recognition. A CFG allows several phrases (composed of words or terms) to be active at the beginning of the dialog. The recognizer then provides the best candidate(s) after aligning the user's speech against those phrases in the CFG. Typically, a weight can be assigned to each phrase (or term) within the CFG; in practice, however, the weights are assigned equally or arbitrarily, regardless of what the term is and how similar it sounds to another term. Accordingly, inappropriate weight values can limit the ability of the speech recognizer to provide reasonable results. Additionally, there is no technique that adjusts the weighting values to better fit a target scenario or target speaker, even if speaker-adapted acoustic models could be provided.
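The effect of phrase weights can be sketched, under simplifying assumptions, as a ranking in which each active phrase's acoustic log score is combined with the log of its normalized weight. The function `rank_phrases`, the phrases, the acoustic scores, and the weight values below are all hypothetical and chosen only to illustrate the point; an actual recognizer combines scores in its own way.

```python
import math


def rank_phrases(acoustic_log_scores, weights):
    """Rank phrases by acoustic log score plus log of the normalized weight."""
    total = sum(weights.values())
    combined = {
        phrase: acoustic_log_scores[phrase] + math.log(weights[phrase] / total)
        for phrase in weights
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical acoustic log scores for one utterance; the first two
# phrases sound alike, so their scores are close.
acoustic = {"call john": -10.2, "call joan": -10.0, "call mary": -25.0}

# Equal weights: the ranking is decided by the acoustic scores alone.
equal = {phrase: 1.0 for phrase in acoustic}

# Weights adjusted toward a target speaker who usually says "call john"
# (illustrative values, not derived from any data).
adjusted = {"call john": 5.0, "call joan": 1.0, "call mary": 1.0}

print(rank_phrases(acoustic, equal))
print(rank_phrases(acoustic, adjusted))
```

With equal weights the acoustically similar phrase "call joan" is ranked first; with the adjusted weights "call john" moves to the top. This illustrates why weights assigned without regard to the term or to similar-sounding alternatives can limit recognition results.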