Automatic Speech Recognition (“ASR”) systems are designed to convert an audio message containing speech into text. Recognition accuracy for a particular utterance can vary based on many factors, including the audio fidelity of the recorded speech, the correctness of the speaker's pronunciation, and the like. Because these factors cause recognition accuracy to vary continuously, a particular utterance can yield several possible transcriptions.
Language models (“LMs”), which may include hierarchical language models (“HLMs”), statistical language models (“SLMs”), grammars, and the like, assign probabilities to sequences of words by means of a probability distribution. They attempt to capture the properties of a language so as to predict the next word in a speech sequence. LMs are used in conjunction with acoustic models (“AMs”) to convert dictated words to transcribed text. The current state of the art for both creating and updating AMs and LMs requires speech scientists to manually process hundreds to thousands of hours of spoken phrases or words in order to build AM and LM databases containing phonemes, all of the possible words within a spoken language, and their statistical interrelationships. ASR engines then compare an audio fingerprint against the AMs and LMs with the goal of obtaining a statistically significant match between the spoken audio and its textual representation. This process is expensive because a great deal of engineering time is required to generate AMs and LMs and to keep them updated as languages evolve and newly coined words enter the common lexicon.
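By way of illustration only, the following minimal sketch shows how a statistical language model may assign a probability to a sequence of words and predict the next word. It assumes a bigram model with add-one smoothing trained on a toy corpus; the function names (train_bigram, bigram_prob, sequence_log_prob, predict_next) and the corpus are hypothetical and are not taken from any particular ASR engine.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigram contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, vocab

def bigram_prob(prev, word, model):
    """P(word | prev) with add-one smoothing (an assumed, simple choice)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sequence_log_prob(sentence, model):
    """Log-probability of a whole word sequence under the bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(math.log(bigram_prob(p, w, model))
               for p, w in zip(tokens, tokens[1:]))

def predict_next(prev, model):
    """Most probable next word after prev (ties broken alphabetically)."""
    _, _, vocab = model
    candidates = sorted(vocab - {"<s>"})
    return max(candidates, key=lambda w: bigram_prob(prev, w, model))

# Hypothetical toy corpus; a real LM is trained on far more text.
corpus = ["please recognize speech", "recognize speech quickly", "a nice beach"]
model = train_bigram(corpus)

# Rank two candidate transcriptions of the same utterance by LM score.
for candidate in ["recognize speech", "wreck a nice beach"]:
    print(candidate, sequence_log_prob(candidate, model))
print(predict_next("recognize", model))
```

In an ASR engine, a score of this kind would typically be combined with an acoustic model score to choose among the several possible transcriptions of an utterance. Words absent from the training corpus receive only the smoothing mass here, which illustrates why an LM must be updated as new words come into use.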
Thus, a need exists for an automated, less labor-intensive approach for generating and updating LMs for use in ASR systems.