Automated speech recognition (ASR) systems are used in many industries to convert spoken words into text. For example, ASR systems are used to automatically translate speech between languages, control computers, and convert voicemails into text. An ASR system receives as input an analog audio signal, a digital audio signal, or other data representing audio, and produces a text string representing the spoken words that it recognizes in the input. ASR systems generally perform this conversion by splitting an audio signal into small segments and matching those segments to phonemes of the language being recognized. The resulting phonemes are then combined to identify words.
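The segment-and-match flow described above can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the two features (mean absolute amplitude and zero-crossing rate), the phoneme templates, the lexicon, and the function names are hypothetical, and nearest-template matching stands in for the statistical acoustic and language models that practical ASR systems typically use.

```python
from itertools import groupby

SEGMENT_LEN = 4  # samples per segment (toy value; real systems use longer windows)

# Hypothetical phoneme templates: (mean absolute amplitude, zero-crossing rate).
PHONEME_TEMPLATES = {
    "k": (1.0, 0.0),
    "ae": (0.5, 1.0),
    "t": (0.1, 1 / 3),
}

# Hypothetical pronunciation lexicon mapping phoneme sequences to words.
LEXICON = {
    ("k", "ae", "t"): "cat",
    ("ae", "t"): "at",
}


def split_into_segments(samples, segment_len=SEGMENT_LEN):
    """Split the signal into consecutive, non-overlapping fixed-length segments."""
    last_start = len(samples) - segment_len
    return [samples[i:i + segment_len] for i in range(0, last_start + 1, segment_len)]


def features(segment):
    """Compute a toy feature vector for one segment."""
    energy = sum(abs(s) for s in segment) / len(segment)
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    return (energy, crossings / (len(segment) - 1))


def nearest_phoneme(feature):
    """Return the phoneme whose template is closest in squared Euclidean distance."""
    def distance(name):
        return sum((f - t) ** 2 for f, t in zip(feature, PHONEME_TEMPLATES[name]))
    return min(PHONEME_TEMPLATES, key=distance)


def recognize(samples):
    """Label each segment with a phoneme, merge repeats, and look up the word."""
    labels = [nearest_phoneme(features(seg)) for seg in split_into_segments(samples)]
    phonemes = tuple(label for label, _ in groupby(labels))  # collapse repeated labels
    return LEXICON.get(phonemes)  # None if the sequence is not in the lexicon


# Synthetic signal: two "k" segments, three "ae" segments, one "t" segment.
signal = (
    [1.0, 1.0, 1.0, 1.0] * 2
    + [0.5, -0.5, 0.5, -0.5] * 3
    + [0.1, 0.1, -0.1, -0.1]
)
print(recognize(signal))  # prints "cat"
```

Because each segment is matched independently against fixed templates, any distortion of the input (for example, added noise or a shifted speaking rate) can move a segment closer to the wrong template, which is one way to see why recognition degrades on non-ideal audio.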
From an ideal audio source, some ASR systems can recognize spoken words with an accuracy approaching 99%. However, various conditions or characteristics associated with an audio source can reduce an ASR system's recognition rate. For example, an ASR system may fail to recognize a word because of background noise or because of a speaker's strong accent. While some applications can tolerate an ASR system that occasionally misrecognizes a word, in others, such as military, health care, or communications applications, a misrecognized word can cause serious harm.
Existing ASR systems employ different algorithms to compensate for suboptimal audio sources or input. As a result, some ASR systems perform better than others under certain conditions. For example, a first ASR system might recognize words with high accuracy from an audio sample in which a person speaks quickly, while a second ASR system recognizes few words from the same audio sample. However, the second ASR system might recognize more words than the first ASR system when the person speaks with an accent. Unfortunately, despite years of research and development, no single ASR system is dynamic and sophisticated enough to achieve near-perfect recognition rates under suboptimal conditions.