Speech recognition systems, or automatic speech recognizers, have become increasingly important as more and more computer-based devices use speech recognition to receive commands from a user in order to perform some action, to convert speech into text for dictation applications, or even to hold conversations with a user where information is exchanged in one or both directions. Such systems may be speaker-dependent, where the system is trained by having the user repeat words, or speaker-independent, where words spoken by anyone may be recognized immediately. Speech recognition is now considered a fundamental part of mobile computing devices. Some small vocabulary systems that use small language models (LMs) may be configured to understand a fixed set of single word commands or short phrases, such as a mobile phone that understands the terms “call” or “answer”, or an exercise wrist-band that understands the word “start” to start a timer, for example. These may be referred to as command and control (C&C) systems. Other systems may have very large vocabularies and use statistical language models (SLMs), such as for dictation or voice activated search engines found on smart phones.
The conventional automatic speech recognition (ASR) system receives audio data from an audio signal with human speech and then constructs phonemes from the sounds extracted from that signal. Words are then constructed from the phonemes, and word sequences or transcriptions are built from the words until one or more output transcriptions are developed. Thereafter, a confidence score is generated for each transcription and is used to determine whether the output transcription is accurate and should be used or presented to a user, usually by comparing the confidence score to a threshold. The generated confidence scores, however, can often be uninformative, especially with small vocabularies when very few alternative transcriptions are being considered. This can result in errors where the system presents the wrong words to the user during dictation, for example, or misunderstands or cannot understand the user's inquiry or command when the computer is supposed to perform an action in response to the user's language.
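As an illustration of why such scores can be uninformative, the following minimal Python sketch assumes one possible style of confidence in which each hypothesis score is normalized over the list of alternative transcriptions. The function names, the threshold of 0.7, and the log-scores are hypothetical values chosen for the example and are not taken from any particular recognizer.

```python
import math


def nbest_confidences(log_scores):
    """Normalize hypothesis log-scores over the list of alternatives (softmax).

    Assumption for illustration: confidence is the share of total score mass
    held by each hypothesis among the alternatives being considered.
    """
    peak = max(log_scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def accept(confidence, threshold=0.7):
    """Accept a transcription when its confidence meets the threshold."""
    return confidence >= threshold


# Large vocabulary: several close competitors share the score mass.
dictation = nbest_confidences([-12.0, -12.3, -12.9, -14.0])

# Small vocabulary: a very poor match, but it is the only alternative,
# so normalization gives it full confidence anyway.
command = nbest_confidences([-95.0])

print("dictation best:", round(dictation[0], 3), "accepted:", accept(dictation[0]))
print("command best:", command[0], "accepted:", accept(command[0]))
```

Under these assumptions, the lone small-vocabulary hypothesis passes the threshold regardless of how poorly it matched the audio, because there are no competing alternatives against which to normalize, while the better-scoring dictation hypothesis is rejected.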