Automatic speech recognition (ASR), as discussed herein, is the transcription, by machine, of audio speech into text. Among the various approaches to automatic speech recognition are statistically based speech recognition techniques, often including acoustic modeling and language modeling. An acoustic model generally is trained to analyze acoustic features of an input speech signal and generate one or more hypotheses as to the sequence of sound units that the signal contains. Popular types of acoustic models today include hidden Markov models (HMMs) and neural networks. A language model generally is trained to work with an acoustic model to determine which of the candidate word sequences that could match the acoustics of the speech signal are most likely to be what the speaker actually said. Statistical language models (SLMs) are generally trained by being exposed to large corpora of text and observing the occurrence frequencies of various possible sequences of words (and/or other suitably defined tokens) in those training corpora. The probabilities of different word sequences learned from the training data are then applied to score the likelihood of different candidate word sequences hypothesized for an input speech signal. A popular form of SLM today is the N-gram language model, which approximates the probability of a longer word sequence as a combination (typically a product) of the probabilities of each word in the sequence, each conditioned on the preceding N−1 words.
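By way of illustration only, the N-gram scoring described above may be sketched for the case N=2 (a bigram model). The sketch below is not drawn from any particular ASR system; the toy corpus, the function names (`train_bigram_counts`, `bigram_prob`, `sequence_log_prob`), the sentence-boundary tokens, and the use of add-alpha smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence-boundary tokens


def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = [START] + tokens + [END]
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def bigram_prob(prev, word, context_counts, bigram_counts, vocab_size, alpha=1.0):
    """Estimate P(word | prev) with add-alpha smoothing (an assumed choice)."""
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = context_counts[prev] + alpha * vocab_size
    return numerator / denominator


def sequence_log_prob(tokens, context_counts, bigram_counts, vocab_size):
    """Sum log P(word | previous word) over the padded sequence."""
    padded = [START] + tokens + [END]
    return sum(
        math.log(bigram_prob(prev, word, context_counts, bigram_counts, vocab_size))
        for prev, word in zip(padded, padded[1:])
    )


# Toy training corpus (assumed data, far smaller than any real corpus).
corpus = [
    ["recognize", "speech"],
    ["recognize", "speech", "today"],
    ["a", "nice", "beach"],
]
# Predictable tokens: every training word plus the end-of-sentence token.
vocab = {word for sentence in corpus for word in sentence} | {END}
context_counts, bigram_counts = train_bigram_counts(corpus)

# Two candidate word sequences, as might be hypothesized for one utterance.
score_a = sequence_log_prob(["recognize", "speech"], context_counts, bigram_counts, len(vocab))
score_b = sequence_log_prob(["recognize", "beach"], context_counts, bigram_counts, len(vocab))
best = "recognize speech" if score_a > score_b else "recognize beach"
```

In this sketch, the candidate whose word pairs were observed more often in the training text receives the higher log-probability. A practical system would generally combine such a language model score with an acoustic model score rather than use it alone.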
ASR is useful in a variety of applications, including dictation software that recognizes user speech and outputs the corresponding automatically transcribed text. A typical dictation application may output the transcribed text of the dictated speech to a visual display for the user's review, often in near real-time while the user is in the process of dictating a passage or document. For example, a user may dictate a portion of a passage, the dictation application may process the dictated speech by ASR and output the corresponding transcribed text, and the user may continue to dictate the next portion of the same passage, which may subsequently be processed, transcribed, and output. Alternatively or additionally, some dictation applications may output text transcriptions via one or more other media, such as by printing on a physical substrate such as paper, by transmitting the text transcription to a remote destination, or by producing non-visual text output such as Braille output, etc.