Automatic speech recognition (ASR), as discussed herein, is the transcription, by machine, of audio speech into text. Among the various approaches to automatic speech recognition are statistically-based speech recognition techniques, often including acoustic modeling and language modeling.
An acoustic model generally is trained to analyze acoustic features of an input speech signal and generate one or more hypotheses as to the sequence of sound units that the signal contains. Depending on the acoustic model being used, the sound units may be of different lengths or levels in the hierarchy of sound sequences that make up a language. For example, some acoustic models may model words as units, and may generate one or more hypotheses of sequences of words that could match the acoustics of the speech signal. Other acoustic models may model sub-word units such as phonemes, diphones, triphones, or syllables, and may generate one or more hypotheses of sequences of these sub-word units that could match the acoustics of the speech signal. Popular types of acoustic models today include hidden Markov models (HMMs) and neural networks.
A language model generally is trained to work with an acoustic model to determine which of the candidate word sequences that could match the acoustics of the speech signal is most likely to be what the speaker actually said. For example, “Hello, how are you?” and “Hell low ha why uh!” might both match the acoustics of a particular speech signal in the English language, but it is much more likely that the speaker said the former sequence of words than the latter. Statistical language models are generally trained by being exposed to large corpora of text and observing the occurrence frequencies of various possible sequences of words in those training corpora. The probabilities of different word sequences learned from the training data are then applied to score the likelihood of different candidate word sequences hypothesized for an input speech signal. In this sense, statistical language models are different from fixed grammars, which are typically made up of hard-coded rules regarding which word sequences are allowable for speech recognition in a particular application. Since a statistical language model (SLM) generally assigns a likelihood or probability to a candidate word sequence based on known word sequences that have been encountered before, SLMs are typically more useful than fixed grammars for recognition of free-speech inputs, in applications where there are few, if any, restrictions on what the speaker might say. A popular form of SLM today is the N-gram language model, which approximates the probability of a longer word sequence as a combination of the probabilities of each word in the sequence in the context of the preceding N−1 words.
For example, a trigram SLM might approximate the probability of “Hello, how are you?” as P(Hello|<s>,<s>) · P(how|<s>,Hello) · P(are|Hello,how) · P(you|how,are) · P(</s>|are,you), where <s> and </s> refer to sentence beginning and sentence end, respectively, and P(w3|w1,w2) denotes the probability of encountering word w3 next after encountering word w1 followed by word w2.
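The trigram approximation above can be illustrated with a minimal Python sketch. It is not drawn from any particular recognizer: the toy corpus, the function names (`train_trigram_counts`, `trigram_prob`, `sentence_prob`), the `<unk>` token for unseen words, and the add-alpha smoothing are all assumptions made for illustration. The sketch counts trigrams in a small training corpus and then multiplies the per-word conditional probabilities to score two candidate word sequences.

```python
from collections import Counter

BOS, EOS, UNK = "<s>", "</s>", "<unk>"  # sentence start, sentence end, unseen word

def train_trigram_counts(sentences):
    """Count trigrams (w1, w2, w3) and their two-word contexts (w1, w2)."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = [BOS, BOS] + list(words) + [EOS]
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            ctx[(padded[i - 2], padded[i - 1])] += 1
    return tri, ctx

def trigram_prob(tri, ctx, vocab, w1, w2, w3, alpha=1.0):
    """P(w3 | w1, w2) with add-alpha smoothing (an assumed, simple choice)."""
    return (tri[(w1, w2, w3)] + alpha) / (ctx[(w1, w2)] + alpha * len(vocab))

def sentence_prob(tri, ctx, vocab, words, alpha=1.0):
    """Product of trigram probabilities over the padded word sequence."""
    # Map words never seen in training to the <unk> token.
    mapped = [w if w in vocab else UNK for w in words]
    padded = [BOS, BOS] + mapped + [EOS]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(tri, ctx, vocab, padded[i - 2], padded[i - 1], padded[i], alpha)
    return p

# Hypothetical toy training corpus (lower-cased, punctuation removed).
corpus = [
    ["hello", "how", "are", "you"],
    ["how", "are", "you"],
    ["hello", "there"],
]
vocab = {w for s in corpus for w in s} | {EOS, UNK}
tri, ctx = train_trigram_counts(corpus)

likely = sentence_prob(tri, ctx, vocab, ["hello", "how", "are", "you"])
unlikely = sentence_prob(tri, ctx, vocab, ["hell", "low", "ha", "why", "uh"])
print(likely, unlikely)
```

With this toy corpus, the word sequence resembling the training data receives the higher score of the two candidates. A practical SLM would be trained on far larger corpora, would typically use a more refined smoothing or back-off scheme, and would usually accumulate log probabilities rather than a raw product to avoid numerical underflow.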