The present disclosure relates to customizing language models for speech-to-text recognition.
Language models are used to model a probability that a string of words in a given vocabulary will occur. Language models are used in many natural language processing applications, including automatic speech recognition, machine translation, and information retrieval.
A speech-to-text recognition system typically digitizes an audio signal into discrete samples. Those discrete samples are generally processed to provide a frequency domain representation of the original input audio signal. Using this frequency domain analysis, a recognition system maps the frequency domain information into phonemes. Phonemes are the phonetic sounds that are the basic building blocks used to create words in every spoken language. For example, written English uses an alphabet of 26 letters, whereas spoken English is typically described as having a larger inventory of phonemes, roughly 40 or more depending on the dialect.
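As a minimal sketch of the digitizing and frequency domain steps described above, the following Python example samples a pure tone and applies a naive discrete Fourier transform to locate its dominant frequency. The 8000 Hz sample rate, the 1000 Hz test tone, and the helper names `sample_tone` and `dft_magnitudes` are illustrative assumptions and do not come from the disclosure; a practical recognizer would typically analyze short windowed frames with a fast Fourier transform instead.

```python
import cmath
import math


def sample_tone(freq_hz, sample_rate_hz, num_samples):
    """Return discrete samples of a sine tone (a stand-in for digitized audio)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]


def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n_total = len(samples)
    mags = []
    for k in range(n_total // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_total)
                  for n, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


# Assumed values, chosen so the tone falls exactly on one frequency bin.
SAMPLE_RATE_HZ = 8000
samples = sample_tone(1000, SAMPLE_RATE_HZ, 64)

magnitudes = dft_magnitudes(samples)
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
# Convert the bin index back to a frequency in hertz.
peak_freq_hz = peak_bin * SAMPLE_RATE_HZ / len(samples)
print(peak_freq_hz)
```

The bin with the largest magnitude identifies the frequency content that a later stage would map to phonemes.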
The mapping yields a string of phonemes corresponding to the frequency domain representation of the original input signal. Speech detection processing resolves the phonemes into words using a concordance or a dictionary. In the case of homonyms, however, the phonemes alone may be ambiguous: on hearing the sound of “whale,” for instance, a listener may not know whether the intended word is the ocean mammal “whale” or the cry “wail”.
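The dictionary lookup and the resulting ambiguity can be sketched as follows. The `PRONUNCIATIONS` table, its ARPAbet-style phoneme symbols, and the `candidate_words` helper are hypothetical and serve only as an illustration; they are not taken from the disclosure or from any particular pronunciation dictionary.

```python
# Hypothetical pronunciation dictionary: phoneme sequence -> candidate words.
PRONUNCIATIONS = {
    ("W", "EY", "L"): ["whale", "wail"],
    ("B", "L", "UW"): ["blue", "blew"],
    ("S", "AO"): ["saw"],
}


def candidate_words(phonemes):
    """Return every word whose pronunciation matches the phoneme sequence."""
    return PRONUNCIATIONS.get(tuple(phonemes), [])


# One phoneme string resolves to two words, so the dictionary alone
# cannot decide which word was intended.
print(candidate_words(["W", "EY", "L"]))
```

A lookup that returns more than one candidate is the case in which additional context is needed to pick a word.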
Speech detection systems may have concordances of multi-token strings that are called “n-grams.” In an n-gram model, the probability of the last word in a string of n words is a conditional probability based on the preceding n−1 words in the string. For the homonym example above, a speech detection system can be used to determine whether it is more likely that a string of phonemes corresponds to “I saw a blue whale” or “I saw a blue wail” based on a calculated conditional probability that either “whale” or “wail” occurs in the context of “I saw a blue”.
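In symbols, the quantity described above and the resulting choice between the two candidate words can be written as below. The notation, with w_i for the i-th word and a hatted w for the selected word, is an assumption introduced here for illustration and is not notation used in the disclosure.

```latex
\[
  p(w_n \mid w_1, \dots, w_{n-1}),
  \qquad
  \hat{w} = \arg\max_{w \in \{\text{whale},\, \text{wail}\}}
            p(w \mid \text{I saw a blue})
\]
```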
The conditional probabilities for a given string are generally determined empirically, according to relative frequencies in a collection of text. For a string whose length is within the maximum n-gram order of the n-gram language model, the conditional probability is the probability stored in the language model for that n-gram, e.g., p(whale|I saw a blue) is the conditional probability stored in the language model for the 5-gram entry “I saw a blue whale”. Thus, a language model in a speech-to-text system provides a statistical likelihood that a sequence of phonemes corresponds to a given word using the context of the preceding tokens. The output from this process is a textual transcription of the original input signal.
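The relative-frequency estimate described above can be sketched in Python as follows. The four-sentence `CORPUS` and the function names are made up for illustration, and the estimate is deliberately unsmoothed; practical language models generally apply smoothing or back-off so that unseen n-grams do not receive a probability of exactly zero.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def conditional_probability(corpus_sentences, context, word):
    """Unsmoothed relative-frequency estimate of p(word | context)."""
    n = len(context) + 1
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus_sentences:
        for gram in ngrams(sentence.split(), n):
            ngram_counts[gram] += 1
            # Count the context only where some word follows it.
            context_counts[gram[:-1]] += 1
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # context never observed in the corpus
    return ngram_counts[context + (word,)] / context_counts[context]


# Made-up toy corpus, for illustration only.
CORPUS = [
    "I saw a blue whale",
    "I saw a blue whale today",
    "I saw a blue car",
    "I heard a loud wail",
]

context = ["I", "saw", "a", "blue"]
p_whale = conditional_probability(CORPUS, context, "whale")
p_wail = conditional_probability(CORPUS, context, "wail")
print(p_whale, p_wail)
```

In this toy corpus the context “I saw a blue” is followed by “whale” in two of its three occurrences and never by “wail”, so the estimate favors “whale”.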