1. Field
The technology of the present application relates generally to speech recognition systems, and more particularly, to apparatuses and methods to create and use a minimally complete user specific language model.
2. Background
Early speech to text engines operated on a theory of pattern matching. Generally, these machines would record utterances spoken by a person, convert the audio into a sequence of possible phonemes, and then find a sequence of words that is allowed by the pattern and which is the closest, or most likely, match to the sequence of possible phonemes. For example, a person's utterance of “cat” provides a sequence of phonemes. These phonemes can be matched to reference phonetic pronunciation of the word “cat”. If the match is exact or close (according to some algorithm), the utterance is deemed to match “cat”; otherwise, it is a so-called “no-match”. Thus, the pattern matching speech recognition machine converts the audio to a machine readable version “cat.” Similarly, a text to speech engine would read the data “cat”, convert “cat” into its phonetic pronunciation and then generate the appropriate audio for each phoneme and make appropriate adjustments to the “tone of voice” of the rendered speech.
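By way of illustration only, the match or “no-match” decision described above might be sketched as follows. The lexicon, the ARPAbet-style phoneme symbols, the use of edit distance as the matching algorithm, and the threshold `max_distance` are all assumptions made for this example; the description above only requires "some algorithm".

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phoneme from a
                            curr[j - 1] + 1,      # insert a phoneme into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical reference pronunciations (illustrative only).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "dog": ["D", "AO", "G"],
}


def match_utterance(phonemes, lexicon=LEXICON, max_distance=1):
    """Return the closest word in the lexicon, or None for a "no-match"."""
    best_word, best_dist = None, None
    for word, reference in lexicon.items():
        dist = edit_distance(phonemes, reference)
        if best_dist is None or dist < best_dist:
            best_word, best_dist = word, dist
    if best_dist is not None and best_dist <= max_distance:
        return best_word
    return None
```

With this toy lexicon, an exact phoneme sequence for “cat” and a sequence with one substituted vowel both resolve to “cat”, while a sequence far from every entry yields a no-match.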
Pattern matching machines, however, have limitations. Generally, pattern matching machines are used in a speaker independent manner, which means they must accommodate a wide range of voices and which limits the richness of patterns that will provide good matches across a large and diverse population of users.
Pattern matching speech recognition engines are of value because they are deployable and usable relatively rapidly compared to natural language or free form, continuous speech recognition engines. They can recognize simple formulaic responses with good accuracy. However, as they are not overly robust, pattern matching speech recognition is currently of limited value because it cannot handle free form speech, which is akin to pattern matching with an extremely large and complex pattern.
In view of these limitations, speech recognition engines have moved to free form, continuous (or natural language) speech recognition systems. The focus of natural language systems is to match the utterance to a likely vocabulary and phraseology, and to determine how likely the sequence of language symbols is to appear in speech. Continuous speech recognition engines return sequences of words which are the best fit for the audio. For a given sequence of words, the fit is a combination of two scores (or probabilities): one score indicates how well the phonemes for the words match the supplied audio; and the other is the likelihood of that sequence (of words) given the supplied language model (hereinafter “language model” or “LM”). Similar sounding sequences of words will have similar phonemic scores (how well their phonemes match the audio). However, the same similar sounding sequences may have quite different likelihoods when scored by the language model. The LM provides a powerful model to direct a word search based on predecessor words for a span of n words. In other words, a natural language or free form speech recognition engine uses an acoustic model to match phonemes and a language model to determine whether a particular word or set of words is more likely than another word or set of words.
The LM uses probability to select the more likely words for similar sounding utterances. For example, the words “see” and “sea” are pronounced substantially the same in the United States of America. Using the LM, the speech recognition engine would populate the phrase: “Ships sail on the sea” correctly because the probability indicates the word “sea” is more likely to follow the earlier words in the sentence.
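The “sea” versus “see” choice can be sketched as a combination of an acoustic score and a language model score. This is a minimal sketch only: the numeric values, the names `ACOUSTIC_LOGP`, `LM_PROB`, and `lm_weight`, and the choice to add log-probabilities are assumptions made for illustration, not values or interfaces taken from any particular engine.

```python
import math

# Hypothetical acoustic log-scores: the two words sound alike,
# so the acoustic model cannot tell them apart.
ACOUSTIC_LOGP = {"sea": math.log(0.50), "see": math.log(0.50)}

# Hypothetical LM probabilities of each word following "on the".
LM_PROB = {"sea": 0.020, "see": 0.0001}


def combined_score(word, lm_weight=1.0):
    """Overall score = acoustic log-score + weighted LM log-probability."""
    return ACOUSTIC_LOGP[word] + lm_weight * math.log(LM_PROB[word])


def best_word(candidates):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=combined_score)
```

Because the acoustic scores are equal in this example, the language model term alone decides the outcome, and `best_word(["see", "sea"])` selects “sea”.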
The mathematical model which determines what phoneme sequence(s) are the best match to the supplied audio is called the Hidden Markov Model. The details of the hidden Markov model are well known in the industry of speech recognition and will not be further described herein.
Developing a LM, as mentioned above, is reasonably well known in the industry; the details of the development will, therefore, not be discussed herein in detail. However, by way of background, a general overview of the operation of a language model will be explained. Conventionally, the language model is a statistical model of word sequences, but for sake of generality, a model may contain other information that is used in the selection of word sequences, such as domain expertise about what word sequences are possible or not possible in that domain. Without loss of generality, we will use statistical language models as exemplars of language models in our discussion, since they are well known and understood by those familiar with speech recognition. A (statistical) language model is generally calculated from a corpus of words. The corpus, generally, is obtained from written text samples with the assumption that the written text is a good approximation of the spoken word. The corpus may include, among other things, text books, emails, transcriptions, notes, papers, presentations, or the like. Using the corpus, the language model provides summary statistics on the occurrence of unigrams, bigrams, and trigrams (i.e., n-grams up to some cutoff value, which is usually three or four). For example, how often a single word appears in the language (a “unigram”), such as, for example, the word “sea” as identified above, how often a combination of two words in a particular order appear in the language (a “bi-gram”), such as, for example, “the sea”, how often a combination of three words appear in a particular order (a “tri-gram”), such as, for example, “on the sea”, how often a combination of four words in a particular order appear in the language (a “quadra-gram”), such as, for example, “sail on the sea”, and so on. 
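As a rough illustration of the summary statistics just described, the following sketch counts unigrams, bi-grams, and tri-grams in a tiny corpus. The three-sentence corpus, the function name `ngram_counts`, and the simple whitespace tokenization are assumptions for the example; a practical analyzer would also normalize text and convert counts into smoothed probabilities.

```python
from collections import Counter


def ngram_counts(sentences, max_n=3):
    """Count every n-gram (as a tuple of words) for n = 1 .. max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n words across the sentence.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


# Hypothetical toy corpus (illustrative only).
corpus = [
    "ships sail on the sea",
    "we swim in the sea",
    "I see the ships",
]
counts = ngram_counts(corpus)
```

Here `counts[1]` holds the unigram counts, `counts[2]` the bi-gram counts, and `counts[3]` the tri-gram counts; raising `max_n` to 4 would add the four-word sequences as well.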
While the language model can extend to penta-grams, hexa-grams and so on, there is currently a practical limit on the processing power and memory requirements for the speech recognition engine. Also, for simplicity, the technology of the present application will generally be explained with respect to tri-gram engines; although, the technology explained herein is applicable to a language model using any length of word sequences.
In operation, a conventional speech recognition system or engine for continuous speech recognition uses a combination of hidden Markov models (HMMs) and LMs to convert an audio signal into a transcript. The HMMs support the conversion of an acoustic signal into phonemes, while the LMs support the conversion of phonemes into sequences of words. Each model produces a score for a candidate sequence of words. Conventionally, these scores are expressed as probabilities, but they don't have to be. What is important is that you can combine both scores into a single overall score, so that the word sequences with the highest overall scores are the best matches to what was said. While scoring a given sequence of words is straightforward, the task of rapidly generating plausible sequences while discarding implausible ones is the core of the computation inside a recognition engine. In principle, you can combine the audio and language model scores with other scoring systems, e.g., grammar or domain specific knowledge, as is generally known in the art, but not further discussed herein, as it is unnecessary for an understanding of the technology of the present application. As mentioned above, LMs contain the probabilities that a given word will be seen after a (specified) preceding word sequence; the length of the preceding sequence is two (for 3-gram models) or three (for 4-gram models). In general, the length of the preceding sequence is arbitrary. However, due to constraints in computer speed and memory, commonly available recognizers today are limited to either 3-gram (most common) or 4-gram (less common) LMs. The technology of the present application will be explained using 3-gram (trigram) LMs, with the understanding that the discussion readily extends to higher or even lower order LMs.
The LM is prepared from a collection of transcripts known as a corpus as explained above. An analyzer, in the form of a program operating on a processor, discovers all the individual words that occur in the corpus, plus all the bigrams (word pairs) and trigrams (word triplets), and it lists everything it finds in a table. Each entry has the probability of the n-gram plus a back-off weight if the entry is for a 1- or 2-gram, but not for a 3-gram (if the LM is order N, then there are back-off weights for the lower order n-grams where n<N). If you see an n-gram in the table, you know it appeared in the corpus. The purpose of the back-off weights is to give the LM flexibility to estimate the probabilities of bigrams and trigrams that did not occur in the corpus, but could reasonably be expected to occur in real-world use. In other words, there is an implicit assumption that the corpus is “incomplete” and the LM should accommodate this, but how? Let us start with the section of the LM which lists all the different words found in the corpus (i.e., the 1-grams). Pick three of those words at random: w1, w2, w3, and then ask the question “What is the probability predicted by the LM for w3 to occur after w1 and w2?” To find the answer, we look at all the entries in the 3-gram section of the LM to see if there is an entry for “w1 w2 w3”. If there is, we use the probability that was recorded for this trigram. If we cannot find an entry for “w1 w2 w3”, we back off from requiring a 3 word sequence and look for the two word sequence “w2 w3”. If we find an entry for “w2 w3”, we use its probability and multiply by the back-off weight for “w1 w2” (or 1.0 if we cannot find a back-off weight). If there is no entry for “w2 w3”, we back off from using w2 and look up the probability of “w3” in isolation (which is recorded in the unigram section of the model) and multiply by the back-off weight for w2.
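The back-off lookup just described can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary layout, the function name `trigram_prob`, and all numeric values are hypothetical, and plain probabilities are used where real LM files commonly store logarithms. In the unigram case the sketch also carries over the back-off weight for “w1 w2”, which follows the usual ARPA-style convention and is an assumption relative to the description above.

```python
def trigram_prob(lm, w1, w2, w3):
    """Estimate P(w3 | w1 w2) with back-off to the bigram and unigram."""
    prob, backoff = lm["prob"], lm["backoff"]

    # 1. Use the trigram entry if the corpus contained it.
    if (w1, w2, w3) in prob:
        return prob[(w1, w2, w3)]

    # 2. Back off to the bigram "w2 w3", scaled by the back-off
    #    weight for "w1 w2" (1.0 if no weight is recorded).
    bow12 = backoff.get((w1, w2), 1.0)
    if (w2, w3) in prob:
        return bow12 * prob[(w2, w3)]

    # 3. Back off to the unigram "w3", scaled by the back-off weight
    #    for "w2" (and, by assumption, still by the "w1 w2" weight).
    bow2 = backoff.get((w2,), 1.0)
    return bow12 * bow2 * prob[(w3,)]


# Hypothetical toy LM (values are illustrative only).
toy_lm = {
    "prob": {
        ("sea",): 0.01, ("see",): 0.02, ("the",): 0.05, ("on",): 0.03,
        ("the", "sea"): 0.2, ("on", "the"): 0.3,
        ("on", "the", "sea"): 0.6,
    },
    "backoff": {("on", "the"): 0.5, ("the",): 0.4},
}
```

With this toy model, “on the sea” is found directly in the 3-gram section, “in the sea” falls back to the bigram “the sea” with a default weight of 1.0, and “on the see” falls all the way back to the unigram for “see”.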
Recapping: the LM lets us estimate the probability of any n-gram that can be made from the words in the LM. The fact that a probability can be estimated does not mean it is true in the real world; however, it is what the recognition engine will use for its best guess (or guesses if you are using “N-best”) as to what the speaker said.
While a powerful tool, the LM conventionally is generic across multiple similarly situated users, and thus it is at risk of being a “Jack of all trades, and master of none”. In particular, the corpus is often generated from a large volume of material, none of which may have been generated by any of the current users of the speech recognition system or engine using the LM. The LM may include n-grams that the actual speaker may, in fact, never use. Additionally, the LM may not include n-grams that the actual speaker does use. This results in errors and inefficiencies.
Thus, against this background, it is desirable to provide a user specific language model.