1. Field
The technology of the present application relates generally to speech recognition systems, and more particularly, to apparatuses and methods to update a language model associated with speech recognition systems for a number of similarly situated users dynamically rather than statically.
2. Background
The primary means for communication between people is speech. Since the early 1980s, significant progress has been made to allow people to interface with machines using speech through interfaces such as speech to text engines and text to speech engines. The former converts speech to a machine (and user) readable format; the latter converts machine readable code to audio signals for people to hear.
Early speech to text engines operated on a theory of pattern matching. Generally, these machines would record utterances spoken by a person, convert them into phoneme sequences, and match these sequences to known words or phrases. For example, the audio of “cat” might produce the phoneme sequence “k ae t”, which matches the standard pronunciation of the word “cat”. Thus, the pattern matching speech recognition machine converts the audio to a machine readable version “cat.” Similarly, a text to speech engine would read the word “cat” and convert it into a sequence of phonemes, each of which has a known audio signal; when concatenated (and appropriately shaped), these signals produce the sound of “cat” (phonetically: “k ae t”). Pattern matching machines, however, are not particularly robust. Generally, pattern matching machines either operate with a high number of recognizable utterances for a limited number of users or operate with a higher number of users but a more limited number of recognizable utterances.
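The pattern matching approach described above can be sketched as a simple table lookup. The following is a minimal illustration only, assuming a hypothetical pronunciation table (PRONUNCIATIONS) and hypothetical function names; it starts from an already decoded phoneme sequence and does not attempt the audio processing itself.

```python
# Hypothetical pronunciation table: phoneme sequence -> known word.
# The entries and the phoneme spellings are illustrative assumptions.
PRONUNCIATIONS = {
    ("k", "ae", "t"): "cat",
    ("d", "ao", "g"): "dog",
    ("s", "iy"): "sea",
}

# Reverse table for the text to speech direction: word -> phoneme sequence.
PHONEMES_BY_WORD = {word: phonemes for phonemes, word in PRONUNCIATIONS.items()}


def match_phonemes(phonemes):
    """Speech to text direction: return the known word for a phoneme
    sequence, or None when the sequence is not in the table."""
    return PRONUNCIATIONS.get(tuple(phonemes))


def word_to_phonemes(word):
    """Text to speech direction: return the phoneme sequence for a known
    word, or None when the word is not in the table."""
    return PHONEMES_BY_WORD.get(word)
```

In this sketch, match_phonemes(["k", "ae", "t"]) returns "cat", while any sequence absent from the table returns None. That second case illustrates the robustness limit noted above: an utterance is only recognized if it was anticipated in advance.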
More recently, speech recognition engines have moved to continuous or natural language speech recognition. The focus of natural language systems is to match the utterance to a likely vocabulary and phraseology and determine how likely the sequence of language symbols would appear in speech. The component that determines the likelihood of a particular sequence of language symbols is generally called a language model. The language model provides a powerful statistical model to direct a word search based on predecessor words for a span of n words. Thus, the language model uses probability to select the statistically more likely word among similar sounding utterances. For example, the words “see” and “sea” are pronounced substantially the same in the United States of America. Using a language model, the speech recognition engine would populate the phrase “Ships sail on the sea” correctly because the probability indicates the word “sea” is more likely to follow the earlier words “ships” and “sail” in the sentence. The mathematics behind the natural language speech recognition system are conventionally known as a hidden Markov model. The hidden Markov model is a system that predicts the next state based on the previous states in the system and the limited number of choices available. The details of the hidden Markov model are reasonably well known in the industry of speech recognition and will not be further described herein.
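As an illustration of how predecessor words can steer the choice between “see” and “sea”, the following is a minimal sketch of a bigram (n = 2) language model with add-one smoothing. The toy corpus, the function names, and the choice of a bigram model are assumptions made for this example; it is not the hidden Markov model used by an actual recognition engine.

```python
from collections import Counter

# Toy training corpus (an illustrative assumption, not real usage data).
CORPUS = [
    "ships sail on the sea",
    "the sea is calm",
    "i see the ships",
    "we see the shore",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs in the corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing so that unseen
    pairs receive a small, non-zero probability."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def pick_homophone(prev, candidates, unigrams, bigrams):
    """Choose the candidate most likely to follow the predecessor word."""
    return max(
        candidates,
        key=lambda w: bigram_probability(prev, w, unigrams, bigrams),
    )
```

With this toy corpus, pick_homophone("the", ["see", "sea"], *train_bigrams(CORPUS)) selects "sea", while the predecessor "i" selects "see". A model spanning more predecessor words could also draw on “ships” and “sail”, as in the example above.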
Generally speaking, speech recognition engines using natural language have users register with an account. More often than not, the speech recognition application and database are downloaded to the local device, making it a fat or thick client. In some instances, the user has a thin client where the audio is routed to a server that has the application and database that allow speech recognition to occur. The client account provides an audio profile and language model that are tuned to a particular user's voice and speech. The initial training of a natural language speech recognition engine generally uses a number of “known” words and phrases that the user dictates. The statistical algorithms that map audio signals to phonemes are modified to match the user's voice. Subsequent training of the speech recognition engine may be individualized by corrections entered by a user to transcripts when the transcribed speech is incorrect. While any individual user's speech recognition engine is effectively trained to the individual, the training of the language model is potentially inefficient in that common phrases and the like for similarly situated users must be input individually for each installed engine and/or each user. Moreover, changes in language modeling that a single user identifies and that would be useful for multiple similarly situated users cannot be propagated through the speech recognition system without a new release of the application and database.
Thus, against this background, it is desirable to develop improved apparatuses and methods to update a language model in a speech recognition system.