Field
The technology of the present application relates generally to speech recognition systems, and more particularly, to apparatuses and methods that allow for deployment of a speech recognition engine initially using a pattern matching recognition engine, which in turn allows for training of and eventual conversion to a speech recognition engine that uses natural language.
Background
Early speech to text engines operated on a theory of pattern matching. Generally, these machines would record utterances spoken by a person, convert the audio into a sequence of possible phonemes, and then find a sequence of words that is allowed by the pattern and which is the closest, or most likely, match to the sequence of possible phonemes. For example, a person's utterance of “cat” provides a sequence of phonemes. These phonemes can be matched to a reference phonetic pronunciation of the word “cat”. If the match is exact or close (according to some algorithm), the utterance is deemed to match “cat”; otherwise, it is a so-called “no-match”. Thus, the pattern matching speech recognition machine converts the audio file to a machine readable version, “cat.” Similarly, a text to speech engine would read the data “cat”, convert “cat” into its phonetic pronunciation, and then generate the appropriate audio for each phoneme and make appropriate adjustments to the “tone of voice” of the rendered speech. Pattern matching machines, however, have limitations. Generally, pattern matching machines are used in a speaker independent manner, which means they must accommodate a wide range of voices. This requirement limits the richness of patterns that will provide good matches across a large and diverse population of users.
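By way of illustration only, the match or “no-match” decision described above can be sketched as follows. The sketch assumes a small hypothetical lexicon of reference pronunciations, uses edit distance as the closeness measure, and applies an arbitrary threshold; none of these choices is taken from any particular engine, and the names `LEXICON`, `edit_distance`, and `match_word` are introduced here purely for the example.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical reference pronunciations (ARPAbet-style symbols), for illustration only.
LEXICON: Dict[str, List[str]] = {
    "cat": ["K", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "dog": ["D", "AO", "G"],
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_word(phonemes: Sequence[str],
               lexicon: Dict[str, List[str]] = LEXICON,
               max_distance: int = 1) -> Optional[str]:
    """Return the closest lexicon word, or None for a "no-match".

    max_distance is an assumed threshold for what counts as "close".
    """
    best_word: Optional[str] = None
    best_dist = max_distance + 1
    for word, reference in lexicon.items():
        dist = edit_distance(phonemes, reference)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```

Under these assumptions, `match_word(["K", "AE", "T"])` yields “cat”, while a sequence such as `["S", "IY"]` lies too far from every reference pronunciation and is reported as a no-match (`None`).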
Pattern matching speech recognition engines are of value because they are deployable and usable relatively rapidly compared to natural language speech recognition. However, because it is not especially robust, pattern matching speech recognition is currently of limited value: it cannot handle free form speech, which is akin to pattern matching with an extremely large and complex pattern.
In view of these limitations, speech recognition engines have moved to continuous, or natural language, speech recognition. Natural language systems focus on matching the utterance to a likely vocabulary and phraseology, and on determining how likely a given sequence of language symbols is to appear in speech. The model used to determine the likelihood of a particular sequence of language symbols is generally called a language model. The language model provides a powerful statistical model to direct a word search based on predecessor words for a span of n words. Thus, where several words fit similar utterances, the language model favors the statistically more likely word. For example, the words “see” and “sea” are pronounced substantially the same in the United States of America. Using a language model, the speech recognition engine would populate the phrase “Ships sail on the sea” correctly because the probability indicates the word “sea” is more likely to follow the earlier words in the sentence. The mathematics behind the natural language speech recognition system is conventionally known as the hidden Markov model. The hidden Markov model is a system that predicts the value of the next state based on the previous states in the system and the limited number of choices available. The details of the hidden Markov model are reasonably well known in the industry of speech recognition and will not be further described herein.
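As a minimal sketch of how a language model can separate “see” from “sea”, the following example estimates bigram probabilities (a span of n = 2 words) from a tiny hypothetical corpus and picks the candidate more likely to follow the preceding word. The corpus, the add-one smoothing, and the names `train_bigrams`, `bigram_prob`, and `pick_homophone` are assumptions made for illustration; a production engine would use far larger training data and a longer word history.

```python
from collections import Counter
from typing import Iterable, Sequence, Set, Tuple

# Tiny hypothetical training corpus, for illustration only.
CORPUS = [
    "ships sail on the sea",
    "the sea is calm today",
    "i see the ships",
    "we see the harbor",
]

Model = Tuple[Counter, Counter, Set[str]]


def train_bigrams(sentences: Iterable[str]) -> Model:
    """Count how often each word follows each predecessor word."""
    contexts: Counter = Counter()  # times a word appears as a predecessor
    bigrams: Counter = Counter()   # times a (predecessor, word) pair appears
    vocab: Set[str] = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, vocab


def bigram_prob(prev: str, word: str, model: Model) -> float:
    """P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def pick_homophone(prev: str, candidates: Sequence[str], model: Model) -> str:
    """Choose the candidate most likely to follow the predecessor word."""
    return max(candidates, key=lambda w: bigram_prob(prev, w, model))


model = train_bigrams(CORPUS)
choice = pick_homophone("the", ["see", "sea"], model)
```

With this corpus, “sea” follows “the” twice and “see” never does, so `choice` is “sea”; given the predecessor “i”, the same function selects “see”.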
Generally speaking, speech recognition engines using natural language have users register with an account. More often than not, the speech recognition engine downloads the application and database to the local device, making it a fat or thick client. In some instances, the user has a thin client where the audio is routed to a server that has the application and database that allow speech recognition to occur. The client account provides a generic language model that is tuned to a particular user's dialect and speech. The initial training of a natural language speech recognition engine generally uses a number of “known” words and phrases that the user dictates. The statistical algorithms are modified to match the user's speech patterns. Subsequent modifications of the speech recognition engine may be individualized by corrections that a user enters when a transcript is returned with errors. While any individual user's speech recognition engine is effectively trained to the individual, the training of the language model is inefficient in that common phrases and the like for similarly situated users must be input individually for each installed engine. Moreover, changes that a single user identifies that would be useful for multiple similarly situated users cannot be propagated through the speech recognition system without a new release of the application and database.
While significantly more robust, natural language speech recognition engines generally require training to a particular user's speech patterns, dialect, etc., to function properly, and the training is often time consuming and tedious. Moreover, natural language speech recognition engines that are not properly trained frequently make mistakes, causing frustration and inefficiency for the users. In some cases, this may lead the user to discontinue use of the natural language speech recognition engine.
Thus, against this background, it is desirable to develop improved apparatuses and methods for deployment and training of natural language speech recognition engines.