1. Field of the Invention
Embodiments of the present invention generally relate to the field of voice recognition software. In particular, embodiments of the invention are related to techniques for improving the accuracy of speech recognition software.
2. Description of the Related Art
Voice recognition systems are used to translate dictated speech into text. Typically, voice recognition systems monitor the sound patterns of a user's speech and match them with words using a predefined dictionary of sound patterns. The result is a prediction of the most probable word (or phrase) that was dictated. For example, voice recognition software may receive input from a microphone attached to a computer. While the user speaks into the microphone, the voice recognition system translates the user's voice patterns into text displayed on a word processor. Another example includes a business call center where callers navigate through a menu hierarchy using verbal commands. As callers speak into the telephone receiver, an automated agent attempts to translate the caller's spoken commands, and initiate some action based thereon.
One goal for a voice recognition system is to operate at a rate comparable to the rate at which a user is speaking. Matching a given voice pattern to the words in a large predefined dictionary, however, can be time consuming and require substantial computer resources. Another goal of a voice recognition system is to maximize accuracy, or conversely, to minimize errors in word translation. An error occurs when the voice recognition system incorrectly translates a voice pattern into the wrong textual word (or phrase). Such an error must be manually corrected, forcing the user to interrupt a dictation session before continuing. Therefore, to maximize the usefulness of the voice recognition system, it is desirable to minimize such errors. These two goals conflict, however, as greater accuracy may cause the voice recognition system to lag behind the rate at which a user dictates into the voice recognition system. If a voice recognition system operates too slowly, users may lack the patience to use the system.
As stated, voice recognition is a computationally intense task. For example, sound patterns may be measured using 10 to 26 dimensions or more, and then analyzed against the words in the dictionary. The more time spent analyzing sound patterns, the more accurate the results may become. Thus, accuracy in speech recognition may be sacrificed for speed of execution. To help compensate for this, many voice recognition systems use a tiered approach to word selection. A first tier, often referred to as a “fast match,” produces a very rough score used to select a set of candidate words (or phrases) that may match a given sound pattern. The voice recognition system then uses a language model to estimate the probability that a particular word (or phrase) was spoken. The voice recognition software reduces the set produced by the “fast match” based on what the language model determines is likely to have been spoken. This reduced set is passed to a much slower “detailed match” algorithm, which selects the best word (or phrase) from the reduced set based on the characteristics of the voice pattern.
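The tiered approach described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in chosen for illustration: the four-dimensional feature vectors, the `templates` dictionary, the `word_probability` table, and the function names are assumptions, and the distance measures are simple placeholders rather than the scoring used by any actual voice recognition system.

```python
# Toy illustration of a three-tier word selection pipeline.
# All data and function names are hypothetical.

def fast_match(pattern, templates, keep=4, coarse_dims=2):
    """Tier 1: rough score using only the first few dimensions."""
    def rough_score(word):
        template = templates[word]
        return sum(abs(p - t)
                   for p, t in zip(pattern[:coarse_dims], template[:coarse_dims]))
    return sorted(templates, key=rough_score)[:keep]

def language_model_filter(candidates, word_probability, keep=2):
    """Tier 2: keep the candidates the language model rates most likely."""
    ranked = sorted(candidates,
                    key=lambda word: word_probability.get(word, 0.0),
                    reverse=True)
    return ranked[:keep]

def detailed_match(pattern, candidates, templates):
    """Tier 3: slower comparison over all dimensions of the reduced set."""
    def squared_distance(word):
        return sum((p - t) ** 2 for p, t in zip(pattern, templates[word]))
    return min(candidates, key=squared_distance)

def recognize(pattern, templates, word_probability):
    candidates = fast_match(pattern, templates)
    reduced = language_model_filter(candidates, word_probability)
    return detailed_match(pattern, reduced, templates)

# Hypothetical sound-pattern templates (assumed 4-dimensional features).
templates = {
    "write": [1.0, 2.0, 3.0, 4.0],
    "right": [1.0, 2.0, 3.5, 4.5],
    "rite":  [1.1, 2.1, 0.0, 0.0],
    "light": [5.0, 5.0, 3.0, 4.0],
    "ride":  [1.2, 2.2, 9.0, 9.0],
}
# Hypothetical language model probabilities for each word.
word_probability = {"write": 0.3, "right": 0.4, "rite": 0.01,
                    "light": 0.2, "ride": 0.09}

pattern = [1.0, 2.0, 3.4, 4.4]
best_word = recognize(pattern, templates, word_probability)
```

In this sketch the expensive full-dimension comparison runs only on the two candidates that survive the first two tiers, which is the trade-off between speed and accuracy that the tiered approach is meant to manage.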
Additionally, many fields (e.g., the legal and medical professions) have their own distinct vocabulary. Accordingly, one approach to improving the accuracy of a language model has been to provide a special dictionary containing a selection of industry terms. Further, the probability of particular words being spoken may be adjusted within the language model for a group of professionals in a given field. Thus, these approaches improve the accuracy of a voice recognition system, not by providing a better understanding of a speaker's voice patterns, but by doing a better job of understanding what the speaker is likely to say (relative to what has already been said). Similarly, many voice recognition systems are configured to scan documents authored by a given user. Such voice recognition systems adjust the language model to more accurately calculate how often a word is likely to be spoken by that given user.
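One way to picture adjusting a language model from a user's documents is to blend baseline word probabilities with word frequencies counted in those documents. The sketch below assumes simple linear interpolation of single-word probabilities. That technique, the `adapt_unigram_model` name, the `user_weight` parameter, and the sample data are all assumptions made for illustration, not a description of how any particular system performs the adjustment.

```python
from collections import Counter

def adapt_unigram_model(base_probability, user_documents, user_weight=0.5):
    """Blend baseline word probabilities with a user's own word frequencies.

    Assumes linear interpolation: (1 - user_weight) * base + user_weight * user.
    """
    counts = Counter(word
                     for document in user_documents
                     for word in document.lower().split())
    total = sum(counts.values())
    if total == 0:
        # No user text to learn from, so keep the baseline unchanged.
        return dict(base_probability)

    adapted = {}
    for word in set(base_probability) | set(counts):
        user_p = counts[word] / total
        base_p = base_probability.get(word, 0.0)
        adapted[word] = (1 - user_weight) * base_p + user_weight * user_p
    return adapted

# Hypothetical baseline model and a hypothetical document from a legal user.
base = {"the": 0.6, "party": 0.3, "patient": 0.1}
documents = ["the patient the plaintiff"]
adapted = adapt_unigram_model(base, documents)
```

With these inputs, a word such as "plaintiff" that is absent from the baseline receives a nonzero probability, and the adapted probabilities still sum to one because both inputs to the blend do.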
Currently, the most common language models are n-gram models, which assume that the probability of a word sequence can be decomposed into conditional probabilities for a given word, based on the words that preceded it. In the context of an n-gram language model, a trigram is a string of three consecutive words. Similarly, a bigram is a string of two consecutive words, and a unigram is a single word. The conditional probability of a trigram may be expressed using the following notation: Prob (w1 | w2, w3), which may be interpreted as “the probability that the word w1 will follow the words w2 and w3, in order.”
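As a minimal sketch of how a trigram probability of this form might be estimated, the example below counts trigrams in a small body of text and divides by the count of the two-word history. It assumes a plain maximum-likelihood estimate with no smoothing, and the function names and sample text are hypothetical.

```python
from collections import Counter

def train_trigram_counts(tokens):
    """Count each trigram and each two-word history that precedes a word."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Only histories that are followed by a word are counted, so the
    # estimated probabilities for a given history sum to one.
    history_counts = Counter(zip(tokens, tokens[1:-1]))
    return trigram_counts, history_counts

def trigram_probability(word, history, trigram_counts, history_counts):
    """Estimate Prob(word | history), where history is the two prior words."""
    first, second = history
    seen = history_counts[(first, second)]
    if seen == 0:
        return 0.0  # Unseen history; a real system would smooth or back off.
    return trigram_counts[(first, second, word)] / seen

# Hypothetical training text.
tokens = ("please send the report please send the invoice "
          "please send a reply").split()
trigram_counts, history_counts = train_trigram_counts(tokens)

# Probability that "the" follows the words "please" and "send", in order.
p_the = trigram_probability("the", ("please", "send"),
                            trigram_counts, history_counts)
```

In the sample text the history "please send" occurs three times and is followed by "the" twice, so the estimate for that trigram is two thirds.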
Additionally, current voice recognition systems rely on a “dictation-based” language model. That is, the voice recognition systems often rely only on the person dictating words for translation. At the same time, as processing power improves, voice recognition systems are finding broader applications. For example, many computer users rely on “instant messaging” (IM) applications for exchanging short messages of text with other users. A voice recognition system may be used in conjunction with an IM application. That is, a voice recognition system can be used to translate the spoken word into text, which is thereafter entered into the IM application. In an IM application session, a “conversation” may take place between two or more people entirely in a text-based form. IM applications are available for virtually any computer system available today, and are also available on many other devices such as PDAs and mobile phones. When a user engages in an IM text-based “conversation” with another user, using a voice recognition system to translate spoken word into text, the voice recognition system may still rely on the single-user, dictation-based language model, despite the reality that the user may be engaging in a conversation that includes other participants. Because word usage probabilities may be dramatically different depending on the context of a “dictation” session and a “conversation” session, an n-gram language model may produce an unacceptable percentage of mistakes in translating one-half of a text-based conversation (e.g., an IM session between two conversation participants).
Accordingly, even using the n-gram language model with the adjustments described above, voice recognition systems still produce a substantial number of mismatches between voice patterns and the resulting translated text. Therefore, there remains a need for methods that will improve the accuracy of a voice recognition system.