1. Field of the Invention
Embodiments of the present invention generally relate to the field of voice processing. In particular, embodiments of the invention are related to methods, systems, and articles of manufacture used to improve the accuracy of speech recognition software.
2. Description of the Related Art
Voice processing systems are used to translate dictated speech into text. Typically, voice processing systems monitor the sound patterns of a user's speech and match them with words using a predefined dictionary of sound patterns. The result is a prediction of the most probable word (or phrase) that was dictated. For example, voice processing software may receive input from a microphone attached to a computer. While the user speaks into the microphone, the voice processing system translates the user's voice patterns into text displayed on a word processor. Another example includes a business call center where callers navigate through a menu hierarchy using verbal commands. As callers speak into the telephone receiver, an automated agent attempts to translate the caller's spoken commands, and initiate some action based thereon.
One goal is for a voice processing system to operate at a rate comparable to the rate at which a user is speaking. Matching a given voice pattern to the words in a large predefined dictionary, however, can be time consuming and require substantial computational resources. Another goal of a voice processing system is to maximize accuracy, or conversely, to minimize errors in word translation. An error occurs when the voice processing system incorrectly translates a voice pattern into the wrong textual word (or phrase). Such an error must be manually corrected, forcing the user to interrupt a dictation session before continuing. Therefore, to maximize the usefulness of the voice processing system, it is desirable to minimize such errors. These two goals conflict, however, as greater accuracy may cause the voice processing system to lag behind the rate at which a user dictates into the voice processing system.
As stated, speech recognition is a computationally intense task. For example, sound patterns may be measured using 10 to 26 dimensions or more, and then analyzed against the words in the dictionary. Thus, accuracy in speech recognition may be sacrificed for speed of execution. For example, many voice recognition systems use a tiered approach to word selection. A first tier, often referred to as a “fast match,” produces a very rough score used to select a set of candidate words (or phrases) corresponding to a given sound pattern. The voice recognition system then uses a language model to estimate the probability that a particular word (or phrase) was spoken. The speech recognition software reduces the set produced by the “fast match,” based on what the language model determines is likely to have been spoken. This reduced set is passed to a much slower “detailed match” algorithm, which selects the best word (or phrase) from the reduced set, based on the characteristics of the voice pattern.
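The tiered approach described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of any particular voice recognition system: the function name `recognize`, the toy dictionary, the use of a single feature dimension for the fast match, and the squared-distance detailed match are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, Tuple

Pattern = Tuple[float, ...]

def recognize(
    pattern: Pattern,
    history: Tuple[str, ...],
    dictionary: Dict[str, Pattern],
    lm_prob: Callable[[Tuple[str, ...], str], float],
    fast_k: int = 3,
    lm_k: int = 2,
) -> str:
    """Hypothetical three-tier word selection (all scoring rules are assumptions)."""
    # Tier 1, "fast match": a very rough score that compares only the first
    # feature dimension and keeps the fast_k closest dictionary words.
    candidates = sorted(
        dictionary, key=lambda w: abs(dictionary[w][0] - pattern[0])
    )[:fast_k]

    # Tier 2, language model: keep the lm_k candidates judged most likely
    # to be spoken given the preceding words.
    reduced = sorted(
        candidates, key=lambda w: lm_prob(history, w), reverse=True
    )[:lm_k]

    # Tier 3, "detailed match": a slower comparison over every dimension,
    # applied only to the reduced set.
    def detailed_distance(word: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(dictionary[word], pattern))

    return min(reduced, key=detailed_distance)

# Toy data, invented for illustration only.
TOY_DICTIONARY: Dict[str, Pattern] = {
    "must": (1.0, 0.0),
    "mast": (1.1, 0.9),
    "dust": (3.0, 0.0),
    "musk": (0.9, 1.0),
}
TOY_LM = {"must": 0.6, "mast": 0.05, "dust": 0.3, "musk": 0.05}

def toy_lm_prob(history: Tuple[str, ...], word: str) -> float:
    # A real language model would condition on history; this toy one does not.
    return TOY_LM.get(word, 0.0)
```

In this sketch, the expensive full-dimensional comparison runs on at most `lm_k` words rather than on the whole dictionary, which is the speed benefit of the tiered design. The cost is that a word removed in the first or second tier can never be selected, however well it matches the voice pattern.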
Additionally, many fields (e.g., the legal and medical professions) have their own distinct vocabulary. Accordingly, one approach to defining a language model has been to provide a special dictionary that contains some additional industry terms. Further, the probability of particular words being spoken may be adjusted within the language model for a group of professionals in a given field. Thus, these approaches improve the accuracy of a voice recognition system, not by providing a better understanding of a speaker's voice patterns, but by doing a better job of understanding what the speaker is likely to say (relative to what has already been said). Similarly, many voice processing systems also provide a feature configured to scan documents authored by a given user, and adjust the language model to more accurately calculate how often a word is likely to be spoken by that given user.
Currently, the most common language models are n-gram models, which assume that the probability of a word sequence can be decomposed into conditional probabilities for a given word, based on the words that preceded it. In the context of an n-gram language model, a trigram is a string of three consecutive words. Similarly, a bigram is a string of two consecutive words, and a unigram is a single word. The conditional probability of a trigram may be expressed using the following notation: Prob (w1, w2, w3), which may be interpreted as “the probability that the word w3 will follow the words w1 and w2, in order.”
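As an illustration of the trigram notation above, the following sketch estimates the probability that w3 follows w1 and w2 from raw counts in a word sequence. The function name `trigram_prob`, the sample text, and the use of an unsmoothed relative-frequency estimate are assumptions made for this example; practical language models typically apply smoothing so that unseen word sequences do not receive a probability of zero.

```python
from collections import Counter
from typing import List

def trigram_prob(tokens: List[str], w1: str, w2: str, w3: str) -> float:
    """Relative-frequency estimate that w3 follows (w1, w2) in tokens."""
    # Count every run of three consecutive words.
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

    # Count how often (w1, w2) occurs with any following word.
    context_total = sum(
        count for (a, b, _), count in trigrams.items() if (a, b) == (w1, w2)
    )
    if context_total == 0:
        return 0.0  # context never observed; a real model would smooth here

    return trigrams[(w1, w2, w3)] / context_total

# Sample text invented for illustration only.
SAMPLE = "it must be done and it must be said but it must wait".split()
```

In the sample text, the words "it must" appear three times and are followed by "be" twice, so `trigram_prob(SAMPLE, "it", "must", "be")` evaluates to 2/3 under this estimate.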
Thus, in some cases, even if a given voice pattern provides an excellent match for an uncommonly used word (or phrase), the system may select an inferior match with a higher probability in the n-gram language model. For example, even though a voice pattern may match a word like “Muskie” better than it matches “must be”, the probability that “must be” will be dictated is so much higher that a language model whose probabilities are not correctly adjusted for the current real-life situation may actually discard (or not select) “Muskie” when the voice system reduces the set produced by the fast match. As another simple example, when a user dictates the word “Ivan”, the voice processing system may incorrectly translate the voice pattern using another, more probable, word in the dictionary, such as “I've been”.
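The effect described above can be shown with a small worked example. The scores below are invented purely for illustration, and combining an acoustic match score with a language model probability by adding their logarithms is an assumed scoring rule, not one taken from any specific system; the name `best_candidate` is likewise hypothetical.

```python
import math
from typing import Dict

def best_candidate(acoustic: Dict[str, float], lm: Dict[str, float]) -> str:
    """Pick the candidate with the highest combined log score (assumed rule)."""
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(lm[w]))

# Hypothetical scores: the voice pattern matches "Muskie" better acoustically.
ACOUSTIC = {"Muskie": 0.9, "must be": 0.6}

# A general-purpose language model considers "must be" far more probable.
GENERIC_LM = {"Muskie": 0.0001, "must be": 0.02}

# A language model adjusted for a context where "Muskie" is commonly spoken.
ADJUSTED_LM = {"Muskie": 0.05, "must be": 0.02}
```

With the general-purpose probabilities, `best_candidate(ACOUSTIC, GENERIC_LM)` returns "must be" despite the better acoustic match for "Muskie". With the adjusted probabilities, `best_candidate(ACOUSTIC, ADJUSTED_LM)` returns "Muskie", which illustrates why the language model probabilities must reflect the current situation.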
Accordingly, even using the n-gram language model with the adjustments described above, voice recognition systems still produce a substantial number of mismatches between voice patterns and the resulting translated text. Therefore, there remains a need for methods that will improve the accuracy of a voice recognition system.