1. Technical Field
The present invention relates to voice recognition and in particular to speaker independent, continuous speech, word spotting voice recognition.
2. Background Art
Voice recognition engines analyze digitized speech files and return to their calling programs the resulting word or phrase. A digitized speech file contains a digital representation of an utterance. An utterance is a single body of speech spoken by a person. Such an utterance can be a word, a sentence, a paragraph, or an even larger body of speech.
Voice recognition engines which translate utterances can be classified by two characteristics: the range of speakers who can use the system and the type of speech that the system can accept. Voice recognition engines can be either speaker dependent or speaker independent. Speaker dependent engines must be "trained" on a speaker's voice before they can recognize that speaker's speech. Speaker independent engines can recognize speech from any speaker without needing to be "trained" on that speaker's voice.
Voice recognition engines can also accept either continuous or discrete speech. Engines which use discrete speech require that the speaker pause between every word for at least 1/10th of a second. Continuous speech engines allow the speaker to talk at a normal rate of up to 200 words per minute. There are two ways of handling continuous speech. In the first, an utterance is compared against a library of phrases, and the phrase that is closest to the utterance is returned. In the second, word spotting is used, in which the speech recognition engine examines whole speech segments to identify any occurrences of the words the engine has been instructed to look for. The set of words the engine is looking for is known as the "Active Vocabulary." Word spotting is considerably harder to perform than the first method because the word or words to be identified have to be extracted from a messy, seamless stream of phonemes regardless of the placement of the words in the utterance, the order of the words, or the general shape or quality of the phonemes in the digitized speech file (such as may result from the slurred or accented voice of the speaker).
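The contrast between the two ways of handling continuous speech can be illustrated with a minimal sketch. It is only an analogy: it operates on already-transcribed text rather than on a stream of phonemes, and the function names, the example phrases, and the use of a string-similarity ratio are all assumptions made for illustration, not part of any engine described here.

```python
from difflib import SequenceMatcher


def closest_phrase(utterance, phrase_library):
    """First approach: return the library phrase most similar to the whole utterance."""
    def similarity(phrase):
        return SequenceMatcher(None, utterance.lower(), phrase.lower()).ratio()
    return max(phrase_library, key=similarity)


def spot_words(utterance, active_vocabulary):
    """Second approach: report every active-vocabulary word found anywhere in the utterance.

    Returns (position, word) pairs, independent of where the words occur
    or in what order they were spoken.
    """
    active = {word.lower() for word in active_vocabulary}
    hits = []
    for position, token in enumerate(utterance.lower().split()):
        word = token.strip(".,!?;:")  # drop punctuation attached to the token
        if word in active:
            hits.append((position, word))
    return hits


if __name__ == "__main__":
    library = ["check balance", "transfer funds", "withdraw cash"]
    print(closest_phrase("check my balance please", library))

    vocabulary = ["balance", "withdraw", "deposit"]
    print(spot_words("um I would like to withdraw cash and then check my balance", vocabulary))
```

In this toy form, phrase matching always returns exactly one library entry, while word spotting returns zero or more hits from anywhere in the utterance. A real engine faces the much harder problem of doing the latter on unsegmented audio.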
Currently, voice recognition engines that perform speaker independent, continuous word spotting are limited to active vocabularies of 50 words or fewer. This means that these types of engines can look for, i.e. "spot", a maximum of 50 distinct words at the same time when analyzing an utterance. This limit of 50 words relegates continuous speech word spotting engines to carefully controlled, almost contrived, applications, such as spotting the digits 0 through 9 or word spotting an item from a small list of choices.
At this point, it is important to understand the difference between the resident vocabulary and the active vocabulary of a word spotting voice recognition engine. The resident vocabulary is the set of words that the engine can have stored in memory and available for use when they are needed. However, the engine is unable to look for all of those words at the same time. Therefore, only some of the words in the resident vocabulary are activated, and the remaining words stay inactive. The size of the active vocabulary is the number of words that the engine can simultaneously look for in an utterance. If the speaker uses any words that have not been activated by the engine, these words will not be recognized.
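The relationship between the resident and active vocabularies can be sketched as a small data structure. The class name `Vocabulary`, the `max_active` limit, and the example words are hypothetical and chosen only for illustration. Recognition is again simulated on text tokens rather than on speech.

```python
class Vocabulary:
    """Toy model of a resident vocabulary with a size-limited active subset."""

    def __init__(self, resident_words, max_active):
        self.resident = {word.lower() for word in resident_words}
        self.max_active = max_active  # assumed cap on simultaneously active words
        self.active = set()

    def activate(self, words):
        """Replace the active vocabulary with the given resident words."""
        requested = {word.lower() for word in words}
        unknown = requested - self.resident
        if unknown:
            raise ValueError(f"not in resident vocabulary: {sorted(unknown)}")
        if len(requested) > self.max_active:
            raise ValueError(f"at most {self.max_active} words can be active at once")
        self.active = requested

    def recognize(self, spoken_words):
        """Return only the spoken words that are currently active."""
        return [word.lower() for word in spoken_words if word.lower() in self.active]


if __name__ == "__main__":
    vocab = Vocabulary(["yes", "no", "deposit", "withdraw", "balance"], max_active=2)
    vocab.activate(["yes", "no"])
    # "deposit" is resident but inactive, so it is not recognized.
    print(vocab.recognize(["Yes", "deposit", "no"]))
```

Here a word such as "deposit" is stored in memory yet goes unrecognized until it is activated, which is the distinction the paragraph above draws.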
Speaker dependent, discrete speech voice recognition engines, unlike continuous speech, word spotting systems, currently can have large active vocabularies that may contain many thousands of words. The primary disadvantage of such discrete speech systems is that they force the speaker to speak in a very unnatural, tiring manner. Except for a few limited cases of dictation, this makes them unusable for most commercial applications. A second disadvantage is that such systems force the user to train the voice recognition engine on his or her voice for several hours and then require the user to pause between each word while speaking slowly. These systems can only accept such broken speech at a maximum rate of 100 words per minute. In a normal speaking situation, nobody pauses between every word, and most people speak at rates between 120 and 200 words a minute.
Another disadvantage of such speaker dependent systems is that there are very few commercial applications that can afford the time it takes to train the engine on a speaker's voice. For example, asking the user of an automated teller machine to undergo a training session for an hour or two before using the system is completely infeasible. In fact, any system that requires training time at all is commercially useless. This is why the vast majority of possible applications of speech recognition have not yet been realized.
The prior art has failed to provide an advanced voice recognition system which accepts speaker independent, continuous speech utterances and can accurately perform word spotting on the utterances based upon an active vocabulary of thousands of words. Such a system must be able to respond to a user in real-time.