E. W. Brown, et al., observe that humans understand speech with ease and use speech to express complex ideas, information, and knowledge. See, IBM Systems Journal, Volume 40, Number 4, 2001, which is accessible at http://www.research.ibm.com/journal/sj/404/brown.html. Unfortunately, automatic speech recognition with computers is very hard, and extracting knowledge from speech is even more difficult. Nevertheless, the potential reward for solving this problem is worth the effort. Major speech recognition applications include dictation or document creation systems, navigation or transactional systems such as automated interactive voice response systems, and multimedia indexing systems.
In dictation systems, the words spoken by a user are transcribed verbatim into text to create documents, e.g., personal letters, business correspondence, etc. In navigation systems, the words spoken by a user may be used to follow links on the Web or to navigate around an application. In transactional systems, the words spoken by a user are used to conduct a transaction such as a stock purchase, banking transaction, etc. In multimedia indexing systems, speech recognition is used to transcribe the audio into text, and information retrieval techniques are subsequently applied to create an index with time offsets into the audio.
Advances in technology are bringing closer the goal of allowing any individual to speak naturally to a computer on any topic and have the computer accurately understand what was said. However, that goal has not yet been reached. Even state-of-the-art continuous speech recognition systems require the user to speak clearly, enunciate each syllable properly, and have his or her thoughts in order before starting. Factors inhibiting the pervasive use of speech technology today include the lack of general-purpose, high-accuracy continuous-speech recognition, the lack of systems that support the synergistic use of speech input with other forms of input, and the challenges of designing speech user interfaces that increase user productivity while tolerating speech recognition inaccuracies.
Speech recognition systems are typically based on hidden Markov models (HMMs), which represent speech events, e.g., a word, statistically, and whose model parameters are trained on a large speech database. Given a trained set of HMMs, there exists an efficient algorithm for finding the most likely word sequence when presented with unknown speech data. The recognition vocabulary and its size play a key role in determining the accuracy of a system. A vocabulary defines the set of words or phrases that can be recognized by a speech engine. A small vocabulary system may limit itself to a few hundred words, whereas a large vocabulary system may comprise tens of thousands of words. Dictation systems are typically large vocabulary applications, whereas navigation and transactional systems are typically small vocabulary applications. Multimedia indexing systems could be either large vocabulary or small vocabulary applications.
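The text does not name the efficient decoding algorithm. The usual choice for HMMs is the Viterbi dynamic-programming algorithm, and that is what the following minimal sketch assumes. The two hidden states, the three acoustic symbols, and all probabilities are hypothetical toy values chosen only for illustration; a real recognizer would use trained models over acoustic feature vectors.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_probability, best_state_path) for an observation sequence."""
    # best[t][s] = (log-probability of the best path ending in s at time t, previous state)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    log_prob = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return log_prob, path

# Hypothetical toy model (not from the source): two phone-like states
# emitting three coarse acoustic symbols.
states = ["AH", "S"]
start_p = {"AH": 0.6, "S": 0.4}
trans_p = {"AH": {"AH": 0.7, "S": 0.3}, "S": {"AH": 0.4, "S": 0.6}}
emit_p = {
    "AH": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "S": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

log_prob, path = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

Working in log-probabilities avoids numeric underflow on long utterances; the sketch assumes no zero probabilities in the model.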
E. W. Brown, et al., further say that speech recognition accuracy is typically scored in terms of word error rate (WER). WER can be defined as the sum of word insertion, substitution, and deletion errors divided by the total number of correctly decoded words. The WER can vary dramatically depending on the nature of the audio recordings.
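As an illustration of how insertion, substitution, and deletion errors can be counted, the following minimal sketch aligns a reference transcript with a recognizer hypothesis using word-level edit distance. Note one assumption: the rate computed here is normalized by the number of words in the reference transcript, which is the commonly used convention and differs from the denominator quoted above. The error counts are returned separately so that another denominator can be applied. The function names and example sentences are hypothetical.

```python
def count_word_errors(reference, hypothesis):
    """Count substitutions, deletions, and insertions in a minimum-error alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (errors, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        table[i][0] = (i, 0, i, 0)      # delete every reference word
    for j in range(len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)      # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # correct word
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = table[-1][-1]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "reference_words": len(ref),
    }

def word_error_rate(reference, hypothesis):
    """WER normalized by reference length (an assumed, commonly used convention)."""
    c = count_word_errors(reference, hypothesis)
    errors = c["substitutions"] + c["deletions"] + c["insertions"]
    return errors / c["reference_words"]

counts = count_word_errors("book a flight to boston", "book flight to austin now")
wer = word_error_rate("book a flight to boston", "book flight to austin now")
print(counts, wer)
```

In the example, one word is dropped, one is misrecognized, and one spurious word is appended, giving three errors against a five-word reference.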
Speech recognition engines therefore produce flawed results. Voting mechanisms have been constructed to tie the outputs of many speech recognition engines together and adopt, word by word, the result on which the majority agrees. In the past it made no sense to use only two speech recognition engines together in a voting scheme, because any disagreement between them would produce a 1:1 split decision with no majority.
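The word-by-word voting described above, and the reason it breaks down with two engines, can be sketched as follows. This is a simplified illustration and not any particular system's method: it assumes the engine outputs are already aligned word for word and have equal length, which a practical combiner would have to arrange first. The function names and sample transcripts are hypothetical.

```python
from collections import Counter

def vote(candidates):
    """Return the word chosen by a strict majority, or None if there is none."""
    word, count = Counter(candidates).most_common(1)[0]
    return word if count > len(candidates) / 2 else None

def combine(engine_outputs):
    """Vote position by position over pre-aligned, equal-length word lists."""
    return [vote(words) for words in zip(*engine_outputs)]

# Hypothetical outputs from three engines for the same utterance.
engine_a = ["flight", "four", "fifty"]
engine_b = ["flight", "for", "fifty"]
engine_c = ["fright", "four", "fifty"]

three_way = combine([engine_a, engine_b, engine_c])
two_way = combine([engine_a, engine_b])
print(three_way)  # every position has a 2:1 or 3:0 majority
print(two_way)    # the second position is a 1:1 split, so no word is chosen
```

With three engines each disputed position is settled 2:1, whereas with two engines the disputed position yields `None` because neither candidate has a majority.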
Some telephony applications use speech recognition engines, e.g., interactive voice response systems like those that offer travel information for an airline. Such a system provides flight information by asking the caller to speak a flight number, or departure and arrival information. Other applications include voice-enabled banking and dialing by voice, e.g., as offered by cellular phone service providers.
Conventional multi-engine speech recognition systems combine three or more automatic speech recognition (ASR) engines in a voting arrangement. Combining the outputs of only two ASR engines to improve accuracy and decrease the error rate has not been taught.
Prior work on the combination of automated speech recognition engines depends on the use of three or more ASR engines, never only two. Although the recognition accuracy improves as the number of ASR engines used in the combination increases, using more engines comes at a cost in resources, size, expense, and complexity.
Automatic speech recognition (ASR) engines are far from perfect. The variability in speakers, noise, and delivery rates and styles can confuse the computer processing. The outputs provided by ASR engines are therefore never very certain; there is always an element of guessing involved. The degree of guessing often varies with the word or sound being recognized. Batteries of tests can be conducted on particular ASR engines using various standardized word sets to help characterize and document the recognition accuracy. Such data can be distilled into a “confusion matrix” in which the actual inputs are tabulated against the hypothesized outputs.
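A minimal sketch of how such test results might be distilled into a confusion matrix is given below. The nested-dictionary layout, the function names, and the test counts are all hypothetical and serve only to illustrate tabulating actual inputs against hypothesized outputs.

```python
def build_confusion_matrix(trials):
    """Tabulate (actual, hypothesized) pairs: matrix[actual][hypothesized] = count."""
    matrix = {}
    for actual, hypothesized in trials:
        row = matrix.setdefault(actual, {})
        row[hypothesized] = row.get(hypothesized, 0) + 1
    return matrix

def output_probability(matrix, actual, hypothesized):
    """Estimate the relative frequency of an output given the actual input."""
    row = matrix.get(actual, {})
    total = sum(row.values())
    return row.get(hypothesized, 0) / total if total else 0.0

# Hypothetical test results for one engine on a two-word test set.
trials = (
    [("five", "five")] * 8 + [("five", "nine")] * 2
    + [("nine", "nine")] * 9 + [("nine", "five")] * 1
)

matrix = build_confusion_matrix(trials)
print(matrix)
print(output_probability(matrix, "five", "nine"))
```

Each row of the matrix corresponds to a word actually spoken, and each entry in the row counts how often the engine hypothesized a given word for it. Dividing an entry by its row total gives an estimate of how often that input is recognized as that output.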
Confusion matrices, as used in speech recognition systems, are tables that statistically tabulate the word or sound recognized against the word or sound actually input. Confusion matrices have been used in prior art devices for correcting the output of optical character recognition systems, e.g., as described by Randy Goldberg in U.S. Pat. No. 6,205,261; and in handwritten character recognition, as described by Fujisaki, et al., in U.S. Pat. No. 5,963,666. In U.S. Pat. No. 6,400,805, Brown, et al., describe using a confusion matrix to limit alphanumeric grammar searches.