Speech recognition systems have received increased attention lately and are becoming popular. Speech recognition technology is used in a growing range of areas, from security systems and automated response systems to a variety of electronic devices such as computers.
Speech recognition systems are also used as a command input device for controlling back-end applications such as car navigation systems or home entertainment systems. Accuracy of the speech recognition is very important especially when it is used to control a back-end application, since the user generally will not be given a chance to correct a mis-recognized word before the back-end application proceeds to take the unintended action. If the speech recognition is inaccurate, the speaker will have to re-enter the speech command after the unintended action has been taken by the back-end application.
Conventional speech recognition systems typically attempt to recognize speech by processing the input speech signal using a single speech recognizer. This speech recognizer could be a grammar-based speech recognizer, a statistical speech recognizer, or any other type of speech recognizer known in the art. However, these different types of conventional speech recognizers each have their own strengths and weaknesses, and none is capable of accurate recognition over a broad range of speech.
Grammar-based speech recognizers compare templates of word sequences, defined by specified grammar rules and pronunciation dictionaries, with the input speech signal, and select the word sequence with the highest confidence level as the output result. These grammar rules are created manually with knowledge of how the user of the speech recognizer would be expected to speak. Grammar-based speech recognizers tend to be more accurate than statistical speech recognizers in speech recognition tasks involving relatively little variability in the way that a user would express himself. However, grammar-based speech recognizers are restrictive in what they can recognize, since they typically utilize a small number of word sequences for comparison with the input speech.
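The template-matching idea described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the text: the grammar is a hypothetical list of slots with allowed alternatives, the input is an already transcribed (possibly noisy) string rather than an acoustic signal, and the confidence level is approximated by a plain string-similarity ratio. A real recognizer would score templates against the speech signal using pronunciation dictionaries.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical grammar: each slot lists the words allowed at that position.
GRAMMAR = [["play", "stop"], ["music", "radio"]]


def expand(grammar):
    """Expand the slot grammar into every word sequence it permits."""
    return [" ".join(words) for words in product(*grammar)]


def recognize(observed, grammar=GRAMMAR):
    """Return the template most similar to the observed text.

    String similarity stands in for the recognizer's confidence level;
    this is an assumption made only for illustration.
    """
    scored = [
        (SequenceMatcher(None, observed, template).ratio(), template)
        for template in expand(grammar)
    ]
    confidence, best = max(scored)
    return best, confidence


if __name__ == "__main__":
    print(recognize("play musik"))
```

Because only four word sequences exist in this toy grammar, any input is forced onto one of them, which mirrors the restrictiveness noted above: an utterance outside the grammar is still mapped to the closest template, only with a lower score.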
Statistical speech recognizers use statistical language models in place of the grammar rules of the grammar-based speech recognizers in comparing the input speech. Statistical language models are created using a corpus of transcribed examples of actual speech. From this corpus, the statistical language models infer the probability of the following word, given the previous word or words. These words may be replaced by any token, such as a start-of-utterance token, or by a class such as “Business-Name.” Each class is then associated with a list of words and phrases, which may be created manually or generated automatically from another corpus. Statistical speech recognizers perform much better than grammar-based recognizers when the variability of the input speech is high or unknown. They are better at recognizing non-grammatical utterances and, to some extent, unpredicted out-of-context speech. However, the success of statistical speech recognizers depends on the quality and quantity of the data in the training corpus. In addition, statistical speech recognizers typically do not incorporate any understanding of the meaning of the words that they recognize.
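As a rough illustration of how a statistical language model might infer the probability of the following word from a corpus, the sketch below builds a bigram model, the simplest case in which only one previous word is considered. The token name `<s>`, the class list, the tiny corpus in the usage example, and all function names are hypothetical and chosen only for this example; no smoothing is applied, and class phrases are matched by plain substring replacement.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-utterance token

# Hypothetical class list: each class maps to its words and phrases.
CLASSES = {"Business-Name": ["joe's pizza", "city bank"]}


def replace_classes(utterance, classes=CLASSES):
    """Lowercase the utterance and replace known phrases with their class."""
    text = utterance.lower()
    for name, phrases in classes.items():
        for phrase in phrases:
            text = text.replace(phrase, name)
    return text.split()


def train_bigram(corpus):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for utterance in corpus:
        tokens = [START] + replace_classes(utterance)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def probability(counts, prev, nxt):
    """Relative-frequency estimate of P(nxt | prev); 0.0 if prev is unseen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


if __name__ == "__main__":
    corpus = ["call Joe's Pizza", "call City Bank", "find City Bank"]
    model = train_bigram(corpus)
    print(probability(model, START, "call"))
    print(probability(model, "call", "Business-Name"))
```

With so little data, any word pair absent from the corpus receives probability zero, which illustrates why the quality and quantity of the training corpus matter so much for this kind of recognizer.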
Therefore, there is a need for a speech recognition system that takes advantage of the strengths, while complementing the weaknesses, of different types of conventional speech recognizers. There is also a need for a speech recognition system that uses multiple speech recognizers having different strengths and weaknesses. There is also a need for a speech recognition system that can select the most accurate one of the speech recognition results output from the multiple speech recognizers.