This invention relates to adapting a speech recognition system to deal efficiently with a plurality of speech recognition contexts.
Speech recognition systems are well known in the art. Examples include the IBM Tangora [10] and Dragon Systems Dragon 30k dictation systems. Typically, they are single-user and speaker-dependent. This requires each speaker to train the speech recognizer with his or her voice patterns during a process called "enrollment". The systems then maintain a profile for each speaker, who must identify himself or herself to the system in future recognition sessions. Typically, speakers enroll via a local microphone in a low-noise environment, speaking to the single machine on which the recognizer is resident. During the course of enrollment, the speaker is required to read a lengthy set of transcripts so that the system can adjust itself to the peculiarities of that particular speaker.
Discrete dictation systems, such as the two mentioned above, require speakers to form each word in a halting and unnatural manner, pausing, between, each, word. This allows the speech recognizer to identify the voice pattern associated with each individual word by using the preceding and following silences to bound the words. The speech recognizer will typically be trained for a single application, such as Office Correspondence in the case of the IBM Tangora System, and will operate on a single machine.
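The use of silences to bound words can be illustrated with a minimal sketch. It assumes the speech signal has already been reduced to one energy value per frame; the function name `segment_words`, the energy threshold, and the minimum pause length are hypothetical choices made for illustration and are not taken from the Tangora or Dragon systems.

```python
from typing import List, Tuple


def segment_words(
    energies: List[float],
    threshold: float = 0.1,   # assumed: frames below this energy count as silence
    min_silence: int = 3,     # assumed: silent frames needed to end a word
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, for each word
    that is bounded by a sufficiently long silence."""
    segments: List[Tuple[int, int]] = []
    start = None       # first frame of the word currently being tracked
    silent_run = 0     # consecutive silent frames seen inside that word

    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i          # a new word begins here
            silent_run = 0         # a short dip inside a word is forgiven
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # The pause is long enough: close the word at the
                # last frame that contained speech.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0

    if start is not None:
        # The input ended while a word was still open.
        segments.append((start, len(energies) - silent_run))
    return segments


frames = [0.0, 0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.9, 0.0, 0.0]
print(segment_words(frames))
```

Under these assumptions, each returned frame range would be passed to the recognizer as one isolated word. A pause shorter than `min_silence` frames does not split a word, which is why a discrete dictation system depends on speakers pausing deliberately between words.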
Multi-user environments with speaker-dependent speech recognizers require each speaker to undertake tedious training of the recognizer for it to understand his or her voice patterns. While it has been suggested that the templates which store the voice patterns may be located in a common database, wherein the system determines which template to use for a given recognition from the speaker's telephone extension, each speaker must nonetheless train the system before use. A user new to the system calling from an outside telephone line will find this procedure unacceptable. In addition, a successful telephonic speech recognizer will be capable of rapid context switches, so that speech related to various subject areas can be accurately recognized. For example, a system trained for general Office Correspondence will perform poorly when presented with strings of digits.
The Sphinx system, first described in the Ph.D. dissertation of Kai-Fu Lee [1], represented a major advance over previous speaker-dependent recognition systems in that it was both speaker-independent and capable of recognizing words from a continuous stream of conversational speech. This system required no individualized speaker enrollment prior to effective use. By contrast, some speaker-dependent systems require speakers to be reenrolled every four to six weeks, and require users to carry a personalized plug-in cartridge to be understood by the system. In addition, continuous speech recognition requires no pauses between words. The Sphinx system thus represents a much more user-friendly approach for the casual user of a speech recognition system. This will be an essential feature of telephonic speech recognition systems, since the users will have no training in how to adjust their speech for the benefit of the recognizer.
A speech recognition system must also offer real-time operation with a given modest vocabulary. However, the Sphinx system still had some of the disadvantages of the prior speaker-dependent recognizers, in that it was programmed to operate on a single machine in a low-noise environment, using a microphone and a relatively constrained vocabulary. It was not designed for multi-user support, at least with respect to different user locations and multiple vocabularies for recognition.
This invention overcomes many of the disadvantages of the prior art.