1. Technical Field
This invention relates to a speech recognition and voice response system capable of dealing with a plurality of speech recognition contexts in an efficient manner.
2. Background Art
This patent application is related to U.S. patent application Ser. No. 07/947,634 entitled "Instantaneous Context Switching for Speech Recognition Systems," by V. M. Stanford, et al., assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to U.S. patent application Ser. No. 07/968,097 entitled "Method for Word Spotting in Continuous Speech," by J. H. Garman, et al., (now U.S. Pat. No. 5,513,298 issued Apr. 30, 1996) assigned to the IBM Corporation and incorporated herein by reference.
Speech recognition systems are well known in the art. Examples include the IBM Tangora [10] and Dragon Systems Dragon 30k dictation systems. Typically, they are single-user and speaker-dependent. This requires each speaker to train the speech recognizer with his or her voice patterns, during a process called "enrollment." The systems then maintain a profile for each speaker, who must identify himself or herself to the system in future recognition sessions. Typically speakers enroll via a local microphone in a low noise environment, speaking to the single machine on which the recognizer is resident. During the course of enrollment, the speaker will be required to read a lengthy set of transcripts, so that the system can adjust itself to the peculiarities of each particular speaker.
Discrete dictation systems, such as the two mentioned above, require speakers to form each word in a halting and unnatural manner, pausing, between, each, word. This allows the speech recognizer to identify the voice pattern associated with each individual word by using the preceding and following silences to bound the words. The speech recognizer will typically have a single application for which it is trained, operating on the single machine, such as Office Correspondence in the case of the IBM Tangora System.
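The use of silences to bound words can be illustrated with a minimal sketch. It is not taken from the Tangora or Dragon systems; it assumes the speech has already been reduced to one energy value per frame, and the function name, the threshold and the minimum silence length are all hypothetical choices made for illustration only.

```python
def segment_words(energies, threshold=0.1, min_silence=3):
    """Return (start, end) frame ranges of speech bounded by silence.

    energies:    per-frame energy values (assumed input representation)
    threshold:   frames at or below this value count as silence (assumed)
    min_silence: consecutive silent frames needed to close a word (assumed)
    The end index of each range is exclusive.
    """
    segments = []
    start = None      # first frame of the word currently being tracked
    silent_run = 0    # consecutive silent frames seen inside that word
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i          # a word begins after silence
            silent_run = 0         # a short dip inside a word is forgiven
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # Enough silence: the word ended where the silence began.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Input ended while a word was still open.
        segments.append((start, len(energies) - silent_run))
    return segments


frames = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
print(segment_words(frames))
```

A scheme of this kind only works when the speaker leaves a pause between words, which is why discrete dictation imposes the halting manner of speech described above.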
Multi-user environments with speaker-dependent speech recognizers require each speaker to undertake tedious training of the recognizer for it to understand his or her voice patterns. While it has been suggested that the templates which store the voice patterns may be located in a common data base, wherein the system knows which template to use for speech recognition from the speaker's telephone extension, each speaker must nonetheless train the system before use. A user new to the system calling from an outside telephone line will find this procedure to be unacceptable. Also, a successful telephonic speech recognizer must be capable of rapid context switches to allow speech related to various subject areas to be accurately recognized. For example, a system trained for general Office Correspondence will perform poorly when presented with strings of digits.
The Sphinx system, first described in the Ph.D. Dissertation of Kai-Fu Lee, represented a major advance over previous speaker-dependent recognition systems in that it was both speaker-independent and capable of recognizing words from a continuous stream of conversational speech. This system required no individualized speaker enrollment prior to effective use. By contrast, some speaker-dependent systems require speakers to be re-enrolled every four to six weeks, and require users to carry a personalized plug-in cartridge to be understood by the system. Also, with continuous speech recognition no pauses between words are required; thus the Sphinx system represents a much more user-friendly approach for the casual user of a speech recognition system. This will be an essential feature of telephonic speech recognition systems, since the users will have no training in how to adjust their speech for the benefit of the recognizer.
A speech recognition system must also offer real time operation with a given modest vocabulary. However, the Sphinx System still had some of the disadvantages of the prior speaker-dependent recognizers in that it was programmed to operate on a single machine in a low noise environment using a microphone and a relatively constrained vocabulary. It was not designed for multi-user support, at least with respect to different locations and multiple vocabularies for recognition.
The above cited V. M. Stanford, et al. patent application overcomes many of the disadvantages of the prior art. The speech recognition system is divided into a number of modules including a front end which converts the analog or digital speech data into a set of Cepstrum coefficients and vector quantization values which represent the speech. A back end uses the vector quantization values and recognizes the words according to phoneme models and word pair grammars as well as the context in which the speech was made. By dividing the vocabulary into a series of contexts, that is, situations in which certain words are anticipated by the system, a much larger vocabulary can be accommodated with minimum memory. As the user progresses through the speech recognition task, contexts are rapidly switched from a common data base. The system also includes an interface between a plurality of user applications also in the computer network.
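Two of the ideas in this paragraph, vector quantization and switching among contexts held in a common store, can be sketched briefly. The sketch below is illustrative only and is not the implementation of the cited application: the nearest-codeword rule with squared Euclidean distance is the textbook form of vector quantization, and the `ContextStore` class, its method names, the toy codebook and the example vocabularies are all hypothetical.

```python
def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to the frame.

    Distance is squared Euclidean, the usual textbook choice (an
    assumption here; the source does not state the distance measure).
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))


class ContextStore:
    """Hypothetical common store holding one small vocabulary per context."""

    def __init__(self, contexts):
        self.contexts = contexts   # context name -> set of anticipated words
        self.active = None

    def switch(self, name):
        """Make the named context the active one."""
        if name not in self.contexts:
            raise KeyError(name)
        self.active = name

    def vocabulary(self):
        """Return the words anticipated in the active context."""
        return self.contexts[self.active]


# Toy two-dimensional codebook standing in for cepstral feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
print(quantize((0.9, 1.2), codebook))

store = ContextStore({
    "digits": {"zero", "one", "two"},
    "correspondence": {"dear", "sincerely"},
})
store.switch("digits")
print("one" in store.vocabulary())
```

Only the active context's vocabulary needs to be considered at any moment, which is the sense in which a large overall vocabulary can be served with little memory.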
The system also includes training modules and task build modules, which train the system and build the word pair grammars for the contexts, respectively.
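As a rough illustration of what a word pair grammar is, the following sketch derives one from a handful of sample sentences: for each word, it records the set of words allowed to follow it. This is a generic construction offered under stated assumptions, not the task build module of the cited application; the function names, the `<s>` and `</s>` boundary markers and the sample sentences are hypothetical.

```python
from collections import defaultdict


def build_word_pair_grammar(sentences):
    """Map each word to the set of words seen immediately after it.

    '<s>' and '</s>' are assumed markers for sentence start and end.
    """
    grammar = defaultdict(set)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for first, second in zip(words, words[1:]):
            grammar[first].add(second)
    return dict(grammar)


def is_allowed(grammar, sentence):
    """Check that every adjacent word pair in the sentence is in the grammar."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return all(second in grammar.get(first, ())
               for first, second in zip(words, words[1:]))


# Hypothetical sample sentences for a small telephone context.
grammar = build_word_pair_grammar([
    "call extension five",
    "call extension nine",
    "dial nine",
])
print(is_allowed(grammar, "call extension nine"))
print(is_allowed(grammar, "dial extension five"))
```

Because a recognizer constrained this way only has to consider the permitted successors of the previous word, each context keeps its search space small.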
The ideal man-machine or microprocessor interface allows the user to talk naturally back and forth with the machine. This natural dialogue is important for virtually all microprocessor-based applications. Interactive, speech-driven dialogue is key to making many applications human centric. For example:
Help systems for computers, operating systems and consumer goods.
Interactive educational multimedia programs.
Executive information systems.
Portable speech-driven electronic mail and voice mail systems.
Interactive translation software.
Speech-driven kiosks.
A natural, interactive, speech-driven user interface presents a number of technical problems:
Frequently users slur their words or embed unnecessary "ums, ahs, and pauses" in their speech.
Users frequently forget what words, phrases and questions the computer understands and thus create phrases like, "What is the latest news, or was I um, ah, supposed to say what is the latest research, or is it the most recent research on IBM..."
Users sometimes ignore prompts.
Users forget predefined command vocabularies.
Users resist training systems. The ideal solution must be "walk-up-and-use."
Text-to-speech subsystems continue to sound unnatural. Computers reading ASCII or EBCDIC text files continue to sound inebriated, and are frequently unintelligible to non-native speakers.
Users resist speaking slowly or in an isolated word mode.