1. Technical Field
The system and method described herein relate to conversational spoken dialogue systems.
2. Discussion of the Related Art
The main components of a conventional spoken dialogue system are a speech recognizer and an understanding module. During a single operation of the system (i.e., one “recognition pass”), the speech recognizer generates a hypothesis of the words in a spoken utterance, and the understanding module interprets the intent of the spoken query and performs an associated action (such as data retrieval) based upon this interpretation. Mis-recognition by the speech recognizer can therefore cause the understanding module to perform an undesirable or inappropriate action. This mis-recognition can be frustrating to the user, and is a limitation of many conventional spoken dialogue systems.
Speech recognizers typically use a language model to capture the regularities of the language in a specific domain (e.g., airline reservations, medical information, legal issues). There are two basic models for such speech recognition: a grammar-based model and a statistical language model (SLM). In a grammar-based model, specific rules are implemented such that a speech recognizer may recognize phraseology common to the domain. While this generally yields good recognition accuracy, such a model is not robust with regard to the disfluencies associated with speech. An utterance may become unrecognizable to the system when phrased in a manner for which the system is not pre-programmed. Moreover, the inclusion of a greater number of grammar rules quickly reduces the speed of the system, often rendering the system impractical to use. SLMs are typically derived from a large database and are implemented in a fast search algorithm. Yet, while SLMs may be more adept at handling speech disfluencies, they are often less accurate than grammar-based models. Thus, one is left with a trade-off between speed and accuracy.
To improve accuracy without significantly sacrificing speed, some spoken dialogue systems utilize an acoustic confidence scoring mechanism that indicates a confidence level in the hypothesis generated by the speech recognizer. Based on the confidence level, an error correction mechanism may be employed (i.e., if the confidence level is sufficiently low, the error correction mechanism is triggered). One such mechanism may ask the user to repeat the query, rather than risk the performance of an undesirable or inappropriate action by the system.
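The confidence-gated behavior described above may be illustrated by the following minimal sketch. It is offered only as an example under stated assumptions and does not represent any particular system: the names Hypothesis and handle_hypothesis, the score range of 0 to 1, and the threshold value of 0.6 are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold; a practical system would tune this value empirically.
CONFIDENCE_THRESHOLD = 0.6


@dataclass
class Hypothesis:
    """Output of one recognition pass (illustrative structure only)."""
    text: str
    confidence: float  # assumed acoustic confidence score in the range [0, 1]


def handle_hypothesis(hyp: Hypothesis,
                      threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Decide what the dialogue system does with a single hypothesis."""
    if hyp.confidence < threshold:
        # Error correction: ask the user to repeat the query instead of
        # acting on a hypothesis that is likely to be wrong.
        return "REPROMPT: Sorry, could you please repeat that?"
    # Confidence is sufficient: hand the hypothesis to the understanding module.
    return f"EXECUTE: {hyp.text}"
```

In this sketch, a hypothesis scoring 0.9 would be passed on for execution, while one scoring 0.3 would trigger the reprompt. The utterance is still evaluated only once, and nothing is done to improve a low score.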
Most systems that employ a confidence scoring mechanism do not include any additional mechanism that might improve the confidence score. Instead, an utterance is processed through the system once, a single confidence score is generated, and no effort is made to reevaluate the utterance or hypothesis to improve the score. Furthermore, most systems do not return intelligent feedback to the user indicating which part of the spoken request was unclear.
In one system, a series of confidence scores is derived from the same recognition pass. These scores are generated to identify unreliable words in an utterance, thereby allowing the system to query the user for specific information. However, this system does not implement any mechanism that operates to improve any of the confidence scores if they are low. T. J. Hazen et al., “Recognition Confidence Scoring for Use in Speech Understanding Systems,” Proc. ISCA ASR2000 Tutorial and Research Workshop, Paris (2000).
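The general idea of using word-level confidence scores to query the user about specific words may be sketched as follows. This is an illustrative example only and is not the method of the cited reference: the function names, the representation of each word as a (word, score) pair, and the threshold of 0.5 are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Each entry pairs a recognized word with an assumed confidence in [0, 1].
ScoredWord = Tuple[str, float]


def unreliable_words(scored_words: List[ScoredWord],
                     threshold: float = 0.5) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [word for word, score in scored_words if score < threshold]


def clarification_prompt(scored_words: List[ScoredWord],
                         threshold: float = 0.5) -> Optional[str]:
    """Build a targeted query for the doubtful words, or None if all are reliable."""
    doubtful = unreliable_words(scored_words, threshold)
    if not doubtful:
        return None
    quoted = ", ".join(f'"{word}"' for word in doubtful)
    return f"I was not sure about {quoted}. Could you repeat that part?"
```

For example, given the scored words ("fly", 0.9), ("to", 0.95), and ("Austin", 0.3), the sketch flags only "Austin" and asks the user about that word alone. As in the system discussed above, the low scores themselves are left unchanged.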
Accordingly, there is a need for a system and method for speech recognition in a spoken dialogue system that obviates, for practical purposes, the above-mentioned limitations. Such a system and method may offer high-quality speech recognition by reexamining utterances that return a low confidence score when processed through a speech recognizer.