1. Field of the Invention
The present invention relates generally to a system and method for spoken dialog systems.
2. Background Discussion
Automatic spoken dialog systems are often very complex. They may consist of hundreds of dialog states involving extensive dialog structures, have system integration functionality that communicates with backend databases or devices, support multiple input and output modalities, and sometimes handle calls of more than 20 minutes in duration. In order to keep a caller engaged in such environments, the use of human-like speech processing is critical, e.g., the incorporation of various degrees of spoken language understanding, mixed-initiative handling, and dynamic response generation. One type of spoken language understanding, called natural language understanding, was first introduced on a large scale to automated spoken dialog systems in the form of call classifiers. Here, the caller was asked a general question at the top of the call, such as, “Briefly tell me what you're calling about today.” The caller's utterance was transcribed using a speech recognizer, and the caller was routed to a human agent based on a classification of the utterance produced by a semantic classifier. The human agent then interacted with the caller by providing services including, e.g., technical problem solving, billing support, or order processing.
Typically, spoken dialog systems are built using semantic classifiers for most or all of the dialog contexts, both for natural language and for directed dialog inputs. A semantic classifier is a program that provides a mapping between the utterances a speech recognizer produces and one or more predefined semantic classes which represent different categories of meaning. Semantic classifiers can be rule-based, i.e., manually generated as a set of rules that provide said mapping, or statistical, i.e., based on a statistical classification model whose parameters are trained from data, namely transcribed training utterances (transcriptions) and their respective semantic meanings (annotations). There can also be combinations of rule-based and statistical classifiers. Statistical semantic classifiers are today used almost exclusively for natural language input, while rule-based classifiers are typically used for directed dialog input.
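The two kinds of semantic classifier described above can be illustrated with a minimal sketch. It is not taken from any particular system: the semantic classes (billing, tech_support, orders), the rule patterns, the training utterances, and the names classify_rule_based and NaiveBayesClassifier are all hypothetical, and multinomial naive Bayes is used only as one simple example of a statistical classification model.

```python
import math
import re
from collections import Counter, defaultdict

# Rule-based classifier: hand-written patterns mapped to semantic classes.
# Patterns and class names are hypothetical, chosen for illustration only.
RULES = [
    (re.compile(r"\b(bill|invoice|charge[sd]?)\b"), "billing"),
    (re.compile(r"\b(internet|modem|connection)\b"), "tech_support"),
    (re.compile(r"\b(order|buy|purchase)\b"), "orders"),
]


def classify_rule_based(utterance, default="no_match"):
    """Return the class of the first rule whose pattern matches."""
    text = utterance.lower()
    for pattern, semantic_class in RULES:
        if pattern.search(text):
            return semantic_class
    return default


class NaiveBayesClassifier:
    """Statistical classifier: multinomial naive Bayes, add-one smoothing."""

    def fit(self, transcriptions, annotations):
        # Parameters are estimated from transcriptions and their annotations.
        self.class_counts = Counter(annotations)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(transcriptions, annotations):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, utterance):
        words = utterance.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: (transcription, annotation) pairs.
TRAIN = [
    ("i have a question about my bill", "billing"),
    ("there is a wrong charge on my invoice", "billing"),
    ("my internet is not working", "tech_support"),
    ("the modem keeps losing connection", "tech_support"),
    ("i want to order a new phone", "orders"),
    ("i would like to buy another service", "orders"),
]
model = NaiveBayesClassifier().fit(*zip(*TRAIN))

print(classify_rule_based("I was charged twice on my bill"))
print(model.predict("my internet connection is down"))
```

In this sketch the rule-based classifier returns a fallback class when no rule matches, whereas the statistical classifier always returns its most probable class. A production system would typically also need a confidence threshold or rejection class, which is omitted here.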
Once a spoken dialog system goes into production with the set of classifiers designed for the application, the system's performance may suffer due to a variety of reasons, e.g.:
    1. semantic classifiers were built with no collected data (rather, rules were created out of the designer's expectation of what people would say in this specific recognition context),
    2. semantic classifiers were built to span several contexts while callers actually behave differently given the context,
    3. semantic classifiers were built on small amounts of data,
    4. semantic classifiers were built on old or unrepresentative data.
Spoken dialog systems are often designed to emulate a human agent's role in the complexity of the services offered as well as in the length of interaction. At the same time, as dialog systems improve, so too do the expectations of callers. Several characteristics of modern dialog system design encourage callers to behave as if they were interacting with a human agent. Such characteristics include open-ended questions during the conversation and global commands such as “help” and “repeat” at every point in the dialog. This design encourages callers to say things that are not explicitly prompted by the context prompts in the dialog system. Furthermore, directed dialog prompts in which callers are asked to choose an item from a list often unintentionally elicit out-of-scope utterances from callers by offering choices that may be incomplete, too vague, or too specific.
Also, classifiers used in different parts of a spoken dialog system may perform well on average but noticeably worse in some individual contexts.
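The gap between average and per-context performance can be made concrete with a small sketch. The context names (main_menu, confirm_address), the record layout, and the counts below are invented purely for illustration and do not come from any measured system.

```python
from collections import defaultdict


def accuracy_by_context(records):
    """Compute overall and per-context accuracy.

    records: iterable of (context, predicted_class, true_class) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for context, predicted, truth in records:
        total[context] += 1
        correct[context] += int(predicted == truth)
    overall = sum(correct.values()) / sum(total.values())
    per_context = {c: correct[c] / total[c] for c in total}
    return overall, per_context


# Made-up toy log: a high-traffic context that is classified well and a
# low-traffic context that is classified poorly.
RECORDS = (
    [("main_menu", "billing", "billing")] * 95
    + [("main_menu", "billing", "orders")] * 5
    + [("confirm_address", "yes", "yes")] * 6
    + [("confirm_address", "yes", "no")] * 4
)

overall, per_context = accuracy_by_context(RECORDS)
print(f"overall: {overall:.3f}")
for context, accuracy in per_context.items():
    print(f"{context}: {accuracy:.3f}")
```

With these invented counts the overall accuracy is roughly 92 percent, while the low-traffic context reaches only 60 percent. A weakness of this size is easy to miss when only the average is inspected.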