1. Field of the Invention
The present invention relates generally to a system and method for spoken dialog systems.
2. Background Discussion
Automatic spoken dialog systems are often very complex. They may consist of hundreds of dialog states involving extensive dialog structures, have system integration functionality that communicates with backend databases or devices, support multiple input and output modalities, and can sometimes involve calls lasting more than 20 minutes. In order to keep a caller engaged in such environments, the use of human-like speech processing is critical, e.g., the incorporation of various degrees of spoken language understanding, mixed-initiative handling, and dynamic response generation. One type of spoken language understanding, called natural language understanding, was first introduced on a large scale to automated spoken dialog systems in the form of call classifiers. Here, the caller was asked a general question at the top of the call, such as, “Briefly tell me what you're calling about today.” The caller's utterance was transcribed using a speech recognizer, and the caller was routed to a human agent based on the class of the utterance produced by a semantic classifier. The human agent then interacted with the caller, providing services including, e.g., technical problem solving, billing support, or order processing. Other interactions may not require free-form natural language input from the caller, but only the speaking of simple commands as instructed by prompts, such as “yes” or “no”; this is typically referred to as directed dialog input.
Typically, spoken dialog systems are built using semantic classifiers for most or all of the dialog contexts, both for natural language and for directed dialog inputs. A semantic classifier is a program that provides a mapping between the utterances a speech recognizer produces and one or more predefined semantic classes which represent different categories of meaning. Semantic classifiers can be rule-based, i.e., manually generated as a set of rules that provide said mapping, or statistical, i.e., based on a statistical classification model whose parameters are trained from data, namely transcribed training utterances (transcriptions) and their respective semantic meanings (annotations). There can also be combinations of rule-based and statistical classifiers. Statistical semantic classifiers are today used almost exclusively for natural language input, while rule-based classifiers are typically used for directed dialog input.
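The distinction between rule-based and statistical semantic classifiers can be illustrated with a minimal Python sketch. It is not taken from the described systems: the class names (YES, NO, NO_MATCH, TECH_SUPPORT, BILLING), the rules, the training utterances, and the choice of a multinomial naive Bayes model are all assumptions made purely for illustration. The rule-based function maps a directed dialog utterance to a class through hand-written patterns, while the statistical classifier estimates its parameters from transcriptions and their annotations.

```python
import math
import re
from collections import Counter, defaultdict

# Hand-written rules for a hypothetical yes/no directed dialog context.
RULES = [
    (re.compile(r"\b(yes|yeah|yep|correct)\b"), "YES"),
    (re.compile(r"\b(no|nope|incorrect)\b"), "NO"),
]


def rule_based_classify(utterance, default="NO_MATCH"):
    """Return the class of the first rule that matches the utterance."""
    text = utterance.lower()
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return default


class NaiveBayesClassifier:
    """Multinomial naive Bayes over word counts with add-one smoothing
    (an illustrative model choice, not one named by the source)."""

    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, transcriptions, annotations):
        """Estimate parameters from transcribed utterances and their classes."""
        for text, label in zip(transcriptions, annotations):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, utterance):
        """Return the class with the highest log-probability score."""
        words = utterance.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: transcriptions paired with annotations.
transcriptions = [
    "my internet is not working",
    "i have a question about my bill",
    "the internet connection is down",
    "i was charged twice on my bill",
]
annotations = ["TECH_SUPPORT", "BILLING", "TECH_SUPPORT", "BILLING"]

statistical = NaiveBayesClassifier()
statistical.train(transcriptions, annotations)

print(rule_based_classify("yes please"))
print(statistical.classify("question about my bill"))
```

In this sketch the rule-based classifier needs no data but covers only what its author anticipated, whereas the statistical classifier is only as good as the amount and representativeness of its training data.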
Once a spoken dialog system goes into production with the set of classifiers designed for the application, the system's performance may suffer due to a variety of reasons, e.g.:
0. semantic classifiers were built with no data (rules created out of the designer's expectation of what people would say in this specific recognition context),
1. semantic classifiers were built to span over several contexts while callers actually behave specifically to some of the contexts,
2. semantic classifiers were built on small amounts of data,
3. semantic classifiers were built on old or unrepresentative data.
Spoken dialog systems are often designed to emulate a human agent's role in the complexity of the services offered as well as in the length of interaction. At the same time, as dialog systems improve, so too do the expectations of callers. Several characteristics of modern dialog system design encourage callers to behave as if they were interacting with a human agent. Such characteristics include open-ended questions during the conversation and global commands such as “help” and “repeat” at every point in the dialog. This design encourages callers to say things that are not explicitly prompted by the context prompts in the dialog system. Furthermore, explicit directed dialog prompts in which callers are asked to choose an item from a list often unintentionally elicit out-of-scope utterances from callers by offering choices that may be incomplete, too vague, or too specific.
Caller behavior, however, is often unpredictable to an interaction designer. Even listening to hundreds of calls will hardly provide a broad understanding of what exactly is going on at every point in a dialog system that receives millions of calls every month. It is barely possible to satisfy callers' expectations with the still-common approach of using static, hand-crafted, rule-based semantic classifiers.