1. Field of the Invention
The present invention relates generally to a system and method for spoken dialog systems.
2. Background Discussion
Automatic spoken dialog systems are often very complex. They may consist of hundreds of dialog states involving extensive dialog structures, have system integration functionality that communicates with backend databases or devices, support multiple input and output modalities, and can sometimes handle calls lasting more than 20 minutes. In order to keep a caller engaged in such environments, the use of human-like speech processing is critical, e.g., the incorporation of various degrees of spoken language understanding, mixed-initiative handling, and dynamic response generation. One type of spoken language understanding, called natural language understanding, was first introduced to automated spoken dialog systems on a large scale in the form of call classifiers. Here, the caller was asked a general question at the top of the call, such as, “Briefly tell me what you're calling about today.” The caller's utterance was transcribed using a speech recognizer, and the caller was routed to a human agent based on a class of the utterance produced by a semantic classifier. The human agent then interacted with the caller, providing services including, e.g., technical problem solving, billing support, or order processing. Other interactions may not require free-form natural language input from the caller, but only the speaking of simple commands as instructed by prompts, such as “yes” or “no”; this is typically referred to as directed dialog input.
Typically, spoken dialog systems are built using semantic classifiers for most or all of the dialog contexts, both for natural language and for directed dialog inputs. A semantic classifier is a program that provides a mapping between the utterances a speech recognizer produces and one or more predefined semantic classes which represent different categories of meaning. Semantic classifiers can be rule-based, i.e., manually generated as a set of rules that provide said mapping, or statistical, i.e., based on a statistical classification model whose parameters are trained from data, namely transcribed training utterances (transcriptions) and their respective semantic meanings (annotations). There can also be combinations of rule-based and statistical classifiers. Statistical semantic classifiers are today used almost exclusively for natural language input, while rule-based classifiers are typically used for directed dialog input.
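By way of illustration only, the two kinds of semantic classifier described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the classifiers of any particular system: the rule set, the class names, and the choice of a naive Bayes model with add-one smoothing are all hypothetical and are used only to show the mapping from a recognized utterance to a semantic class.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(utterance):
    """Lowercase an utterance and split it into word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())


# Rule-based classifier for directed dialog input.
# The keyword sets are hypothetical, hand-written rules.
YES_NO_RULES = {
    "yes": {"yes", "yeah", "yep", "correct"},
    "no": {"no", "nope", "incorrect"},
}


def classify_rule_based(utterance, rules=YES_NO_RULES, fallback="no_match"):
    """Return the first semantic class whose keywords occur in the utterance."""
    tokens = set(tokenize(utterance))
    for semantic_class, keywords in rules.items():
        if tokens & keywords:
            return semantic_class
    return fallback


class NaiveBayesClassifier:
    """Statistical classifier trained from transcriptions and annotations.

    Naive Bayes is an assumption made for this sketch; the text above does
    not name a specific statistical model.
    """

    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, transcriptions, annotations):
        """Estimate model parameters from (transcription, annotation) pairs."""
        for text, label in zip(transcriptions, annotations):
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, utterance):
        """Return the semantic class with the highest log-probability score."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(utterance)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            n_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing so unseen words do not zero out a class.
                numerator = self.word_counts[label][word] + 1
                score += math.log(numerator / (n_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: transcribed utterances and their annotations.
transcriptions = [
    "my internet is down",
    "the internet connection is slow",
    "i have a question about my bill",
    "there is a wrong charge on my bill",
]
annotations = ["internet", "internet", "billing", "billing"]

statistical = NaiveBayesClassifier()
statistical.train(transcriptions, annotations)
```

In this sketch, the rule-based function would serve a directed dialog context such as a yes/no confirmation, while the trained model would serve an open prompt; a combined classifier could, for example, try the rules first and fall back to the statistical model.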
Modern spoken dialog systems can be very complex applications comprising thousands of activities, classifiers, and prompts. Years of development work can be spent designing these systems, and much effort is undertaken to tune the speech recognition classifiers involved to achieve the highest possible performance, which is crucial for user acceptance and effectiveness of the applications. Such tuning can require processing huge numbers of calls to analyze caller behavior in every single context of the system, building recognition classifiers to effectively interpret caller utterances, and designing the application to respond appropriately in every context.
In an example, to tune a spoken dialog system for Internet, cable TV, and Voice-over-IP troubleshooting, more than two million speech utterances can be collected, transcribed, annotated, and used for training statistical classifiers, boosting overall accuracy from an initial 78.0% to 90.5%. Although transcription and annotation of such amounts of data is partially automatable, it can still keep several people busy for months. While transcription is a relatively straightforward exercise, semantic annotation, i.e., the mapping of lexical content to one of a number of semantic symptoms, requires knowledge about the application. Not only must annotators understand what a caller utterance means in response to the system prompt in the respective context, but there are several aspects of semantic annotation that make it a non-trivial undertaking, such as the following:
Utterances may have no representation in the given set of symptoms, suggesting that they are out-of-scope for the classifier.
When the ratio of out-of-scope utterances grows and well-distinguishable patterns manifest themselves, annotators are to suggest the introduction of new symptoms to the system designer.
Utterances may be ambiguous, vague, too specific, or carry content belonging to multiple symptoms, making it hard for the annotator to make a decision.
Annotations have to follow a number of quality assurance criteria to produce powerful and exact results, including criteria for completeness, consistency, congruence, correlation, confusion, coverage, and corpus size (i.e., the “C7” criteria).
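As a purely illustrative sketch, the monitoring of out-of-scope utterances mentioned above could be expressed as a simple ratio over a batch of annotations. The label name `out_of_scope`, the symptom names, and the 20% review threshold are hypothetical assumptions chosen for this example; the text above does not specify how or when annotators decide to propose new symptoms.

```python
from collections import Counter

# Hypothetical label that annotators assign when no symptom fits.
OUT_OF_SCOPE = "out_of_scope"


def out_of_scope_ratio(annotations):
    """Return the fraction of annotations marked as out-of-scope."""
    if not annotations:
        return 0.0
    return Counter(annotations)[OUT_OF_SCOPE] / len(annotations)


def needs_symptom_review(annotations, threshold=0.2):
    """Flag a dialog context whose out-of-scope ratio exceeds the threshold.

    The threshold is an assumed value for illustration only.
    """
    return out_of_scope_ratio(annotations) > threshold


# Hypothetical annotations for one dialog context.
batch = ["no_picture", "out_of_scope", "slow_internet", "out_of_scope"]
ratio = out_of_scope_ratio(batch)
flagged = needs_symptom_review(batch)
```

A flagged context would then be handed to a human annotator, who inspects the out-of-scope utterances for recurring patterns before suggesting a new symptom to the system designer.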
These issues emphasize that thorough speech recognition tuning in spoken dialog systems can be a very expensive task. Large-scale spoken dialog systems as introduced above are mostly used in relatively big enterprises trying to optimize their customer care telephone portals. Many of these companies operate internationally, producing a need to localize their phone services, including the spoken dialog systems involved. Localization of a dialog system entails translating it from one language to another. The high cost of producing and maintaining systems in different languages obviously increases as more languages are considered. Not only the cost but also the time required to generate speech recognition classifiers from scratch is a crucial issue when localizing a given spoken dialog system.