Dialogue includes the language of a conversation between participants as well as a shared central context constructed by the participants in the conversation (e.g., references later in a conversation to “it” refer to something described earlier in the conversation). The participants in the conversation may be human, machine, or any combination of humans and machines. Dialogue management includes interpretation of speaker utterances with respect to the shared context, as well as techniques and strategies for managing the interaction between the dialogue participants. Activity-oriented dialogue systems have been in development for applications such as multimodal control of robotic devices, speech-enabled tutoring systems, and conversational interaction with in-car devices. The typical dialogue system architecture includes components such as speech recognizers, language parsers, language generators, speech synthesizers, and Dialogue Managers (“DM”). Such a dialogue system can also include connections to external application-specific components such as ontologies or knowledge bases (“KB”), as well as dialogue-enabled devices. See the following for examples of dialogue systems: (i) Lemon, O., A. Gruenstein, S. Peters (2002), “Collaborative activities and multi-tasking in dialogue systems”, Traitement Automatique des Langues (TAL), 43(2); (ii) Clark, B., J. Fry, M. Ginzton, S. Peters, H. Pon-Barry, Z. Thomsen-Grey (2001), “Automated tutoring dialogues for training in shipboard damage control”, SIGdial; and (iii) Weng, F., L. Cavedon, B. Raghunathan, D. Mirkovic, H. Cheng, H. Schmidt, H. Bratt, R. Mishra, S. Peters, L. Zhao, S. Upson, L. Shriberg, C. Bergmann (2004), “A conversational dialogue system for cognitively overloaded users (poster)”, INTERSPEECH.
The DM of a dialogue system is an oversight module that facilitates the interaction between dialogue participants. A dialogue system that uses Activity Models (“AMs”) is specific to one type of dialogue, referred to as “activity-oriented dialogue”, which is dialogue about activities being (jointly) carried out by a user and a machine, computer, and/or robot. In a user- or speaker-initiated system, the DM directs the processing of an input utterance from one component to another, through interpretation and on to the back-end system response. In the process, the DM, for example, detects and handles the information carried by the input utterance and generates system output. The DM may be used with different parsers and language-generation components. Interaction with external devices is mediated by the AMs, i.e., declarative specifications of device capabilities and their relationships to linguistic processes. However, customization to new domains has generally required significant programming effort, because dialogue move requirements vary across applications, because the representations used in the interface to the language parser and other components vary, and because certain processes (e.g., reference resolution) have domain-specific aspects.
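The routing role of the DM described above can be illustrated with a minimal Python sketch. It is an illustration only, not the architecture of any cited system: the class names `ActivityModel` and `DialogueManager`, the toy parser and generator, and the dictionary layout of a dialogue move are all hypothetical and chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ActivityModel:
    """Hypothetical declarative description of one device's capabilities."""
    device: str
    # Maps an activity name to the slots it requires, e.g. "play" -> ["track"].
    activities: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class DialogueManager:
    """Toy DM: routes an utterance from parser to activity model to generator."""
    parse: Callable[[str], dict]      # pluggable language parser (assumed interface)
    generate: Callable[[dict], str]   # pluggable language generator (assumed interface)
    model: ActivityModel
    context: List[dict] = field(default_factory=list)  # shared dialogue context

    def handle(self, utterance: str) -> str:
        move = self.parse(utterance)
        required = self.model.activities.get(move.get("activity"))
        if required is None:
            reply = {"type": "reject"}
        else:
            missing = [s for s in required if s not in move.get("slots", {})]
            if missing:
                reply = {"type": "ask", "slot": missing[0]}
            else:
                reply = {"type": "confirm", "activity": move["activity"]}
        # Record the exchange so later utterances can refer back to it.
        self.context.append({"move": move, "reply": reply})
        return self.generate(reply)


def toy_parse(utterance: str) -> dict:
    """Toy parser: first word is the activity, the rest fills a 'track' slot."""
    words = utterance.lower().split()
    if not words:
        return {"activity": None, "slots": {}}
    slots = {"track": " ".join(words[1:])} if len(words) > 1 else {}
    return {"activity": words[0], "slots": slots}


def toy_generate(reply: dict) -> str:
    """Toy generator: turns a system move into a sentence."""
    if reply["type"] == "ask":
        return f"Which {reply['slot']}?"
    if reply["type"] == "confirm":
        return f"OK, starting {reply['activity']}."
    return "Sorry, I cannot do that."
```

In this sketch the parser and generator are passed in as plain functions, which mirrors the statement that the DM may be used with different parsers and language-generation components. The device-specific knowledge lives only in the `ActivityModel` data.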
Conventional dialogue management systems range from the commercially widely used, yet more constrained, dialogue-modeling mechanisms based on Voice Extensible Markup Language (“VXML”) to semantic models based on the TrindiKit approach to information-state update. Many dialogue systems are designed and implemented for specific domains, and these systems require significant engineering to apply to new domains. Conversely, a dialogue management infrastructure based on VXML allows flexible implementation of speech-based dialogue systems for new domains, but provides only shallow solutions to many issues in dialogue modeling.
Conventional dialogue management systems also provide limited capabilities for processing confidence scores generated by a speech recognizer unit and/or other sources within the dialogue system. In a multi-device system, determining which device an utterance is directed at is not always straightforward. Although the resolution of noun-phrase (“NP”) arguments can be used as disambiguating information, the NP-resolution process itself is often device-specific, which prevents NPs from being properly resolved until the appropriate device has been determined.
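The circular dependency described above (the device is needed to resolve the NP, and the NP is needed to pick the device) can be made concrete with a short Python sketch. It is one possible illustration, not a method taken from the systems discussed here: the function names, the per-device resolver tables, the returned decision labels, and the `min_confidence` threshold of 0.5 are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical device-specific resolver: maps a noun phrase to an object ID,
# or returns None when the device cannot resolve it.
Resolver = Callable[[str], Optional[str]]


def candidate_devices(noun_phrase: str,
                      resolvers: Dict[str, Resolver]) -> List[Tuple[str, str]]:
    """Try every device's resolver and keep the (device, referent) pairs that succeed."""
    hits = []
    for device, resolve in resolvers.items():
        referent = resolve(noun_phrase)
        if referent is not None:
            hits.append((device, referent))
    return hits


def choose_device(noun_phrase: str,
                  asr_confidence: float,
                  resolvers: Dict[str, Resolver],
                  min_confidence: float = 0.5) -> Tuple[str, Optional[str]]:
    """Return a (decision, device) pair; the threshold value is an assumed default."""
    if asr_confidence < min_confidence:
        return ("reprompt", None)      # recognition too uncertain to act on
    hits = candidate_devices(noun_phrase, resolvers)
    if len(hits) == 1:
        return ("execute", hits[0][0])  # exactly one device understands the NP
    if hits:
        return ("clarify", None)        # several devices could be meant
    return ("unknown", None)            # no device can resolve the NP


# Toy resolver tables for two hypothetical in-car devices.
RESOLVERS: Dict[str, Resolver] = {
    "mp3": {"the song": "track_12", "it": "track_12"}.get,
    "navigation": {"the restaurant": "poi_7", "it": "poi_7"}.get,
}
```

With these toy tables, “the song” resolves on only one device, whereas “it” resolves on both, so the NP alone cannot select the device and the sketch falls back to a clarification decision.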