Speech-enabled applications, which allow users to interact with machines using speech as a mode of control, are becoming more prevalent with advances in technology.
Natural-language speech-enabled systems attempt to closely emulate human-human interaction and ideally allow users to speak in a natural manner. Such systems ask the user open-ended questions such as "How may I help you?" and allow the user to respond in his or her own desired manner, a manner over which the system has no control. In order to accommodate this user flexibility, a natural-language-based speech recognizer must have a relatively large vocabulary and a relatively large grammar, which tend to result in poor recognition accuracy. Moreover, in order to understand the free-form responses typical of such systems, natural-language-based systems also require a high level of natural language understanding.
On the other hand, dialog-based speech-enabled systems ask very specific questions of the user, and each question requires a specific response that is restricted to a set of pre-defined inputs as decided by the system. Dialog-based systems ask the user a specific question (also referred to as a "prompt") and, based upon the user's response, progress in a particular (pre-defined) order to thereby acquire sufficient information from the user to perform the desired action. Dialog-based systems exploit the limited context which results from this approach in order to improve recognition accuracy. Consequently, in a dialog-based system, the speech recognizer only needs to handle small grammars when processing the response elicited by each prompt in the generated dialog. This approach also reduces the size of the vocabulary required by the recognizer. The recognition accuracy of dialog-based speech recognition systems can accordingly be increased. However, dialog-based systems force the user to model his or her response in a system-defined manner. Another disadvantage of dialog-based systems is that the user has to traverse the prompt/response tree in order to obtain the desired information, which resides at a specified leaf of such a tree.
In dialog-based systems, the inputs to the system are typically referred to as "slots" (also referred to as "fields" or "information fields" in this description), where a pre-defined set of slots is needed by the application in order to perform a corresponding task. Each slot is associated with a specific type of information. Typical dialog-based arrangements use a "system-initiated" approach, also known as a directed-dialog approach, in which the user must respond to prompts from the system precisely in the order defined by the system. In such arrangements, a specific grammar is defined, along with a suitable prompt, to elicit information to fill a particular slot. Multiple slots typically cannot be filled on the basis of a single user utterance. Furthermore, the user utterance cannot be used to fill any slot other than the one for which information has been solicited. This approach results in a rigid system-directed interaction which makes the interaction long and monotonous for the user, often resulting in user dissatisfaction.
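As a concrete illustration (the form, field names, and grammar file below are hypothetical, not taken from any particular application), a directed-dialog interaction can be sketched in VXML with one field per slot, each field carrying its own prompt and its own small field-level grammar; the fields are then visited strictly in document order:

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical directed-dialog form: each slot ("field") has its
       own prompt and its own small field-level grammar, and the Form
       Interpretation Algorithm visits the fields in document order. -->
  <form id="book_flight">
    <field name="origin">
      <prompt>Which city are you leaving from?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
    </field>
    <field name="destination">
      <prompt>Which city are you flying to?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <!-- Both slots filled, one per utterance; hand off the result. -->
      <submit next="book.jsp" namelist="origin destination"/>
    </block>
  </form>
</vxml>
```

Because each grammar is active only while its own field is being collected, an utterance such as "from Boston to Denver" cannot fill both slots at once in this arrangement.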
To overcome these problems and make dialog-based systems more flexible, mixed-initiative dialog systems have been developed. In mixed-initiative systems the user need not make a response which is strictly compliant with the prompt. The user response can also be used to fill a slot other than the one directly associated with the current prompt. Furthermore, more than one slot can be filled on the basis of a single user utterance. This approach places some control with the user, who consequently has some flexibility in how the slots are filled, and both the computer and the user play a role in directing the dialog.
Mixed-initiative systems require composite grammars (also referred to as mixed-initiative or MI grammars in this description) which allow slots to be filled in an arbitrary order. Existing mixed-initiative systems are, however, inflexible, complex, and not easily portable across applications.
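A composite grammar of this kind can be sketched in the W3C Speech Recognition Grammar Specification (SRGS) XML format. The flight-booking phrases, city list, and slot names below are hypothetical, and the semantic tags assume the SISR "semantics/1.0" tag format:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical composite (MI) grammar for a flight-booking task:
     the user may mention the origin, the destination, or both,
     in either order, within a single utterance. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         root="booking" mode="voice" tag-format="semantics/1.0">
  <rule id="booking" scope="public">
    <item repeat="1-2">
      <one-of>
        <item>from <ruleref uri="#city"/>
          <tag>out.origin = rules.city;</tag>
        </item>
        <item>to <ruleref uri="#city"/>
          <tag>out.destination = rules.city;</tag>
        </item>
      </one-of>
    </item>
  </rule>
  <rule id="city">
    <one-of>
      <item>Boston<tag>out = "Boston";</tag></item>
      <item>Denver<tag>out = "Denver";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Under this sketch, "to Denver" fills only the destination slot, while "from Boston to Denver" fills both slots from one utterance.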
The Voice Extensible Markup Language (VXML) specification of the World Wide Web Consortium (W3C) provides constructs for writing MI dialogs. The VXML "form-level grammar" allows more than one field to be filled using a single user utterance. It is also possible to fill information fields other than those being asked about by the system. The VXML construct "initial", together with the form-level grammar and the VXML "Form Interpretation Algorithm" (FIA), is used in MI applications written in VXML. However, these VXML constructs enable only very primitive mixed-initiative dialog systems. In particular, the prompts presented by such systems typically do not correspond well with the information to be collected from the user. There is no mechanism to enable information collection for only a subset of slots among the initial set of MI slots in a dialog interaction. The support for "confirmation" and "disambiguation" is not robust. The resulting systems are inflexible, and can neither be easily configured for different behaviour nor easily ported to different applications.
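The following hypothetical VXML fragment (its field names, grammar file, and prompts are illustrative only) sketches how "initial" and a form-level grammar combine: the form-level grammar stays active for the whole form, so a single utterance such as "from Boston to Denver" can fill both fields, after which the FIA prompts only for whichever fields remain empty:

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="book_flight">
    <!-- Form-level grammar: active throughout the form, so one
         utterance may fill more than one field. -->
    <grammar src="booking.grxml" type="application/srgs+xml"/>
    <!-- <initial> plays the open prompt; the FIA leaves it as soon
         as any field has been filled. -->
    <initial name="start">
      <prompt>How can I help you with your flight?</prompt>
    </initial>
    <field name="origin">
      <prompt>Which city are you leaving from?</prompt>
    </field>
    <field name="destination">
      <prompt>Which city are you flying to?</prompt>
    </field>
  </form>
</vxml>
```

Even in this sketch, the limitations noted above remain: the open prompt is fixed, there is no way to restrict collection to a chosen subset of the MI slots, and confirmation and disambiguation must be built by hand.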
Agarwal et al. (R. Agarwal, B. M. Shahshahani, "Method and Apparatus for Providing A Mixed-Initiative Dialog Between A User and A Machine", US Patent Application US2004/0085162 A1, May 6, 2004) describe a mixed-initiative dialog system that presents a natural language speech interface to the user. The speech recognizer in Agarwal uses statistical language models, and Agarwal uses Natural Language Processing (NLP) to parse a user utterance in order to obtain the information needed to fill the various slots. However, as discussed above, natural language speech approaches are very prone to recognition error, with a consequent lack of accuracy. Furthermore, the use of NLP for parsing adds further recognition errors and system complexity.