The present invention relates to methods and systems for defining and handling user/computer interactions. In particular, the present invention relates to dialog systems.
Nearly all modern computer interfaces are based on computer-driven interactions in which the user must follow an execution flow set by the computer or learn one or more commands exposed by the computer. In other words, most computer interfaces do not adapt to the manner in which the user wishes to interact with the computer, but instead force the user to interact through a specific set of interfaces.
New research, however, has focused on the idea of having a computer/user interface that is based on a dialog metaphor in which both the user and the computer system can lead or follow the dialog. Under this metaphor, the user can provide an initial question or command and the computer system can then identify ambiguity in the question or command and ask refining questions to identify a proper course of action. Note that during the refinement, the user is free to change the dialog and lead it in a new direction. Thus, the computer system must be adaptive and react to these changes in the dialog. The system must be able to recognize the information that the user has provided to the system and derive a user intention from that information. In addition, the system must be able to convert the user intention into an appropriate action, such as asking a follow-up question or sending an e-mail message.
Note that the selection of the proper action is critical in that the quality of the user experience is dictated in large part by the number of questions that the system asks the user and, consequently, the amount of time it takes for the user to reach their goal.
In the past, such dialog systems have been created through a combination of technologies. Typically a stochastic model would be used to identify what the user has said. Such models provide probabilities for each of a set of hypothesis phrases. The hypothesis with the highest probability is then selected as the most likely phrase spoken by the user.
This most likely phrase is provided to a natural language parsing algorithm, which applies a set of natural language rules to identify the syntactic and semantic structure of the identified phrase.
The semantic structure is then passed to a plan-based system that applies a different set of rules based on the semantic meaning and the past dialog statements made by the user and the computer. Based on the execution of these rules, the dialog system selects an action that is to be taken.
Some systems have attempted to use stochastic models in the conversion from what was said to the semantic meaning of what was said. For example, in "The Thoughtful Elephant: Strategies for Spoken Dialog Systems," E. Souvignier et al., IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 1 (January 2000), a stochastic model is applied both to the step of identifying what has been said and to the step of converting what has been said into a semantic meaning.
Other systems have used stochastic models to determine what action to take given a semantic meaning. For example, in "A Stochastic Model for Machine Interaction for Learning Dialog Strategies," Levin et al., IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 1, pg. 11-23 (January 2000), a stochastic model is used in the conversion from a semantic meaning to an action.
Although stochastic models have been used in each of the stages separately, no system has used stochastic models in all of the stages of a dialog system with every stage designed to optimize the same objective function. Because of this, the sub-systems in these dialog systems do not integrate naturally with each other.
Another problem with current dialog systems is that they are not well suited for distributed computing environments with less than perfect quality of service. Telephone-based dialog systems, for example, rely heavily on the telephone links. A severance in the phone connection generally leads to the loss of dialog context and interaction contents. As a result, the dialog technologies developed for phone-based systems cannot be applied directly to Internet environments where the interlocutors do not always maintain a sustained connection. In addition, existing dialog systems typically force the user into a fixed interface on a single device that limits the way in which the user may drive the dialog. For example, current dialog systems typically require the user to use an Internet browser or a telephone, and do not allow a user to switch dynamically to a phone interface or a hand-held interface, or vice versa, in the middle of the interaction. As such, these systems do not provide as much user control as would be desired.
The present invention provides a dialog system in which the subsystems are integrated under a single technology model. In particular, each of the subsystems uses stochastic modeling to identify a probability for its respective output. The combined probabilities identify a most probable action to be taken by the dialog system given the latest input from the user and the past dialog states.
Specifically, a recognition engine is provided that uses a language model to identify a probability of a surface semantic structure given an input from a user. A semantic engine is also provided that uses a semantic model to identify a probability of a discourse structure given the probability of the surface semantic structures. Lastly, a rendering engine is provided that uses a behavior model to determine the lowest cost action that should be taken given the probabilities associated with one or more discourse structures provided by the semantic engine. By using stochastic modeling in each of the subsystems and forcing all the stages to jointly optimize a single objective function, the present invention provides a better integrated dialog system that theoretically should be easier to optimize.
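The chaining of probabilities described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the behavior model selects the action with the lowest expected cost over the discourse structures. The structure names, probabilities, cost table, and function names below are hypothetical and are not taken from the invention itself.

```python
from typing import Dict

# All values below are made-up illustrations, not figures from the source.

# P(surface semantic structure | user input), as a recognition engine might report.
surface_probs: Dict[str, float] = {
    "send_mail(to=?)": 0.6,
    "read_mail": 0.4,
}

# P(discourse structure | surface semantic structure), as a semantic engine might report.
discourse_given_surface: Dict[str, Dict[str, float]] = {
    "send_mail(to=?)": {"compose_to_unknown": 0.9, "check_inbox": 0.1},
    "read_mail": {"compose_to_unknown": 0.2, "check_inbox": 0.8},
}

# Hypothetical cost of taking each action when a given discourse structure is true.
action_costs: Dict[str, Dict[str, float]] = {
    "ask_recipient": {"compose_to_unknown": 1.0, "check_inbox": 5.0},
    "show_inbox": {"compose_to_unknown": 6.0, "check_inbox": 0.0},
}


def discourse_posterior(
    surface: Dict[str, float], given: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Marginalize over surface structures to get P(discourse | user input)."""
    posterior: Dict[str, float] = {}
    for s, p_s in surface.items():
        for d, p_d in given[s].items():
            posterior[d] = posterior.get(d, 0.0) + p_s * p_d
    return posterior


def lowest_cost_action(
    posterior: Dict[str, float], costs: Dict[str, Dict[str, float]]
) -> str:
    """Return the action whose expected cost under the posterior is smallest."""
    def expected_cost(action: str) -> float:
        return sum(posterior[d] * costs[action][d] for d in posterior)
    return min(costs, key=expected_cost)


posterior = discourse_posterior(surface_probs, discourse_given_surface)
best_action = lowest_cost_action(posterior, action_costs)
```

Because every stage passes probabilities forward rather than a single best guess, the final choice reflects uncertainty from both recognition and semantic interpretation, which is one way to read the single shared objective described above.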
An additional aspect of the present invention is an embodiment in which the recognition engine, the semantic engine and the rendering engine communicate with one another through XML pages, thus allowing the engines to be distributed across a network. By using XML, the dialog systems can take advantage of the massive infrastructure developed for the Internet.
In this embodiment, the behavior model is written or dynamically synthesized using the extensible stylesheet language (XSL) which allows the behavior model to convert the XML pages generated by the semantic engine into an output that is not only the lowest cost action given the discourse representation found in the semantic engine XML page, but is also appropriate for the output interface selected by the user. In particular, the XSL-transformations provided by the behavior model allow a single XML page output by the semantic engine to be converted into a format appropriate for an Internet browser, a phone system, or a hand-held system, for example. Thus, under this embodiment, the user is able to control which interface they use to perform the dialog, and in fact can dynamically change their interface during the dialog.
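The idea of rendering one semantic-engine page for several interfaces can be sketched as follows. This sketch does not use XSL; it substitutes plain Python functions and the standard-library `xml.etree.ElementTree` parser for the XSL-transformations, only to show the shape of the idea. The XML vocabulary (`discourse`, `action`, `slot`, `prompt`), the device names, and the output formats are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML page, standing in for semantic engine output.
SEMANTIC_PAGE = """<discourse>
  <action type="ask">
    <slot name="recipient"/>
    <prompt>Who should receive the message?</prompt>
  </action>
</discourse>"""


def render_browser(action: ET.Element) -> str:
    """Produce an HTML form fragment for an Internet browser."""
    prompt = action.findtext("prompt")
    slot = action.find("slot").get("name")
    return f'<form><label>{prompt}</label><input name="{slot}"/></form>'


def render_phone(action: ET.Element) -> str:
    """Produce a spoken prompt wrapped in a made-up speech markup tag."""
    return f"<speak>{action.findtext('prompt')}</speak>"


def render_handheld(action: ET.Element) -> str:
    """Produce short plain text, truncated for a small display."""
    return action.findtext("prompt")[:40]


# One renderer per output interface; each plays the role of a stylesheet.
RENDERERS = {
    "browser": render_browser,
    "phone": render_phone,
    "handheld": render_handheld,
}


def render(page_xml: str, device: str) -> str:
    """Render the same XML page for whichever interface the user is on."""
    action = ET.fromstring(page_xml).find("action")
    return RENDERERS[device](action)
```

Since the page itself does not change, a user who moves from a browser to a phone mid-dialog would only require a different renderer to be applied to the same page, for example `render(SEMANTIC_PAGE, "phone")`.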