The exemplary embodiment relates to dialog systems and finds particular application in connection with a system and method for tracking a dialog state using probabilistic matching with an ontology.
Automated dialog systems interact with users in a natural language to help them achieve a goal. As an example, a user may be interested in finding a restaurant and may have a set of constraints, such as geographic location, date, and time. The system offers the name of a restaurant that satisfies the constraints. The user may then request additional information about the restaurant. The dialogue continues until the user's questions are answered. There are many other applications where dialog systems could be advantageous. For example, in the context of customer call centers, efficient automation could bring a gain of productivity by increasing the probability of success of each call while reducing the overall cost.
The use of autonomous dialog systems is rapidly growing with the spread of smart mobile devices but still faces challenges to becoming a primary user interface for natural interaction using conversations. In particular, when dialogs are conducted in noisy environments or when utterances themselves are noisy, it can be difficult for the system to recognize or understand the user utterances.
Dialog systems often include a dialog state tracker which monitors the progress of the dialogue (dialog and dialogue may be used interchangeably herein). The dialog state tracker provides a compact representation of the past user input and system output in the form of a dialog state. The dialog state encapsulates the information needed to successfully finish the dialogue, such as the user's goal or requests. The term “dialog state” loosely denotes a representation of the knowledge of user needs at any point in a dialogue. The precise nature of the dialog state depends on the associated dialog task. An effective dialog system benefits from a state tracker which is able to accumulate evidence, in the form of observations, accurately over the sequence of turns of a dialogue, and adjust the dialog state according to the observations. However, in spoken dialog systems, where the user utterance is input as a voice recording, the errors incurred by Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) means that the true user utterance may not be directly observable. This makes it difficult to compute the true dialog state.
A common mathematical representation of a dialog state is a slot-filling schema. See, for example, Williams, et al., “Partially observable Markov decision processes for spoken dialog systems,” Computer Speech & Language, 21(2):393-422, 2007, hereinafter, “Williams 2007,” In this approach, the state is composed of a predefined set of variables with a predefined domain of expression for each of them. The goal of the dialog system is to instantiate each of the variables efficiently in order to perform an associated task and satisfy the corresponding intent of the user. In the restaurant case, for example, this may include, for each of a set of variables, a most probable value of the variable, such as: location: downtown; date: August 14; time: 7.30 pm; restaurant type: Spanish, (or unknown if the variable has not yet been assigned).
Various approaches have been suggested for defining dialog state trackers. Some systems use hand-crafted rules that rely on the most likely result from an NLU module. See, Williams, “Web-style ranking and SLU combination for dialogue state tracking,” Proc. SIGDIAL, pp. 282-291, June 2014; Nuance Communications, ‘Grammar developers guide. Technical report,” Nuance Communications, 1380 Willow Road, Menlo Park, Calif. 94025, 2007. More recent methods take a statistical approach to estimating the posterior distribution over the dialog states using the results of the NLU step. Statistical dialog systems, in maintaining a distribution over multiple hypotheses of the true dialog state, are able to behave in a more robust manner when faced with noisy conditions and ambiguity.
Statistical dialog state trackers can be categorized into two general approaches (generative and discriminative), depending on how the posterior probability distribution over the state calculation is modeled. The generative approach uses a generative model of the dialog dynamic that describes how the NLU results are generated from the hidden dialog state and uses the Bayes rule to calculate the posterior probability distribution. Generative systems are described, for example, in Williams 2007; Williams, “Exploiting the ASR n-best by tracking multiple dialog state hypotheses,” INTERSPEECH, pp. 191-194, 2008; and Williams, “Incremental partition recombination for efficient tracking of multiple dialog states,” ICASSP, pp. 5382-5385, 2010. The generative approach has been popular for statistical dialog state tracking, since it naturally fits into the Partially Observable Markov Decision Process (POMDP) type of modeling, which is an integrated model for dialog state tracking and dialog strategy optimization. See, Young, et al., “POMDP-based statistical spoken dialog systems: A review,” Proc. IEEE, 101(5):1160-1179, 2013. In the context of POMDP, dialog state tracking is the task of calculating the posterior distribution over the hidden states, given the history of observations.
The discriminative approach aims at directly modeling the posterior distribution through an algebraic closed formulation of a loss minimization problem. Discriminative systems are described, for example, in Paek, et al., “Conversation as action under uncertainty,” UAI '00: Proc. 16th Conf. in Uncertainty in Artificial Intelligence, pp. 455-464, 2000; and Thomson, et al., “Bayesian update of dialog state: A POMDP framework for spoken dialogue systems,” Computer Speech & Language, 24(4):562-588, 2010.
A primary drawback of these two statistically-based approaches is the need of extensive data to embed the knowledge for inferring a state tracking model. While gathering data is often a feasible task, annotating a gathered dialog corpus can be time-consuming and costly. Virtual annotation based on prior linguistic knowledge, such as grammar, has been proposed. (Deepak Ramachandran, et al., “An end-to-end dialog system for TV program discovery,” SLT, pp. 602-607, IEEE, 2014).