The exemplary embodiment relates to dialog systems and finds particular application in connection with a system and method for tracking a dialog state using collective matrix factorization.
Automated dialog systems interact with users via natural language to help them achieve a goal. As an example, a user may be interested in finding a restaurant and may have a set of constraints, such as geographic location, date, and time. The system offers the name of a restaurant that satisfies the constraints. The user may then request additional information about the restaurant. The dialogue continues until the user's questions are answered. There are many other applications where dialog systems would be advantageous. For example, in the context of customer care, efficient automation could improve productivity by increasing the probability that each call succeeds while reducing the overall cost.
The use of autonomous dialog systems is rapidly growing with the spread of smart mobile devices but still faces challenges to becoming a primary user interface for natural interaction using conversations. In particular, when dialogs are conducted in noisy environments or when utterances themselves are noisy, it can be difficult for the system to recognize or understand the user utterances.
Dialog systems often include a dialog state tracker which monitors the progress of the dialogue (dialog and dialogue may be used interchangeably herein). The dialog state tracker provides a compact representation of the past user input and system output in the form of a dialog state. The dialog state encapsulates the information needed to successfully finish the dialogue, such as the user's goal or requests. The term “dialog state” loosely denotes a representation of the knowledge of user needs at any point in a dialogue. The precise nature of the dialog state depends on the associated dialog task. An effective dialog system benefits from a state tracker which is able to accumulate evidence, in the form of observations, accurately over the sequence of turns of a dialogue, and adjust the dialog state according to the observations. However, in spoken dialog systems, where the user utterance is input as a voice recording, the errors incurred by Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) mean that the true user utterance may not be directly observable. This makes it difficult to compute the true dialog state.
A common mathematical representation of a dialog state is a slot-filling schema. See, for example, Williams, et al., “Partially observable Markov decision processes for spoken dialog systems,” Computer Speech & Language, 21(2):393-422, 2007, hereinafter, “Williams 2007”. In this approach, the state is composed of a predefined set of variables with a predefined domain of expression for each of them. The goal of the dialog system is to instantiate each of the variables efficiently in order to perform an associated task and satisfy the corresponding intent of the user. In the restaurant case, for example, this may include, for each of a set of variables, a most probable value of the variable, such as: location: downtown; date: August 14; time: 7:30 pm; restaurant type: Spanish (or unknown if the variable has not been assigned a different value).
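By way of illustration only, a slot-filling schema of this kind can be sketched in a few lines of Python. The class name `SlotFillingState`, the slot names, and the value domains below are hypothetical and chosen to mirror the restaurant example; they are not taken from Williams 2007 or from any particular dialog system.

```python
UNKNOWN = "unknown"

# Hypothetical slots and value domains for the restaurant example.
RESTAURANT_DOMAINS = {
    "location": {"downtown", "uptown", "suburbs"},
    "date": {"August 14", "August 15"},
    "time": {"7:30 pm", "8:00 pm"},
    "restaurant_type": {"Spanish", "Italian", "Thai"},
}


class SlotFillingState:
    """A dialog state as a fixed set of variables, each with a fixed domain."""

    def __init__(self, domains):
        self.domains = domains
        # Every variable starts as "unknown" until a value is assigned.
        self.values = {slot: UNKNOWN for slot in domains}

    def assign(self, slot, value):
        """Instantiate one variable, rejecting slots or values outside the schema."""
        if slot not in self.domains:
            raise KeyError(f"slot {slot!r} is not part of the schema")
        if value not in self.domains[slot]:
            raise ValueError(f"{value!r} is not in the domain of {slot!r}")
        self.values[slot] = value

    def is_complete(self):
        """True once every variable has been instantiated."""
        return all(v != UNKNOWN for v in self.values.values())


state = SlotFillingState(RESTAURANT_DOMAINS)
state.assign("location", "downtown")
state.assign("restaurant_type", "Spanish")
```

In this sketch the state holds a single value per variable; a statistical tracker would instead hold a probability distribution over each variable's domain and report the most probable value.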
Various approaches have been suggested for defining dialog state trackers. Some systems use hand-crafted rules that rely on the most likely result from an NLU module. However, these rule-based systems are prone to frequent errors as the most likely result is not always correct. Moreover, these systems often drive the customer to respond using simple keywords and to confirm everything they say explicitly, which is far from a natural conversational interaction. See, Williams, “Web-style ranking and SLU combination for dialogue state tracking,” Proc. SIGDIAL, pp. 282-291, June 2014. More recent methods take a statistical approach to estimating the posterior distribution over the dialog states using the results of the NLU step. Statistical dialog systems, in maintaining a distribution over multiple hypotheses of the true dialog state, are able to behave in a robust manner when faced with noisy conditions and ambiguity.
Statistical dialog state trackers can be categorized into two general approaches (generative and discriminative), depending on how the posterior probability distribution over the dialog state is modeled. The generative approach uses a generative model of the dialog dynamics that describes how the NLU results are generated from the hidden dialog state and uses Bayes' rule to calculate the posterior probability distribution. The generative approach has been a popular approach for statistical dialog state tracking, since it naturally fits into the Partially Observable Markov Decision Process (POMDP) type of modeling, which is an integrated model for dialog state tracking and dialog strategy optimization. See, Young, et al., “POMDP-based statistical spoken dialog systems: A review,” Proc. IEEE, 101(5):1160-1179, 2013. In the context of POMDP, dialog state tracking is the task of calculating the posterior distribution over the hidden states, given the history of observations.
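As a minimal sketch of the generative update, assume a single variable with a small discrete set of values, a transition model P(s' | s), and an observation model P(o | s'). The posterior after one turn is then b'(s') ∝ P(o | s') Σ_s P(s' | s) b(s). The function name `bayes_update` and all probabilities below are illustrative assumptions, not values or code from the cited works.

```python
def bayes_update(belief, transition, likelihood):
    """One turn of Bayesian belief tracking over a discrete hidden state.

    belief:     dict mapping state -> prior probability b(s)
    transition: dict of dicts, transition[s][s2] = P(s2 | s)
    likelihood: dict mapping state -> P(observation | state)
    """
    states = list(belief)
    # Prediction step: propagate the prior through the transition model.
    predicted = {
        s2: sum(transition[s1][s2] * belief[s1] for s1 in states)
        for s2 in states
    }
    # Correction step: weight by how well each state explains the observation.
    unnormalized = {s: likelihood[s] * predicted[s] for s in states}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every state")
    # Normalize so the posterior sums to one.
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical numbers: the user goal rarely changes between turns, and the
# NLU output for this turn favors "Spanish" over "Italian".
prior = {"Spanish": 0.5, "Italian": 0.5}
transition = {
    "Spanish": {"Spanish": 0.9, "Italian": 0.1},
    "Italian": {"Spanish": 0.1, "Italian": 0.9},
}
likelihood = {"Spanish": 0.8, "Italian": 0.2}

posterior = bayes_update(prior, transition, likelihood)
```

Repeating this update at each turn accumulates evidence over the dialogue, so the tracker keeps a distribution over hypotheses rather than committing to a single NLU result.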
The discriminative approach aims at directly modeling the posterior distribution through a closed-form algebraic formulation of a loss minimization problem.
Generative systems are described, for example, in Williams 2007; Williams, “Exploiting the ASR n-best by tracking multiple dialog state hypotheses,” INTERSPEECH, pp. 191-194, 2008; Williams, “Incremental partition recombination for efficient tracking of multiple dialog states,” ICASSP, pp. 5382-5385, 2010; Thomson, et al., “Bayesian update of dialog state: A POMDP framework for spoken dialogue systems,” Computer Speech & Language, 24(4):562-588, 2010, hereinafter, “Thomson 2010.”
Discriminative systems are described, for example, in Paek, et al., “Conversation as action under uncertainty,” UAI '00: Proc. 16th Conf. in Uncertainty in Artificial Intelligence, pp. 455-464, 2000, and in Thomson 2010. The successful use of discriminative models for belief tracking has recently been reported in Williams, “Challenges and opportunities for state tracking in statistical spoken dialog systems: Results from two public deployments,” J. Sel. Topics Signal Processing, 6(8):959-970, 2012; Henderson, et al., “Deep Neural Network Approach for the Dialog State Tracking Challenge,” Proc. SIGDIAL 2013, pp. 467-471, 2013.
Each of these statistical approaches suffers from some limitations, such as complex inference at test time, scalability, or restrictions on the set of possible state variables in learning.