Traditionally, natural language understanding (NLU) models have operated independently from dialogue models, and the output from the NLU model has simply been provided to the dialogue model in a pipeline fashion. Such a conventional approach is sensitive to errors from the NLU model.
In the last decade, natural language understanding and dialogue management have taken on increased importance due to the incorporation of conversational systems, e.g., digital assistants, in various devices. The goal of a conversational system is to enable users to provide natural language input that the system can use to assist them in completing tasks more efficiently. A typical pipeline in traditional conversational systems first uses an NLU model to parse user utterances into semantic frames that capture meaning. Typically, the first task in the NLU model is to determine the domain of the input utterance; based on the domain, the second task is to predict the intent; and the third task is to fill the associated slots corresponding to a domain-specific semantic template. The next step in the pipeline passes the output of the NLU model to a separate dialogue manager (DM) model. The task of the DM model is to monitor a belief distribution over the possible user states underlying current user behaviors and, based on that belief distribution, to predict system actions.
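The pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword tables stand in for trained classifiers and slot taggers, and the names `SemanticFrame`, `nlu`, and `dialogue_manager` are hypothetical rather than taken from any particular system.

```python
from dataclasses import dataclass, field

# Hypothetical keyword tables standing in for trained models.
DOMAIN_KEYWORDS = {"movies": ("movie", "ticket"), "weather": ("weather", "rain")}
INTENT_KEYWORDS = {
    "movies": {"book_ticket": ("book", "buy"), "find_showtimes": ("showtimes",)},
    "weather": {"get_forecast": ("forecast", "weather")},
}
SLOT_VALUES = {
    "movies": {"city": ("seattle", "boston"), "date": ("today", "tomorrow")},
    "weather": {"city": ("seattle", "boston"), "date": ("today", "tomorrow")},
}


@dataclass
class SemanticFrame:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)


def classify_domain(tokens):
    # Task 1: decide the domain from the utterance alone.
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(t in keywords for t in tokens):
            return domain
    return "unknown"


def classify_intent(tokens, domain):
    # Task 2: predict the intent, conditioned on the chosen domain.
    for intent, keywords in INTENT_KEYWORDS.get(domain, {}).items():
        if any(t in keywords for t in tokens):
            return intent
    return "unknown"


def fill_slots(tokens, domain):
    # Task 3: fill the slots of the domain-specific template.
    slots = {}
    for name, values in SLOT_VALUES.get(domain, {}).items():
        for t in tokens:
            if t in values:
                slots[name] = t
    return slots


def nlu(utterance):
    tokens = utterance.lower().split()
    domain = classify_domain(tokens)
    return SemanticFrame(domain, classify_intent(tokens, domain),
                         fill_slots(tokens, domain))


def dialogue_manager(frame):
    # The DM sees only the NLU output, never the raw utterance.
    if frame.domain == "unknown" or frame.intent == "unknown":
        return "ask_clarification"
    if frame.intent == "book_ticket" and "date" not in frame.slots:
        return "request_date"
    return "execute_" + frame.intent
```

For instance, `dialogue_manager(nlu("Book a movie ticket in Seattle tomorrow"))` selects the booking action, whereas an utterance whose domain is not recognized yields an "unknown" intent and empty slots as well, because each later stage depends on the earlier decision.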
Such traditional approaches have several disadvantages. Traditional approaches for NLU usually model the tasks of domain/intent classification and slot filling separately and employ sequential labeling methods. For example, hidden Markov models (HMMs) and conditional random fields (CRFs) are widely used in slot tagging tasks, while maximum entropy models and support vector machines with a linear kernel (LinearSVM) are often applied to user intent prediction. These models rely on careful feature engineering that is laborious and time consuming. Applying deep learning techniques, such as recurrent neural networks, together with CRF modeling has improved expressive feature representations in NLU modeling, and convolutional neural networks have improved domain/intent classification. However, even though slot tags and intents, as semantic representations of user behaviors, may share knowledge with each other, modeling these two tasks separately typically cannot take full advantage of all supervised signals.
Furthermore, information flows in one direction from the NLU to the DM, so noisy outputs (errors) from the NLU are apt to propagate to the downstream DM, which makes it harder to monitor the belief distribution and predict system actions. The most successful previous approaches cast the DM as a partially observable Markov decision process (POMDP), which uses hand-crafted features to represent the state and action space. These existing approaches require a large number of annotated conversations or human interactions. Moreover, exact policy learning is computationally intractable, so converting these experimental methods into practice has proven far from trivial. As a result, these previous approaches are constrained to narrow domains.
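The effect of NLU noise on belief monitoring can be sketched with a single Bayesian update. This is a deliberately simplified illustration, not a full POMDP: it assumes a fixed user goal (no state transitions), at least two candidate goals, an observed intent that is one of those goals, and an NLU that reports the true goal with probability `nlu_accuracy` and otherwise errs uniformly. The function names and the confidence threshold are hypothetical.

```python
def update_belief(belief, observed_intent, nlu_accuracy):
    """Update a belief over user goals from one noisy NLU observation."""
    n = len(belief)
    posterior = {}
    for goal, prior in belief.items():
        # Assumed observation model: correct with probability nlu_accuracy,
        # otherwise the error mass is spread evenly over the other goals.
        if goal == observed_intent:
            likelihood = nlu_accuracy
        else:
            likelihood = (1.0 - nlu_accuracy) / (n - 1)
        posterior[goal] = prior * likelihood
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}


def choose_action(belief, threshold=0.7):
    """Act on the most likely goal only when the belief is confident enough."""
    goal, prob = max(belief.items(), key=lambda kv: kv[1])
    return ("execute_" if prob >= threshold else "confirm_") + goal
```

Starting from a uniform belief over three goals, an observation from a more reliable NLU concentrates the belief enough to act, while the same observation from a noisier NLU leaves the belief below the threshold and forces a confirmation turn. This is one way upstream errors make system action prediction harder.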
Improvements in accuracy and processing speed are important for conversation understanding systems, such as digital personal assistants, to operate effectively across a wide variety of domains.