Traditional goal-oriented dialogue systems are intended to help users complete specific tasks, such as booking a flight or searching a database, by interacting with users via natural language. These systems also typically need to interact with an external database to access real-world knowledge. Previous goal-oriented dialogue systems did so by issuing a symbolic query to the database and adding the retrieved results to the dialogue state. Previous end-to-end systems constructed the symbolic query from the current belief states of the agent and retrieved the database results that matched the query. However, such symbolic operations typically break the differentiability of the models used by traditional goal-oriented dialogue systems and prevent end-to-end gradient-based training of neural dialogue agents. Thus, existing machine learning systems have, up to now, focused on piece-wise training of end-to-end system components.
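As an illustration of why a symbolic query breaks differentiability, the following minimal sketch builds a hard query from a belief state and runs it against a toy database. The database rows, slot names, and the helper functions `build_symbolic_query` and `run_query` are hypothetical and are not taken from any particular system. The point is that the query depends on the belief state only through an argmax, so small changes in the belief probabilities leave the retrieved results unchanged and provide no gradient signal.

```python
# Hypothetical toy database and belief state, for illustration only.
DATABASE = [
    {"name": "flight_a", "origin": "SEA", "destination": "SFO"},
    {"name": "flight_b", "origin": "SEA", "destination": "LAX"},
    {"name": "flight_c", "origin": "PDX", "destination": "SFO"},
]


def build_symbolic_query(belief_state):
    """Pick the most probable value for each slot (a hard argmax)."""
    return {slot: max(dist, key=dist.get) for slot, dist in belief_state.items()}


def run_query(database, query):
    """Return the rows whose fields match every slot value in the query."""
    return [
        row
        for row in database
        if all(row.get(slot) == value for slot, value in query.items())
    ]


# Belief state: one probability distribution per slot.
belief = {
    "origin": {"SEA": 0.7, "PDX": 0.3},
    "destination": {"SFO": 0.6, "LAX": 0.4},
}
query = build_symbolic_query(belief)
results = run_query(DATABASE, query)

# Nudging the probabilities without changing the argmax yields the same
# rows: the lookup is piecewise constant in the belief state.
nudged = {
    "origin": {"SEA": 0.65, "PDX": 0.35},
    "destination": {"SFO": 0.55, "LAX": 0.45},
}
nudged_results = run_query(DATABASE, build_symbolic_query(nudged))
```

Because the output is a discrete set of rows selected by exact matching, a gradient-based optimizer cannot propagate an error signal from the retrieved results back into the belief tracker.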
Statistical goal-oriented dialogue systems have long been modeled as partially observable Markov decision processes (POMDPs), which are trained using reinforcement learning (RL) based on user feedback.
In the last decade, goal-oriented dialogue systems (DSs) have been incorporated in various devices, with the goal of enabling users to speak to systems in order to finish tasks more efficiently. A typical goal-oriented dialogue system consists of four basic components: a language understanding (LU) module for inferring user intents and extracting associated slots, a dialogue state tracker which tracks the user goal and dialogue history, a dialogue policy which selects the next system action based on the current dialogue state, and a natural language generator (NLG) for converting dialogue acts into natural language. For successful completion of user goals, it is also necessary to equip the dialogue policy with real-world knowledge from a database. A typical LU pipeline parses user utterances into semantic frames that capture their meaning. The first task is to decide the domain given the input utterance; based on the domain, the second task is to predict the intent; and the third task is to fill the associated slots corresponding to a domain-specific semantic template.
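The three-stage LU pipeline described above (domain, then intent, then slots) can be sketched as follows. This is a toy illustration only: the keyword tables and regular expressions are hypothetical stand-ins for trained classifiers and slot taggers, and the domain, intent, and slot names are assumptions chosen for the example.

```python
import re

# Hypothetical keyword rules standing in for trained domain/intent classifiers.
DOMAIN_KEYWORDS = {
    "flights": ["flight", "fly"],
    "restaurants": ["table", "restaurant"],
}
INTENT_KEYWORDS = {
    "flights": {"book_flight": ["book", "reserve"], "find_flight": ["find", "search"]},
    "restaurants": {"book_table": ["book", "reserve"]},
}
# Hypothetical domain-specific semantic templates: slot name -> pattern.
SLOT_PATTERNS = {
    "flights": {
        "origin": re.compile(r"\bfrom (\w+)"),
        "destination": re.compile(r"\bto (\w+)"),
    },
    "restaurants": {"party_size": re.compile(r"\bfor (\d+)")},
}


def first_match(text, keyword_map):
    """Return the first label whose keywords appear in the text, else None."""
    for label, keywords in keyword_map.items():
        if any(keyword in text for keyword in keywords):
            return label
    return None


def parse_utterance(utterance):
    """Parse an utterance into a semantic frame: domain, intent, slots."""
    text = utterance.lower()
    # Task 1: decide the domain from the input utterance.
    domain = first_match(text, DOMAIN_KEYWORDS)
    if domain is None:
        return {"domain": None, "intent": None, "slots": {}}
    # Task 2: predict the intent, conditioned on the domain.
    intent = first_match(text, INTENT_KEYWORDS[domain])
    # Task 3: fill the slots defined by the domain-specific template.
    slots = {}
    for name, pattern in SLOT_PATTERNS[domain].items():
        match = pattern.search(text)
        if match:
            slots[name] = match.group(1)
    return {"domain": domain, "intent": intent, "slots": slots}


frame = parse_utterance("Book a flight from Seattle to Boston")
unknown = parse_utterance("hello there")
```

Note that each stage consumes the output of the previous one, so a wrong domain decision in the first task leads the later tasks to use the wrong intent set and slot template.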
Such traditional approaches have several disadvantages. First, errors from previous turns propagate to subsequent turns, so earlier errors degrade the performance of the current and all subsequent turns. Second, knowledge mentioned earlier in a long dialogue history is often not carried into the current turn, because only a few turns are aggregated.
Improving the accuracy and processing speed of LU is important for conversation understanding systems such as digital personal assistants.