Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “chat bots,” “interactive personal assistants,” “intelligent personal assistants,” “personal voice assistants,” “conversational agents,” etc.). For example, humans (who, when they interact with automated assistants, may be referred to as “users”) may provide commands, queries, and/or requests using spoken natural language input (i.e., utterances), which may in some cases be converted into text and then processed, and/or using textual (e.g., typed) natural language input.
Users may engage in human-to-computer dialog sessions with automated assistants to complete a variety of tasks, such as finding a restaurant or booking movie tickets, through natural language interactions. A “dialog manager,” also referred to as a “dialog management policy,” is often the decision-making component of an automated assistant, and may choose responsive actions at each step or “turn” to guide the human-to-computer dialog session to successful task completion. The responsive actions may include interacting with the user to, for instance, obtain specific requirements for accomplishing the task (i.e., slot filling), as well as negotiating and offering alternatives.
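For illustration only, the turn-by-turn role of a dialog manager described above may be sketched as follows. This is a minimal, hand-written rule-based sketch, not a learned policy and not any particular implementation; the slot names, the action labels “request” and “offer,” and the functions choose_action and run_dialog are all hypothetical and are assumed solely for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical slots for a restaurant-finding task (assumed for illustration).
REQUIRED_SLOTS = ("cuisine", "location", "time")

Action = Tuple[str, Optional[str]]


def choose_action(state: Dict[str, Optional[str]]) -> Action:
    """Pick the responsive action for the current turn.

    Requests the first required slot that is still unfilled (slot filling);
    once every slot is filled, offers a result to the user.
    """
    for slot in REQUIRED_SLOTS:
        if state.get(slot) is None:
            return ("request", slot)
    return ("offer", None)


def run_dialog(user_answers: Dict[str, str]) -> List[Action]:
    """Simulate a dialog session against a user who answers every request."""
    state: Dict[str, Optional[str]] = {slot: None for slot in REQUIRED_SLOTS}
    actions: List[Action] = []
    while True:
        action = choose_action(state)
        actions.append(action)
        if action[0] == "offer":
            return actions
        # The (simulated) user supplies a value for the requested slot.
        state[action[1]] = user_answers[action[1]]


if __name__ == "__main__":
    answers = {"cuisine": "thai", "location": "downtown", "time": "7pm"}
    for turn, act in enumerate(run_dialog(answers), start=1):
        print(turn, act)
```

In this sketch the session completes in four turns, and each turn changes the dialog state substantially, which is the property of task-oriented dialog that the following paragraphs contrast with video games.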
Reinforcement learning has been used successfully in a variety of discrete action domains. For example, reinforcement learning has been used to train decision-making models, or “playing agents,” to play a variety of relatively simple video games. While efforts have been made to utilize reinforcement learning to train dialog managers employed by automated assistants, dialog management presents different challenges than those presented by video games. A playing agent for playing a video game may be trained on only a handful of available moves, such as up, down, left, right, etc. A video game episode may be relatively large in depth, i.e., may include a relatively large number of discrete steps (e.g., moves), but each individual step makes a relatively small change in the state of the environment. By contrast, dialog management, and particularly task-oriented dialog, usually consists of only a relatively small number of discrete steps (referred to as “turns” herein), and actions performed during each turn can dramatically alter a state of the dialog session. Consequently, mistakes by a dialog manager are both costlier and more temporally localized than mistakes in other reinforcement learning domains.
In some respects, dialog management is similar to strategy games (e.g., Go, chess), which require long-term planning because each individual move has a relatively large impact on the game state. Successful efforts have been made to train playing agents to play such games using aspects of reinforcement learning. However, dialog management is different from strategy games because it is essentially an asymmetric, imperfect-information game with no predefined rules. Consequently, it is difficult to collect large, high-quality data sets of dialogs with expert human agents and real users for every kind of task and/or user behavior that the dialog system may be expected to handle. And because dialog management is asymmetric, it is not straightforward to apply self-play to exhaustively explore the game tree. Additionally, the flexibility of human conversations and the lack of precise models of user goals and behavior make it laborious to engineer a realistic user simulator. Moreover, uncertainty over a user's goals and strict latency expectations for a real-time dialog agent make it difficult to leverage Monte Carlo tree search rollouts at inference time.