1. Technical Field
The present disclosure relates to spoken dialog systems and more specifically to combining manual design of spoken dialog systems with an automatic learning approach.
2. Introduction
The development of interactive computer systems is expensive and time-consuming. Further, the user interface to such systems poses a significant challenge. Despite years of research, speech recognition technology is far from perfect, and speech recognition errors remain a central problem for the user interface. Misunderstanding the user's speech causes the system to get off track and often leads to failed dialogs.
Two approaches are commonly used for generating spoken dialog systems: the conventional approach and the automatic learning approach. The conventional or manual design approach is often used in commercial or industrial settings. Such commercial systems have a manually designed computer program controlling the flow of the conversation. A dialog designer can tailor all the prompts to say exactly what she wants. Because a computer program controls the dialog flow, a designer can modify the computer program to encode business rules. Some examples of business rules include "always confirm a money transfer with a yes/no question" and "never display account information unless the corresponding user account is verified." A dialog designer must generate detailed flow charts outlining the possible branches in the conversation. These flow charts can be very large and complicated (e.g., hundreds of Microsoft Visio pages) because conversations are temporal. At every point, the person can say something different, so the tree is complicated, with many branches and loops. To simplify these complicated trees, a designer typically ignores much of the state information, history, and dialog details. As a result, manually designed systems are not very robust to speech recognition errors.
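The manual design approach described above can be illustrated with a minimal sketch. The function name, state keys, and prompts below are hypothetical and chosen only for illustration; they are not taken from any particular commercial system. The sketch encodes the two example business rules directly in the control flow and, like many hand-built flows, ignores recognition confidence and dialog history.

```python
# Hypothetical hand-coded flow for a money-transfer dialog.
# All state keys and prompt wordings are illustrative assumptions.

def next_prompt(state: dict) -> str:
    """Return the next system prompt for the given dialog state."""
    # Business rule: never display account information unless the
    # corresponding user account is verified.
    if not state.get("verified"):
        return "Please say your PIN."
    if state.get("amount") is None:
        return "How much would you like to transfer?"
    # Business rule: always confirm a money transfer with a yes/no question.
    if state.get("confirmed") is None:
        return (f"You want to transfer {state['amount']} dollars. "
                "Is that right? Please say yes or no.")
    if state["confirmed"]:
        return "Your transfer is complete."
    return "OK, cancelled. How much would you like to transfer?"


if __name__ == "__main__":
    print(next_prompt({}))
    print(next_prompt({"verified": True, "amount": 50}))
```

Every branch here is a designer decision. Adding recognition confidence, retry counts, or dialog history to the state would multiply the branches, which is why such details are often left out of manual designs.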
The automatic learning approach uses machine learning and optimization to design the dialog system. Instead of specifying when the system should take a certain action, as in the conventional approach set forth above, the designer specifies a palette of possible actions from which the system selects. For example, in an airline dialog system, the system can say "Where do you want to fly from?", "Where do you want to fly to?", or "OK, you want to fly to Phoenix.", or it can confirm the date or flight class, print a ticket, etc. The optimization procedure is unconstrained regarding the order of, or dependencies between, variables and may take any action at any time. The automatic learning approach interacts with a user simulation and employs reinforcement learning to try out different sequences of actions in order to come up with a dialog plan. This approach still requires a lot of work, but the dialog plan is more robust and detailed. The dialog system is not bounded by what the designer can hold in her head or express in numerous Visio pages. The dialog becomes an optimization problem that a computer can solve with as much detail as desired.
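The automatic learning approach described above can likewise be sketched in a few lines. The example below is a toy, not a description of any deployed system: it assumes a single slot, a three-action palette, a hypothetical user simulation with a 30% recognition error rate, confirmations that are never misrecognized, and tabular Q-learning as the reinforcement learning method. All names, rewards, and parameter values are assumptions made for illustration.

```python
import random

ACTIONS = ["ask", "confirm", "submit"]  # hypothetical action palette
ERROR_RATE = 0.3                        # assumed recognition error rate


def simulate(state, correct, action, rng):
    """Toy user simulation. Returns (next_state, correct, reward, done).

    state:   0 = nothing heard, 1 = heard but unconfirmed, 2 = confirmed.
    correct: hidden flag, True if the recognized value matches the user's goal.
    Every turn costs 1; submitting ends the dialog with +20 or -20.
    """
    if action == "ask":
        return 1, rng.random() > ERROR_RATE, -1, False
    if action == "confirm":
        if state == 1:
            # The user says "yes" to a correct value; otherwise the slot is cleared.
            return (2, True, -1, False) if correct else (0, False, -1, False)
        return state, correct, -1, False
    # "submit" ends the dialog.
    return state, correct, (19 if correct else -21), True


def learn_policy(episodes=20000, alpha=0.02, gamma=0.95, epsilon=0.2, seed=0):
    """Tabular Q-learning against the simulated user."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
    for _ in range(episodes):
        state, correct = 0, False
        for _ in range(20):  # cap the dialog length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, correct, reward, done = simulate(state, correct, action, rng)
            future = 0.0 if done else gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + future - q[(state, action)])
            state = nxt
            if done:
                break
    # Greedy policy: best action for each observable state.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(3)}


if __name__ == "__main__":
    print(learn_policy())
```

Note that nothing in the learner is told to confirm before submitting. In this toy setting, confirming an unconfirmed value emerges from the rewards because a wrong submission is costly. The same freedom means the learner may take any action at any time, with no built-in place to state a rule such as "always confirm."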
However, both of these approaches have shortcomings. As noted above, manually designed systems are not very robust to speech recognition errors. Automatic learning, for its part, does not provide a good way to express business rules because the system can take any of the actions at any time. The automatic learning approach also has difficulty tailoring prompts appropriately. For example, the system knows that there is a certain way of asking "Where are you flying to?" or on what date, but has a hard time knowing how and when to say things like, "Oh, sorry. Where do you want to fly from?" The joining words, phrases, and intonations designed to elicit just the right response from users are difficult to generate in this approach because the system knows only the general form of the question, not how to tailor that question to different situations. Accordingly, what is needed in the art is an improved way to blend the strengths of the conventional and automatic learning approaches while minimizing their shortcomings.