Spoken conversational interfaces for computing devices have historically been hand-scripted: a designer anticipates a specific range of utterances that users might say and maps each anticipated utterance to a specific state or action in a machine. Dialog aimed at clarifying ambiguous input likewise typically must be hand-coded. Any new application requires new hand-scripting, and localizing the functionality to a new language requires new hand-scripting as well.
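The brittleness of such hand-scripting can be illustrated with a minimal sketch. The table of phrases, the action names, and the clarification prompt below are all hypothetical; the point is only that every anticipated utterance must be enumerated in advance, and any unanticipated phrasing falls through to a hand-coded clarification dialog.

```python
# Hypothetical hand-scripted verbal interface: each anticipated utterance
# is mapped directly to a machine action. Phrases and action names are
# illustrative only.
HAND_SCRIPTED_COMMANDS = {
    "make it bigger": "increase_size",
    "zoom in": "increase_size",
    "make it smaller": "decrease_size",
}

def handle_utterance(utterance: str) -> str:
    """Map an exact utterance to an action, or fall back to clarification."""
    action = HAND_SCRIPTED_COMMANDS.get(utterance.strip().lower())
    if action is None:
        # Any phrasing outside the script triggers a hand-coded prompt.
        return "clarify: Sorry, I didn't understand. Try 'make it bigger'."
    return action

print(handle_utterance("Make it bigger"))    # a scripted phrase matches
print(handle_utterance("enlarge the chart")) # an unscripted phrase does not
```

Because matching is exact, a semantically equivalent request such as "enlarge the chart" is not recognized, and supporting a new application or language means rewriting the table by hand.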
Spoken conversational data is complex, with nuances including relative terms, such as “make it bigger”; ambiguous descriptions, such as “that looks good”; and oblique expressions of the user's intent, such as “the labels are too noisy” or “can we make the chart look cleaner?” Currently, there are no general mechanisms for learning grounded natural language descriptions for a verbal interface. The verbal interfaces available are limited to specific domains with particular sets of hand-scripted recognized commands, and a user needs specific knowledge of those commands in order to interact verbally within a given domain. Moreover, users may lack the domain knowledge necessary to express their goals using the terminology associated with that domain.
Existing techniques for constructing verbal interfaces thus present a number of drawbacks that limit how much data can be explored, as well as the quality of that data and its accessibility to users. Accordingly, it is desirable to establish grounding between natural language and machine state changes in order to create rich interaction for verbal interfaces. Moreover, such systems need to recognize a user's intent regardless of exactly how the user expresses that intent in natural language.