Applications that use automatic speech recognition (ASR) require a speech-based user interface to interact with users. Generally, users can perform two types of tasks using spoken user input. The first task type relates to information retrieval (IR) with a query. In this task, the user wishes to retrieve an item, e.g., a document, image, or recording, from a large collection of items stored in a database, e.g., the web of the Internet. The other task type is speech-enabled command and control. Here, the user wishes to perform some operation. Both tasks involve a “narrowing down” of the possibilities of what the user might have said.
In the case of IR, this is often accomplished through dialogs as shown in FIG. 1, where the vertical axis indicates time. In FIG. 1, the user 101 steps are shown on the left, and the system 102 steps on the right. The system has some root state R 120. The user 101 provides spoken input 110, e.g., to search for an item. The spoken input 110 is interpreted 122 as relevant to state set X 124, rather than as relevant to some other state sets Y and Z 123. In response, the system enters a next state X0 125 and perhaps prompts the user.
The user provides additional input 110. For example, in a voice-based destination entry system, the user might first be required to select a country, and then, in a separate step, a city, before being allowed to say a destination street name. The process 124 iterates and continues with the system changing 126 states 128-129, until the interaction is complete and the relevant item 127 is retrieved.
Typically, every system state has a limited, state-specific grammar, vocabulary, and/or language model, and states such as 128-129 are reachable only via a multi-step process involving the traversal of two or more application states in a finite-state machine (FSM).
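The multi-step traversal described above can be illustrated with a minimal sketch of a dialog finite-state machine in which each state accepts only its own fixed vocabulary. The state names, the vocabulary entries, and the helper functions `step` and `run_dialog` are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
# Hypothetical state table: state name -> {accepted utterance: next state}.
# Each state has its own small, fixed vocabulary, as in a conventional
# destination-entry dialog (country, then city, then street).
FSM = {
    "root": {"usa": "country:usa", "germany": "country:germany"},
    "country:usa": {"boston": "city:boston"},
    "country:germany": {"berlin": "city:berlin"},
    "city:boston": {"main street": "item:main street, boston"},
    "city:berlin": {"unter den linden": "item:unter den linden, berlin"},
}


def step(state, utterance):
    """Advance one dialog turn.

    Returns (next_state, accepted). An utterance outside the current
    state's vocabulary is rejected and the state does not change.
    """
    vocabulary = FSM.get(state, {})
    key = utterance.strip().lower()
    if key not in vocabulary:
        return state, False
    return vocabulary[key], True


def run_dialog(utterances, start="root"):
    """Feed a sequence of utterances through the FSM and return the final state."""
    state = start
    for utterance in utterances:
        state, _accepted = step(state, utterance)
    return state
```

In this sketch, the street name can be recognized only after the country and city states have been traversed. Saying "Main Street" at the root state is rejected because that phrase is not in the root vocabulary.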
As shown in FIG. 2, a command-oriented approach often involves “carrier” phrases, wherein command words are present in the same phrases 130 as certain modifier words and/or variables. The system interprets 122 the meaning of the carrier phrase given the modifiers and variables within the phrase 130 and enters state set X 124. If the carrier phrase is relevant to some example state X1 129, the system may either immediately enter that state, or request confirmation 132 from the user before entering that state. Confirmation or cancellation 137 on the user's part 133 could be accomplished using verbal or physical interaction modalities 139. The process 124 can iterate as before.
Other approaches are also common. For example, a variable can be spoken without a command, or a command can initiate a dialog state in which only the variables are available. Search tasks can also be accomplished using carrier words, such as in the phrase “find artist Vanilla Ice.” In each case, however, the vocabularies, phrase grammars and/or language models for each state are fixed.
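As a rough illustration of carrier-phrase interpretation with an optional confirmation step, the following sketch matches a phrase against a fixed list of patterns and extracts the variable. The pattern list, the `set_volume` command, and the functions `interpret` and `handle` are assumptions made for this example only; a real recognizer would use a grammar or language model rather than regular expressions over text.

```python
from __future__ import annotations

import re
from typing import Callable, Optional

# Hypothetical fixed carrier phrases: command words plus one variable slot.
PATTERNS = [
    ("find_artist", re.compile(r"find artist (?P<artist>.+)")),
    ("set_volume", re.compile(r"set volume to (?P<level>\d+)")),
]


def interpret(phrase: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (command, variables) if the phrase fits a carrier pattern, else None."""
    text = phrase.strip().lower()
    for command, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return command, match.groupdict()
    return None


def handle(phrase: str, confirm: Callable[[str, dict[str, str]], bool]) -> str:
    """Interpret a phrase, ask for confirmation, and report the outcome."""
    result = interpret(phrase)
    if result is None:
        return "rejected"  # phrase is outside the fixed set of carrier phrases
    command, variables = result
    if not confirm(command, variables):
        return "cancelled"  # user declined before the state was entered
    return "entered " + command
```

Here the phrase “find artist Vanilla Ice” yields the command `find_artist` with the variable `vanilla ice`, while the bare variable “Vanilla Ice” is rejected because no pattern in this fixed list accepts a variable without its command words.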
In general, the two different interfaces are incompatible with each other. That is, an IR interface cannot process commands, and a control interface cannot process queries.