Computing devices containing multimodal interfaces have been proliferating. A multimodal interface, as used herein, refers to an interface that includes both voice processing and visual presentation capabilities. For example, many cellular telephones include a graphical user interface and are capable of responding to speech commands and other speech input. Other multimodal devices can include personal digital assistants, notebook computers, video telephones, teleconferencing devices, vehicle navigation devices, and the like.
Traditional methods for vocally interacting with multimodal devices typically involve first audibly or textually prompting a user for speech input. Responsive to this prompting, the device receives the requested speech input. Next, an audible or textual confirmation of the speech input can be presented to the user. Such interactions are typically slow because these methods must relay messages serially between the user and the multimodal device. The inefficiency of these methods of prompting and confirmation can result in considerable user frustration and dissatisfaction.
Such interactions, typical of conventional systems, fail to take advantage of the visual displays of multimodal devices, which could provide alternative approaches for prompting or cueing multi-token speech input for speech recognition purposes. Accordingly, there is a need for systems and methods that enable multimodal devices to use the capabilities of their visual interfaces advantageously to provide simple, efficient, and accurate mechanisms for prompting and cueing users to provide multi-token speech input.