1. Technical Field
The present disclosure relates to multi-modal inputs and more specifically to touch gestures to initiate multi-modal speech recognition.
2. Introduction
Prior to multi-modal speech recognition and multi-modal interfaces, users would first select an item on a user interface, then provide verbal commands unimodally. In this way, users could only perform one task at a time, and only in certain orders. Multi-modal speech recognition enhances this workflow by allowing object manipulation and speech recognition to occur in parallel, removing much, but not all, of the unimodality of the input. Multi-modal interfaces in which the user can issue a verbal query while also physically manipulating on-screen objects typically require two physical steps. First, the user initiates the speech recognition session. Second, the user physically manipulates objects while talking. Examples include a user asking for “Restaurants nearby” while touching a listing of a movie theater already on the screen. In such an example, the user would typically touch a listen button, start speaking, and try to quickly touch the movie listing while speaking “Restaurants nearby here.”
In another example, the user asks “What times is this playing?” In this case, “this” is a pronoun referring to the item that was either already selected before the utterance or selected during the utterance. Normally the user would start the recording for speech recognition, then perform a separate gesture of tapping on the item of interest while uttering a verbal query. For example, while picking a movie from a list, the user might say “What times is this playing?” or “What are the reviews for this one?” or “Add this to my plan.” These interactions can be difficult, can take a significant amount of time (especially for repetitive actions), and often require some level of user training, as the interaction steps are not immediately intuitive for users.
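The two-step interaction described above can be sketched in code. The following is a minimal illustrative model, not an implementation from this disclosure: the class name, methods, and pronoun-resolution logic are all assumptions introduced for illustration. It shows a “click to speak” session in which a touch gesture made during the utterance supplies the referent for the deictic pronoun “this.”

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalSession:
    """Hypothetical model of one 'click to speak' interaction:
    the user starts audio capture, then taps an item while
    speaking a query that contains the pronoun 'this'."""
    listening: bool = False
    selected_item: Optional[str] = None

    def press_listen(self) -> None:
        # Step 1: explicit gesture that activates speech capture.
        self.listening = True
        self.selected_item = None

    def tap(self, item: str) -> None:
        # Step 2: a separate touch gesture made while speaking;
        # ignored if the session is not capturing audio.
        if self.listening:
            self.selected_item = item

    def finish_utterance(self, transcript: str) -> str:
        # Resolve 'this' against the item tapped during the utterance.
        self.listening = False
        if self.selected_item and "this" in transcript:
            return transcript.replace("this", f"'{self.selected_item}'")
        return transcript

session = MultimodalSession()
session.press_listen()            # user presses the listen button
session.tap("The Matrix")         # user taps a movie listing mid-utterance
query = session.finish_utterance("What times is this playing?")
# query now reads: What times is 'The Matrix' playing?
```

Note that the sketch makes the usability problem concrete: the tap only counts if it lands inside the capture window opened by the separate listen-button press, which is exactly the multi-step sequencing burden the passage describes.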
Multi-modal gestures that involve combinations of touch/pen and voice require a user action that explicitly activates speech recognition to initiate and control the capture of audio. One alternative is to leave the microphone on (“open mic”), but this is not practical or desirable in mobile devices for reasons such as privacy concerns, battery life, and ambient noise. The problem with current solutions of using a “click to speak” or “click and hold” button (either a soft button or a hardware button) is that the user must take multiple steps to issue a multi-modal command, and this can lead to confusion and errors.