1. Technical Field
The present disclosure relates to human-computer interaction and, more specifically, to incorporating a continuous speech input stream and a continuous gesture input stream.
2. Introduction
Currently deployed multimodal interfaces, such as systems that support user inputs combining speech and gesture, typically involve direct contact with a screen through touch or pen inputs. For example, the Speak4it application on the iPhone enables users to combine speech inputs with hand-drawn gestures to issue commands. Multimodal interfaces can also have applications in other contexts where it is not practical or desirable to touch the screen. These include large-screen displays in the living room, displays in medical applications, and smart office environments. In the living room example, users typically interact with content and applications using an overly complicated remote control and complex onscreen menus navigated using arrow keys.
One set of improvements uses an infrared camera to track the direction in which an infrared remote control is pointing, enabling users to draw and make gestures on the screen at a distance. Similarly, handheld controllers such as the Wii remote can be used to point at and manipulate what is on the screen. Previous work has also explored adding speech to the remote control for media search and related tasks. In each of these approaches, the user has to hold a remote or some other device in order to interact, and often must provide some explicit input, such as a button press, touching a stylus to a display or other pressure-sensitive surface, or uttering a key phrase, to signal to the system to pay attention to and process the input being provided. These approaches are cumbersome and require additional effort (i.e., holding a device and remembering to activate it at the appropriate time) to handle multimodal inputs.