Handwriting, gesturing, drawing, facial expressions, body/facial orientation (e.g., vision) and speech are typical communication modes between human beings. Various computing devices (e.g., a camera/display device for recording and playing a message, a mobile phone for calling another phone, an email system for communicating electronic mail and/or the like) provide a user with different interface systems for interacting with other humans in one or more of these modes. Some interface systems may support multiple communication modes (i.e., may be multimodal), such as a mobile phone capable of sending voice and video data simultaneously.
A common interface system combines a visual modality mechanism (e.g., a display for output and a keyboard and mouse for input) with a voice modality mechanism (e.g., speech recognition for input and a speaker for output), but other modality mechanisms, such as pen-based input, also may be supported. These interface systems restrict the interaction between humans to a limited number of mechanisms for providing an input modality and receiving an output modality. In order to send an email, for example, the user needs to use the voice modality mechanism and/or the visual modality mechanism.
An increasing number of existing computing device interface systems are able to support other communication modes and additional interaction mechanisms. The existing computing device interface systems, for instance, may be implemented in contemporary gaming consoles in order to detect user movements and interpret these movements as game input. For example, the Kinect™ for MICROSOFT Xbox 360® uses video (i.e., camera) and audio (i.e., a voice recorder) technology to sense the user movements without the need for a controller.
While motion detection and multimodal interface systems are well-known concepts in human-computer interaction (HCI), current research and technology have numerous shortcomings. For example, existing interface systems continue to interpret user intent inaccurately with respect to complex input, such as facial expressions, gestures and/or speech.