Multimodal interaction allows users to dynamically select the most appropriate mode of interaction for their current needs. When processing a multimodal application (e.g., browsing a multimodal web site), a user can, depending upon the device he or she is operating, provide input to the application via speech, handwriting, keystrokes, or other input modalities, with output presented via displays, pre-recorded and synthetic speech, audio, and/or tactile mechanisms such as mobile phone vibrators and Braille strips.
FIG. 1 illustrates the basic components of a World Wide Web Consortium (W3C) multimodal interaction framework, which is described more fully in W3C Multimodal Interaction Framework, W3C NOTE 6 May 2003 available at http://www.w3.org/TR/mmi-framework/ and incorporated herein by reference in its entirety. As shown, a user 120 enters input into a multimodal system via an input component 125 using, for example, speech, handwriting, or keystrokes, and observes and/or hears output presented by the system via an output component 127, in the form of speech, text, graphics, audio files or animation.
An interaction manager 130 is responsible for coordinating data and managing execution flow from the various input and output modality components, as well as Application Functions 140, a Session Component 150 and a System and Environment Component 160. The Application Functions 140 provide the multimodal applications. The Session Component 150 provides an interface to the interaction manager 130 to support state management and temporary and persistent sessions for multimodal applications, and the System and Environment Component 160 enables the interaction manager 130 to find out about and respond to changes in device capabilities, user preferences and environment conditions.
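By way of illustration only, the coordinating role of the interaction manager 130 may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation drawn from the W3C framework: the class names, the `handle` method and the toy `echo_app` application function are all hypothetical and chosen purely for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SessionComponent:
    """Hypothetical stand-in for Session Component 150: holds per-session state."""
    state: Dict[str, str] = field(default_factory=dict)


@dataclass
class SystemEnvironmentComponent:
    """Hypothetical stand-in for component 160: reports available output modes."""
    output_modes: List[str] = field(default_factory=lambda: ["text"])


class InteractionManager:
    """Hypothetical stand-in for interaction manager 130."""

    def __init__(self,
                 app: Callable[[str, Dict[str, str]], str],
                 session: SessionComponent,
                 environment: SystemEnvironmentComponent) -> None:
        self.app = app
        self.session = session
        self.environment = environment

    def handle(self, interpreted_input: str) -> Dict[str, str]:
        # Pass the interpreted input and the session state to the application.
        reply = self.app(interpreted_input, self.session.state)
        # Record the input so that later turns can refer back to it.
        self.session.state["last_input"] = interpreted_input
        # Route the reply to every output mode the environment reports.
        return {mode: reply for mode in self.environment.output_modes}


def echo_app(user_input: str, state: Dict[str, str]) -> str:
    """Toy application function: echoes the current and the previous input."""
    previous = state.get("last_input", "nothing")
    return f"You said {user_input!r}; before that, {previous!r}."
```

In this sketch, a change in device capabilities would be modeled simply by updating `output_modes`, after which the next call to `handle` addresses the new set of output modes.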
In general, sub-components of the input component 125 include a recognition component, an interpretation component and an integration component. The recognition component captures the natural input of the user and translates that input into a form useful for later processing. For example, where the input mode is handwriting, the recognition component will convert the user's handwritten symbols and messages into text by using, for example, a handwritten gesture model, a language model and a grammar. The interpretation component further processes the results of the recognition component by identifying the “meaning” or “semantics” intended by the user. Finally, the integration component combines the output from several interpretation components.
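The recognition, interpretation and integration stages described above may be pictured as a simple pipeline. The following is a minimal sketch only: the function names, the dictionary-based input format and the keyword table are assumptions made for illustration, and the recognition step is a stub rather than a real speech or handwriting recognizer.

```python
from typing import Dict, List


def recognize(raw: Dict[str, str]) -> str:
    """Recognition (stubbed): normalize the captured payload into text.

    A real recognizer would apply, e.g., a gesture model, a language model
    and a grammar; here the payload is assumed to be text already.
    """
    return raw["payload"].strip().lower()


def interpret(text: str) -> Dict[str, str]:
    """Interpretation: map recognized text to meaning via a toy keyword table."""
    meaning: Dict[str, str] = {}
    for word in text.split():
        if word in ("zoom", "pan"):
            meaning["action"] = word
        elif word in ("here", "there"):
            meaning["target"] = word
    return meaning


def integrate(interpretations: List[Dict[str, str]]) -> Dict[str, str]:
    """Integration: combine the results of several interpretation components.

    The first value seen for a given key wins; later duplicates are ignored.
    """
    merged: Dict[str, str] = {}
    for interpretation in interpretations:
        for key, value in interpretation.items():
            merged.setdefault(key, value)
    return merged


def process_inputs(raw_inputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Run every raw input through recognition and interpretation, then integrate."""
    return integrate([interpret(recognize(raw)) for raw in raw_inputs])
```

For example, a spoken "Zoom in" combined with a handwritten "there" would, under these assumptions, be integrated into a single meaning having an action of "zoom" and a target of "there".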
Similarly, sub-components of the output component 127 include a generation component, a styling component and a rendering component. The generation component determines which output mode(s) will be used for presenting information to the user. The styling component adds information about how the information is presented or “laid out,” and the rendering component converts the output of the styling component into a format that the user can easily understand. Any one or all of the sub-components of the input and output components 125, 127 may reside on the user's device, or on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or the Internet. The user's device may include any computing device including, for example, a mobile telephone, personal digital assistant (PDA), laptop or mobile personal computer (PC), desktop unit, or workstation.
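The generation, styling and rendering stages of the output component 127 may likewise be sketched as a pipeline. This is an illustrative sketch under stated assumptions: the function names, the two supported modes ("text" and "speech"), the style labels and the rendered string format are all hypothetical and are not taken from the W3C framework.

```python
from typing import Dict, List

# Hypothetical per-mode presentation hints used by the styling step.
DEFAULT_STYLES = {"text": "bold", "speech": "slow"}


def generate(message: str, available_modes: List[str]) -> List[Dict[str, str]]:
    """Generation: choose the output modes to use for this message."""
    return [
        {"mode": mode, "content": message}
        for mode in ("text", "speech")
        if mode in available_modes
    ]


def style(item: Dict[str, str]) -> Dict[str, str]:
    """Styling: attach information about how the content is to be presented."""
    styled = dict(item)
    styled["style"] = DEFAULT_STYLES[item["mode"]]
    return styled


def render(item: Dict[str, str]) -> str:
    """Rendering: convert the styled item into its final, presentable form."""
    return f"{item['mode'].upper()}({item['style']}): {item['content']}"


def present(message: str, available_modes: List[str]) -> List[str]:
    """Run a message through generation, styling and rendering."""
    return [render(style(item)) for item in generate(message, available_modes)]
```

Because each stage only consumes the output of the previous one, any of the three functions could in principle run on the user's device or on a network, consistent with the distribution of sub-components noted above.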
What if a user has more than one device, each having limited, but different, modality capabilities? For instance, he or she may have a PDA that acts as a visual browser able to receive input in the form of handwriting or keystrokes, and to produce output in the form of text or graphics, but that is not capable of receiving speech input or producing speech output. The user may also have a separate device, such as a mobile telephone or an electronic amulet worn around the neck, that is capable of processing an application using speech inputs and outputs. It would be desirable for the user to be able to use both (or all) of his or her devices when processing a multimodal application, in order to have composite capabilities exceeding the capabilities of any one of the devices used on its own. A need therefore exists for a means for enabling a user to process a multimodal application using multiple devices having varying capabilities.
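The notion of composite capabilities may be illustrated by treating each device as a set of supported modalities. The following sketch merely illustrates the stated need and is not a description of any particular solution: the device names, the capability labels and both helper functions are hypothetical.

```python
from typing import Dict, Optional, Set


def composite_capabilities(devices: Dict[str, Set[str]]) -> Set[str]:
    """Return the union of the modality capabilities of all devices."""
    return set().union(*devices.values())


def device_for(modality: str, devices: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first device supporting the modality, or None if none does."""
    for name, capabilities in devices.items():
        if modality in capabilities:
            return name
    return None


# Hypothetical devices mirroring the PDA and mobile telephone example above.
USER_DEVICES: Dict[str, Set[str]] = {
    "pda": {"handwriting_in", "keystroke_in", "text_out", "graphics_out"},
    "phone": {"speech_in", "speech_out"},
}
```

Under these assumptions, neither device alone supports both "text_out" and "speech_out", whereas the composite set contains both, and `device_for` indicates which device would handle each modality.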