A visually oriented display device combined with a keyboard (or keypad) is a familiar means of user input and output on computers and other devices. To this visual modality, a voice modality can be added, for example speech recognition for input and audio playback for output. The result is a multimodal user interface, which gives the user different modes for interacting with the user's device.
Providing multimodal applications on a mobile communication device such as a cellular telephone can be particularly compelling. For example, in some (but not all) instances it is more difficult to operate the user interface of a mobile communication device by hand than it is to speak to the device.
As an example, consider a user making flight reservations. Using a multimodal application, the user can speak into a communication device to make the reservation and can receive both a voice response and a graphical confirmation of the reservation. In another example, a user requests a map and is provided with the map in response. Another example application is a voice request for a stock quote, where a graph is provided in return. This level of interaction is more complex than merely typing in a request and receiving an e-mail confirmation, or speaking a request to an automated attendant and simply receiving a voice response.
The devices for which multimodal applications can be particularly desirable are mobile and therefore tend to have limited processing power and storage capacity. A thin-client distributed multimodal architecture may be employed on such a resource-constrained device (e.g., a mobile cellular telephone) that has audio capture/playback and visual input/output, where the device interfaces with a network-based speech recognition server, typically a standard VoiceXML (Voice Extensible Markup Language) platform. Speech recognition is the process of converting spoken words to computer-intelligible information.
There are, unfortunately, no agreed-upon standards for multimodal applications. Hence, multimodal applications tend to use proprietary protocols to communicate between the device and the speech recognition server. In addition, there is currently no standard protocol for synchronizing voice and visual presentation over the network. Implementations to date have introduced new custom protocols that involve non-standard modifications to the VoiceXML platform. Not only do non-standard protocols introduce interoperability issues, they also raise security issues, because network operators may need to open their firewalls to allow the new protocol to pass through.