As devices become smaller, modes of interaction other than keyboard and stylus are a necessity. In particular, small handheld devices like cell phones and PDAs serve many functions and contain sufficient processing power to handle a variety of tasks. Present and future devices will greatly benefit from the use of multimodal access methods.
Multichannel access is the ability to access enterprise data and applications through multiple methods or channels, such as a phone, laptop or PDA. For example, a user may access his or her bank account balances on the Web using an Internet browser when in the office or at home, and may access the same information over a dumb phone using voice recognition and text-to-speech when on the road.
By contrast, multimodal access is the ability to combine multiple modes or channels in the same interaction or session. The methods of input include speech recognition, keyboard, touch screen, and stylus. Depending on the situation and the device, a combination of input modes will make using a small device easier. For example, in a Web browser on a PDA, you can select items by tapping or by providing spoken input. Similarly, you can use voice or stylus to enter information into a field. With multimodal technology, information on the device can be both displayed and spoken.
Multimodal applications using XHTML+Voice offer a natural migration path from today's VoiceXML-based voice applications and XHTML-based visual applications to a single application that can serve both of these environments as well as multimodal ones. A multimodal application integrates voice interface and graphical user interface interaction by setting up two channels, one for the graphical user interface and another for the voice. At the time of writing, the current specification is the XHTML+Voice (X+V) Profile 1.2, published at www.voicexml.org on 16 Mar. 2004.
In a known implementation of a multimodal browser with remote voice processing, a voice channel is set up between the client and the voice server and allocated to carry the voice data for the duration of the voice interaction within an X+V session. The voice channel is disconnected after the voice interaction, and the X+V session continues. For each separate interaction within the X+V session, a new voice channel must be set up, since this avoids consuming costly voice resources on the server when the X+V session is idle.
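The per-interaction channel lifecycle described above can be sketched as follows. This is a minimal illustration only: the names `VoiceChannel`, `XVSession`, `voice_interaction` and `run_demo` are hypothetical and do not come from any X+V, SIP or RTP library, and the channel is a stand-in that records state rather than performing real signalling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceChannel:
    """Hypothetical stand-in for a voice channel between client and voice server."""
    is_open: bool = False

    def open(self) -> None:
        # A real implementation would negotiate the channel here (assumption).
        self.is_open = True

    def close(self) -> None:
        # Releasing the channel frees the voice resources on the server.
        self.is_open = False


@dataclass
class XVSession:
    """Hypothetical X+V session that allocates one voice channel per interaction."""
    channels_opened: int = 0
    log: List[str] = field(default_factory=list)

    def voice_interaction(self, utterance: str) -> None:
        channel = VoiceChannel()
        channel.open()
        self.channels_opened += 1
        self.log.append("open")
        try:
            # The voice data is carried only while the channel is open.
            self.log.append(f"voice:{utterance}")
        finally:
            # The channel is disconnected; the session itself continues.
            channel.close()
            self.log.append("close")


def run_demo() -> XVSession:
    session = XVSession()
    for utterance in ("check balance", "transfer funds"):
        session.voice_interaction(utterance)
    return session
```

Calling `run_demo()` performs two voice interactions within one session, and each one opens and closes its own channel, so no voice resource is held while the session is idle.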
Setting up and closing down a voice channel for each voice interaction has the disadvantage of increasing the response time of every voice interaction, because of the time taken to open and close voice channels using present protocols (SIP and RTP). The added latency is a direct function of the network bandwidth available between the device and the server. This causes problems on low-bandwidth networks, such as slow Internet connections and slow wireless networks. For instance, the network bandwidth on pre-3G wireless networks is limited.
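The dependence of the added latency on bandwidth can be illustrated with a rough model. The sketch below assumes that the overhead of one interaction is simply the time needed to transfer the channel set-up and tear-down signalling at the available bandwidth; it ignores round-trip delay and server processing. The function names and all figures are hypothetical and are not measurements.

```python
def channel_overhead_seconds(signalling_bytes: int, bandwidth_bps: float) -> float:
    """Time to transfer the set-up and tear-down signalling for one interaction.

    Assumed model: overhead = signalling size in bits / bandwidth in bits per second.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth_bps must be positive")
    return signalling_bytes * 8 / bandwidth_bps


def session_overhead_seconds(interactions: int,
                             signalling_bytes: int,
                             bandwidth_bps: float) -> float:
    """Total overhead when every interaction opens and closes its own channel."""
    return interactions * channel_overhead_seconds(signalling_bytes, bandwidth_bps)


# Illustrative figures only (assumptions): 4000 bytes of signalling per interaction.
slow_link = channel_overhead_seconds(4000, 32_000)      # low-bandwidth link
fast_link = channel_overhead_seconds(4000, 1_000_000)   # higher-bandwidth link
```

Under this model the overhead per interaction scales inversely with bandwidth, and because it is paid on every interaction, it accumulates linearly over the length of the session.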