1. Field of the Invention
The present invention relates to wireless communications devices and, more particularly, to optimizing delivery of multimodal content.
2. General Background
People increasingly rely on the World Wide Web to obtain information. In addition to laptop and desktop computers, many wireless devices, such as wireless telephones and PDAs, can now be used to access the World Wide Web. In general, such wireless devices are able to act as wireless client devices in sessions with application servers. During such sessions, the wireless client devices receive, over an air interface, content formatted for a given presentation mode. Such content may include voice, text, or graphic information. Such wireless client devices also transmit information to servers during interactive sessions, and this information may originate as voice or non-voice (graffiti, touch input, or keypad input, for example) input from users. Content that contains information in more than one format, such as audio and graphics, may be referred to as multimodal content.
Presentation mode refers to the way a user interface of the wireless device presents the multimodal content to the user. For example, a wireless device may have a browser function that allows content to be presented in a screen-based presentation mode, e.g., displayed one screen at a time. Content that can be provided to browsers built into small devices (that is, mini or micro browsers) is often written in a special markup language, such as the Wireless Markup Language (WML), Handheld Device Markup Language (HDML), or eXtensible HyperText Markup Language (XHTML). These markup languages facilitate interaction on the smaller screens and with the specialized browsers that handheld wireless devices typically use.
Presentation modes other than screen-based visual modes are also possible. For example, serving nodes can receive content written in a voice-based markup language, such as Voice Extensible Markup Language (VoiceXML) or Speech Application Language Tags (SALT). This content can then be interpreted and processed for presentation to users as voice-based information. Similarly, users of wireless client devices can input information or make selections in various modes, such as voice (e.g., speaking commands or data) or touch (e.g., tapping a screen, typing letters and numbers).
Some wireless devices and systems are multimodal, meaning they are able to present content and receive user input in more than one mode. Wireless systems can support both sequential and simultaneous multimodality. Sequential multimodality permits seamless switching between visual and voice modes. Simultaneous multimodality permits visual and voice modes to be active at the same time.
User experience is enhanced when content can be delivered quickly to handheld devices, but fast response time can be difficult to achieve with large applications. For an acceptable user experience, delivery of voice content should be nearly instantaneous. Streaming of audio content requires an infrastructure to support streaming and has its own drawbacks. Regardless of the audio content, a multimodal document will contain code, as part of the document, that links to audio files and grammar files. If the grammar files have to be downloaded every time the document is presented, there will be latency due to bandwidth restrictions and the processing of these requests. Caching content can dramatically improve delivery speed for large applications and web objects, but caching any and all content, in all possible presentation and input modes (modalities) that a person might use, can be impractical due to the large amount of memory that would be required.