The present invention relates to wireless communications devices and, more particularly, to methods and apparatus for determining and selecting a medium or mode by which a user interacts with an application using a wireless communication device, depending on the available bandwidth over the air interface.
People increasingly rely on wireless communication devices to access applications executing on an application server on the World Wide Web. Examples of such applications include voice command applications, in which a user interacts with an application via speech input (for example, weather reports, stock quotes, and voice activated dialing), business-related applications (such as package delivery, trucking, and vending applications), and streaming media applications. Examples of wireless devices used in these types of applications include laptop computers, wireless telephones, personal digital assistants (“PDAs”), and special purpose wireless communications devices used for specific business applications.
In general, such wireless devices are able to act as wireless client devices in sessions with application servers. During such sessions, the wireless client devices receive, over an air interface, information or media content formatted for a given presentation mode. Such content or information may include voice (or other audio), text or video information. Such wireless client devices also transmit information to servers during interactive sessions, and this information may originate as non-voice (e.g., graffiti, touch, or keypad) input from users, or as speech input. Interaction involving content or information in more than one type of medium, such as audio and graphics, may be referred to as multimodal interaction. As just one example, multimodal content delivered to a wireless device may comprise a combination of Voice Extensible Markup Language (VoiceXML) and Wireless Markup Language (WML) content.
The term “interaction medium”, as used herein, refers to the presentation mode or modes that define the way the user interface of a wireless device presents content or information to the user, and the mode(s) or way(s) in which the user interacts with the application. The term “interaction medium” is thus an overarching term used to define how the user interacts with the application both in terms of delivery of content from the application server, plus the medium or mode(s) in which the user provides input or interacts with the application (e.g., through text, voice, or other mode); the format of the content from the application server is a subset of the overall interaction medium/media. For example, a wireless device may have a browser function to allow content to be presented in a screen-based presentation mode, e.g., to be visually displayed on a screen, one screen at a time, in addition to audio content that may be coordinated with the display. The user may interact with the application through speech. However, the user may be in an environment where providing speech input is difficult. The user may wish to change the user input medium from speech to text, and not change the format for content delivered to the wireless device.
Presentation modes other than screen-based visual modes are also possible. For example, serving nodes can receive content written in a voice-based markup language, such as VoiceXML or Speech Application Language Tags (SALT). This content can then be interpreted and processed for presentation to users as voice-based information. Similarly, users of wireless client devices can input information or make selections in various modes, such as voice command mode (e.g., speaking commands or data) or touch (e.g., tapping a screen, typing letters and numbers).
A multi-modal interaction medium is a cohesive way of presenting information through auditory and visual channels of presentation and interface layers. Multimedia can be considered as a subset of multimodality, especially when visual elements are combined with auditory elements for presentation. Multimedia content may be broadly classified into video content (which contains visual and auditory elements) and audio (pure audio) content. Video content can be further classified into five general types, each of which requires a different amount of bandwidth for an acceptable and optimal user experience. These types, as defined herein, are the following:
Talking Head Example: A video email clip with a daughter sending a message to her mother.
Animation Example: A clip of the Simpsons animated TV show.
Low Live Action Example: A clip of Tiger Woods bouncing a golf ball on a golf club. This type of video has some still imagery from frame to frame (e.g., background) and some moving imagery (e.g., the path of the ball). Generally, in low live action, the still imagery is prevalent and thus can be encoded with fewer bits.
Medium Live Action Example: A clip of a bear and fisherman fighting over a salmon. This type of video has a higher preponderance of imagery that changes from frame to frame and thus generally requires higher frame rates and more bandwidth to render it in a form acceptable to the user.
High Live Action Examples: Michael Jordan dunking to win the slam-dunk competition or a music video clip. This type of video has the highest level of action and thus requires the highest amount of bandwidth.
In certain applications where the user is also provided with the opportunity to interact with the application via speech, the different categories of multimedia can be combined with speech recognition to provide command and control of the interface through a combination of the voice user interface and a graphical user interface on the wireless device. The user may interact with the application and information may be presented to the user in several alternative interaction media, as listed below:
    1) Text only: Information presented as text, e.g., a news article on the UN.
    2) Text and pictures: Information may be embellished with pictures, e.g., stock updates with a line graph.
    3) Audio only: Information presented in an auditory format, e.g., a newsroom or news channel accessed from a voice command platform.
    4) Audio and pictures: e.g., J2ME media players with a picture and audio associated with that picture or slide.
    5) Audio and video: e.g., media content streamed from a media gateway, such as ESPN Sports Center streamed over the Internet. Furthermore, the video could be presented in any of the five types of video content listed above (talking head, animation, low live action, medium live action, and high live action). It will be understood that the five types of video content described above are non-standard definitions, and that other types of video content could be used depending on technology constraints, such as frame rate or encoding technique.
    6) Audio and text: e.g., a song plus text listing the artist and song title.
Since multimodal applications can be provided to the user in these different formats or interaction media, and since each format has its own bandwidth requirements for acceptable user experience, it becomes imperative that there is enough bandwidth to provide all the different types of format in a cohesive manner. Moreover, a significant portion of the bandwidth can be used simply in having a speech channel open to facilitate speech recognition during the execution of the application. Problems can arise if the available bandwidth is less than that required for the optimal user experience with the current interaction medium. One reason for reduction or change in available bandwidth is the user moving through the wireless network and the fact that signal reception can change based on the location of the wireless device.
One way to reduce the bandwidth associated with speech recognition would be to use Distributed Speech Recognition (DSR). In DSR, feature extraction is performed at the mobile terminal, and the computationally intensive recognition is performed at a remote server, with negligible loss in performance. DSR reduces the bandwidth requirements for speech recognition and increases the performance of the speech application by improving accuracy. The first DSR standard was published by ETSI in February 2000 and an advanced standard is now being developed. The bandwidth requirement as prescribed by the standard is around 4.8 kbit/s. However, the reduction in bandwidth usage using DSR may not be enough to maintain an optimal user experience when the bandwidth over the air interface is reduced.
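As a rough illustration of the DSR figure quoted above, the following Python sketch works out how much data a 4.8 kbit/s feature stream carries over a given duration, and how much of a link remains for other content while the speech channel is open. The function names and the 20 kbit/s link rate used in the example are hypothetical and chosen only for illustration; they do not come from the DSR standard or from this description.

```python
# Illustrative arithmetic only. The 4.8 kbit/s rate is the figure quoted in
# the text; every other name and number here is an assumption.
DSR_BIT_RATE_BPS = 4800  # 4.8 kbit/s expressed in bits per second


def dsr_payload_bytes(seconds, rate_bps=DSR_BIT_RATE_BPS):
    """Bytes sent for `seconds` of speech at a constant `rate_bps`."""
    return rate_bps * seconds / 8


def remaining_bps(available_bps, rate_bps=DSR_BIT_RATE_BPS):
    """Bandwidth left for other content while the speech channel is open."""
    return max(0, available_bps - rate_bps)


if __name__ == "__main__":
    # Ten seconds of speech at 4.8 kbit/s is 48,000 bits, i.e. 6,000 bytes.
    print(dsr_payload_bytes(10))
    # On a hypothetical 20 kbit/s link, 15,200 bit/s remain for other media.
    print(remaining_bps(20000))
```

If the available rate falls below the speech channel rate, `remaining_bps` returns zero, which corresponds to the situation described above in which DSR alone is not sufficient.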
Further background information is discussed in the article of Mohan et al., Adapting Multimedia Internet Content for Universal Access, IEEE Transactions on Multimedia, Vol. 1, No. 1, (1999), and in the article of Andersen et al., System Support for Bandwidth Management and Content Adaptation for Internet Applications, Proceedings of the Fourth Symposium on Operating Systems Design and Implementation, OSDI 2000, Oct. 23–25, 2000, the content of both of which is incorporated by reference herein.
The present invention provides for bandwidth-based changing or selection of an appropriate interaction medium by which a user interacts with an application using a wireless device. Unlike in prior art methods, the interaction medium includes the interaction format or mode by which the user provides input to the application, e.g., through voice, text or other method. The changing of the interaction medium can be performed by the application itself, by a network entity other than the entity executing the application, by a combination of the network entity and the application, or by the user, in various embodiments of the invention. Moreover, the selection of interaction medium is applicable beyond situations in which media is merely streamed to the wireless device. In particular, the present invention is highly suitable for use in situations in which there is a high degree of user input to the application (e.g., through voice or text), such as in a generic speech-based application or a speech/text and/or graffiti based multimodal application.
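The bandwidth-based selection summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method of any particular embodiment: the `InteractionMedium` type, the `select_medium` function, the candidate list, and every bandwidth threshold are hypothetical placeholders. The point it illustrates is that each candidate carries both an output format and a user input mode, so a drop in bandwidth can change how the user provides input as well as how content is delivered.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class InteractionMedium:
    name: str
    output_modes: Tuple[str, ...]  # how content is presented to the user
    input_mode: str                # how the user provides input
    required_kbps: float           # placeholder value, NOT from the source


# Hypothetical candidates; the thresholds are illustrative assumptions only.
CANDIDATES = (
    InteractionMedium("audio and video, speech input", ("audio", "video"), "speech", 128.0),
    InteractionMedium("audio and pictures, speech input", ("audio", "pictures"), "speech", 48.0),
    InteractionMedium("audio only, speech input", ("audio",), "speech", 24.0),
    InteractionMedium("text and pictures, text input", ("text", "pictures"), "text", 16.0),
    InteractionMedium("text only, text input", ("text",), "text", 4.0),
)


def select_medium(
    available_kbps: float,
    candidates: Sequence[InteractionMedium] = CANDIDATES,
) -> Optional[InteractionMedium]:
    """Return the most demanding medium that fits the available bandwidth.

    Returns None when no candidate fits.
    """
    ranked = sorted(candidates, key=lambda m: m.required_kbps, reverse=True)
    for medium in ranked:
        if medium.required_kbps <= available_kbps:
            return medium
    return None


if __name__ == "__main__":
    for kbps in (200.0, 30.0, 10.0, 1.0):
        chosen = select_medium(kbps)
        print(kbps, "->", chosen.name if chosen else "no medium fits")
```

In this sketch the decision is made by a single function; as noted above, the same decision could equally be made by the application, by a separate network entity, by both together, or by the user.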