1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for speech-enabled content navigation and control of a distributed multimodal browser.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through multimodal access, that is, by interaction in non-voice modes as well as voice mode. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications are often formed by sets of markup documents served up by web servers for display on multimodal browsers. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output, where modes of the multimodal input and output include at least a speech mode. A multimodal browser typically includes a user agent for each mode of user interaction provided by the multimodal browser. Each user agent provides the functionality for interacting with a user in a particular modality. For example, a graphical user agent of a multimodal browser may provide the functionality for interacting with a user through a graphical user interface (‘GUI’) by processing user input through GUI elements and displaying output on the GUI. A voice user agent of a multimodal browser may provide the functionality for interacting with a user through a voice user interface by recognizing speech input and synthesizing speech output. Because the visual mode of user interaction has historically been the dominant mode of user interaction, the graphical user agent of a multimodal browser typically coordinates the user interaction among all the user agents to provide a multimodal experience to a user.
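For further explanation, the arrangement of user agents described above may be sketched as follows. This is a minimal illustration only, not an implementation of any particular browser: the class names, the 'render' method, and the bracketed output strings are all hypothetical, and the 'coordinator' attribute merely reflects the statement that the graphical user agent typically coordinates the interaction.

```python
from typing import Dict, List, Optional


class UserAgent:
    """Hypothetical base class: one user agent per mode of interaction."""

    mode = "abstract"

    def render(self, content: str) -> str:
        raise NotImplementedError


class GraphicalUserAgent(UserAgent):
    mode = "visual"

    def render(self, content: str) -> str:
        # Stand-in for displaying output on a GUI.
        return f"[display] {content}"


class VoiceUserAgent(UserAgent):
    mode = "voice"

    def render(self, content: str) -> str:
        # Stand-in for synthesizing speech output.
        return f"[speak] {content}"


class MultimodalBrowser:
    """Holds one user agent per mode of user interaction."""

    def __init__(self, agents: List[UserAgent]):
        self.agents: Dict[str, UserAgent] = {a.mode: a for a in agents}
        # The modes of a multimodal browser include at least a speech mode.
        if "voice" not in self.agents:
            raise ValueError("a multimodal browser needs a voice user agent")
        # Assumption: the graphical user agent, when present, coordinates.
        self.coordinator: Optional[UserAgent] = self.agents.get("visual")

    def present(self, content: str) -> List[str]:
        # Each user agent handles the same content in its own modality.
        return [self.agents[m].render(content) for m in sorted(self.agents)]


browser = MultimodalBrowser([GraphicalUserAgent(), VoiceUserAgent()])
outputs = browser.present("Hello")
```

In this sketch each mode of interaction is an independent object, which is what later allows the user agents to be placed on the same computer or on different computers.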
Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V provides a markup language that enables users to interact with a multimodal application often running on a server through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. Visual markup tells a multimodal browser what the user interface is to look like and how it is to behave when the user types, points, or clicks. Similarly, voice markup tells a multimodal browser what to do when the user speaks to it. The multimodal browser processes visual markup with a graphical user agent and processes voice markup with a voice user agent. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. XHTML includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
In addition to X+V, multimodal applications also may be implemented with Speech Application Language Tags (‘SALT’). SALT is a markup language developed by the SALT Forum. Both X+V and SALT are markup languages for creating applications that use voice input/speech recognition and voice output/speech synthesis. Both SALT applications and X+V applications use underlying speech recognition and synthesis technologies or ‘speech engines’ to do the work of recognizing and generating human speech. As markup languages, both X+V and SALT provide markup-based programming environments for using speech engines in an application's user interface. Both languages have language elements, markup tags, that specify what the speech-recognition engine should listen for and what the synthesis engine should ‘say.’ Whereas X+V combines XHTML, VoiceXML, and the XML Events standard to create multimodal applications, SALT does not provide a standard visual markup language or eventing model. Rather, it is a low-level set of tags for specifying voice interaction that can be embedded into other environments. In addition to X+V and SALT, multimodal applications may be implemented in Java with a Java speech framework, in C++, for example, and with other technologies and in other environments as well.
Multimodal browsers may be categorized generally as local browsers or distributed browsers. A local multimodal browser is a multimodal browser for which all user agents operate on the same computer. For example, in a local multimodal browser having a graphical user agent and a voice user agent, the functionality for processing both visual markup and voice markup of a multimodal application is provided by the same computing device. A distributed multimodal browser is a multimodal browser for which the user agents operate on at least two computers. For example, in a distributed multimodal browser having a graphical user agent and a voice user agent, the functionality for processing both visual markup and voice markup of a multimodal application is provided by two separate computing devices. Distributed multimodal browsers are often utilized on small multimodal devices because such devices typically do not have the computer resources needed to run both a graphical user agent and a voice user agent simultaneously.
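For further explanation, the distinction between local and distributed multimodal browsers may be sketched as a simple test on where the user agents operate. The function name 'classify_browser' and the host labels 'handset' and 'voice-server' are hypothetical and serve only to illustrate the definitions given above.

```python
from typing import Dict


def classify_browser(agent_hosts: Dict[str, str]) -> str:
    """Classify a browser from a mapping of user agent name to host.

    Returns 'local' when every user agent operates on the same computer
    and 'distributed' when the user agents operate on at least two.
    """
    if not agent_hosts:
        raise ValueError("at least one user agent is required")
    distinct_hosts = set(agent_hosts.values())
    return "local" if len(distinct_hosts) == 1 else "distributed"


# Both user agents on one device.
local_kind = classify_browser({"graphical": "handset", "voice": "handset"})

# Voice user agent moved off the small device to another computer.
distributed_kind = classify_browser(
    {"graphical": "handset", "voice": "voice-server"}
)
```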
As mentioned above, a multimodal browser typically provides speech-enabled user interaction. Such speech enablement is typically organized into two categories. The first category is speech-enabling the content of a multimodal application. Speech-enabling the content of a multimodal application may include, for example, synthesizing the text of an X+V page for playback through a speaker of a multimodal device. The second category is speech-enabling content navigation and control of a multimodal browser. Speech-enabling content navigation and control of a multimodal browser may include, for example, allowing a user to navigate the links of an X+V page using voice commands. Speech-enabling content navigation and control of a multimodal browser may also include, for example, allowing a user to open a new window or tab in a multimodal browser using voice commands.
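For further explanation, the second category of speech enablement may be sketched as a mapping from an already-recognized utterance to either a link of the current page or a browser control action. The sketch assumes that speech recognition has already produced text; the names 'BROWSER_COMMANDS' and 'interpret', the command phrases, and the example URLs are hypothetical and are not drawn from any particular browser or protocol.

```python
from typing import Dict, Optional, Tuple

# Hypothetical control phrases and the browser actions they trigger.
BROWSER_COMMANDS: Dict[str, str] = {
    "new window": "open_window",
    "new tab": "open_tab",
}


def interpret(utterance: str, links: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Map recognized text to a ('control', action) or ('navigate', url) pair.

    'links' maps the visible text of each link on the page to its target.
    Returns None when the utterance matches neither a command nor a link.
    """
    # Normalize case and whitespace before matching.
    phrase = " ".join(utterance.lower().split())
    if phrase in BROWSER_COMMANDS:
        return ("control", BROWSER_COMMANDS[phrase])
    for text, url in links.items():
        if phrase == text.lower():
            return ("navigate", url)
    return None


page_links = {"Weather": "http://example.com/weather",
              "Sports": "http://example.com/sports"}
nav = interpret("weather", page_links)
ctl = interpret("New  Tab", page_links)
unknown = interpret("play music", page_links)
```

Note that the function needs both the links of the page and the control commands of the browser in order to interpret the utterance.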
Local multimodal browsers are routinely able to perform both categories of speech-enablement because the voice user agent is available locally to provide voice services such as speech recognition and speech synthesis to the graphical user agent. The ability to perform both categories of speech-enablement is not, however, provided by current distributed multimodal browsers. Distributed multimodal browsers may speech-enable the content of a multimodal application using standard protocols developed for the operation of a distributed multimodal browser across a network such as, for example, the protocols specified by the Open Mobile Alliance and by the Internet Engineering Task Force. Distributed multimodal browsers, however, typically cannot perform speech-enabled content navigation and control because the information needed to speech-enable the interface provided by a graphical user agent is not known, a priori, by a voice user agent. Because current protocols and distributed multimodal browsers do not address this aspect of speech-enablement, readers will appreciate that room for improvement exists to speech-enable content navigation and control of a distributed multimodal browser.