1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for establishing a multimodal personality for a multimodal application.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through other modes, such as multimodal access. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications often run on servers that serve up multimodal web pages for display on a multimodal browser. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output. Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. Visual markup tells a multimodal browser what the user interface is to look like and how the user interface is to behave when the user types, points, or clicks. Similarly, voice markup tells a multimodal browser what to do when the user speaks to it. For visual markup, the multimodal browser uses a graphics engine; for voice markup, the multimodal browser uses a speech engine. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. XHTML includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
In addition to X+V, multimodal applications also may be implemented with Speech Application Language Tags (‘SALT’). SALT is a markup language developed by the SALT Forum. Both X+V and SALT are markup languages for creating applications that use voice input/speech recognition and voice output/speech synthesis. Both SALT applications and X+V applications use underlying speech recognition and synthesis technologies or ‘speech engines’ to do the work of recognizing and generating human speech. As markup languages, both X+V and SALT provide markup-based programming environments for using speech engines in an application's user interface. Both languages have language elements, markup tags, that specify what the speech-recognition engine should listen for and what the synthesis engine should ‘say.’ Whereas X+V combines XHTML, VoiceXML, and the XML Events standard to create multimodal applications, SALT does not provide a standard visual markup language or eventing model. Rather, it is a low-level set of tags for specifying voice interaction that can be embedded into other environments. In addition to X+V and SALT, multimodal applications may be implemented, for example, in Java with a Java speech framework, in C++, and with other technologies and in other environments as well.
Current lightweight voice solutions require a developer to build a grammar and lexicon that limit the number of words an automated speech recognition (‘ASR’) engine must recognize, as a means of increasing accuracy. Pervasive devices have limited interaction and input modalities due to the form factor of the device, and kiosk devices have limited interaction and input modalities by design. In both cases, speaker-independent voice recognition is used to enhance the user experience and interaction with the device. The state of the art in speaker-independent recognition allows for some sophisticated voice applications to be written as long as there is a limited vocabulary associated with each potential voice command. For example, if the user is prompted to speak the name of a city, the system can, with a decent level of confidence, recognize the name of the city spoken. In the case where there is no explicit context, such as a blank text field for inputting any search query, this speaker-independent recognition fails because a reasonably sized vocabulary is not available.
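The effect of a limited vocabulary can be illustrated with a minimal sketch. The sketch is not an ASR engine: it assumes the engine has already produced a noisy text hypothesis and substitutes simple string similarity (Python's standard difflib module) for acoustic and language-model scoring. The names recognize and CITY_GRAMMAR, the cutoff value, and the sample city list are hypothetical and are introduced only for illustration.

```python
import difflib

# Hypothetical grammar: the small set of words the application expects
# in response to a prompt such as "Please say the name of a city."
CITY_GRAMMAR = ["boston", "austin", "houston", "dallas"]


def recognize(hypothesis, vocabulary, cutoff=0.6):
    """Map a noisy hypothesis onto the closest vocabulary entry.

    String similarity stands in for real recognition scoring (an
    assumption of this sketch). Returns None when no entry scores at
    or above the cutoff.
    """
    matches = difflib.get_close_matches(
        hypothesis.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# A slightly garbled city name still resolves within the small grammar.
city = recognize("bostun", CITY_GRAMMAR)

# A free-form query has no counterpart in the grammar and is rejected.
query = recognize("weather tomorrow", CITY_GRAMMAR)
```

In this sketch the garbled city name resolves to an entry in the grammar, while the free-form query yields no match, which loosely mirrors the blank-text-field case described above, where no reasonably sized vocabulary is available to constrain recognition.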
Incorporating speech into a multimodal application, however, naturally leads users to expect, or at least wish, that the multimodal application would have some personality. Personality is characterized by dynamism, yet in the current state of the art the user interface, page after page, voice after voice, is static. Despite providing additional modes for user interaction, web applications today do not dynamically adjust to meet the user's rising expectation of speed and interaction quality.