1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for synchronizing visual and speech events in a multimodal application.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through other modes, such as multimodal access. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications often run on servers that serve up multimodal web pages for display on a multimodal browser. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users through multimodal output. Multimodal browsers typically render web pages written in XHTML+Voice (X+V). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. X+V includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
The top-level VoiceXML element is <vxml>, which is a container for dialogs. There are two kinds of dialogs: forms and menus. Voice forms define an interaction that collects values for a set of form item variables. Each form item variable of a voice form may specify a grammar that defines the allowable inputs for that form item. If a form-level grammar is present, it can be used to fill several form items from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
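For further explanation, the following is a minimal sketch of how a form-level grammar can fill several form items from one utterance. It is not VoiceXML and does not use a real speech grammar format: the grammar is modeled, as an assumption made only for illustration, as a mapping from form item names to sets of allowable words, and the function name fill_from_utterance is hypothetical.

```python
from typing import Dict, Set


def fill_from_utterance(grammar: Dict[str, Set[str]], utterance: str) -> Dict[str, str]:
    """Fill every form item whose allowable inputs appear in the utterance.

    `grammar` is a toy stand-in for a form-level grammar (an assumption):
    it maps each form item name to the set of words that item accepts.
    """
    words = utterance.lower().split()
    filled: Dict[str, str] = {}
    for item_name, allowed in grammar.items():
        # Take the first word of the utterance that this form item accepts.
        for word in words:
            if word in allowed:
                filled[item_name] = word
                break
    return filled


# Hypothetical form with two form items, "city" and "day".
travel_grammar = {
    "city": {"boston", "denver"},
    "day": {"monday", "tuesday"},
}
result = fill_from_utterance(travel_grammar, "Fly to Boston on Monday")
```

In this sketch a single utterance supplies values for both form items, whereas a grammar scoped to one form item would fill only that item.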
Forms are interpreted by a form interpretation algorithm (FIA). An FIA typically includes a main loop that repeatedly selects form items, collects user input, and identifies any actions to be taken in response to input items. Interpreting a voice form item typically includes selecting and playing one or more voice prompts; collecting user input, either a response that fills in one or more input items or the throwing of some event (a help event, for example); and interpreting any actions that pertain to the newly filled-in input items.
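For further explanation, the main loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm defined by the VoiceXML specification: the names FormItem, run_form, get_input, and on_filled are hypothetical, user input is modeled as plain strings, and the only event modeled is a help event represented by the string "help".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FormItem:
    name: str
    prompt: str
    value: Optional[str] = None  # None means the item is still unfilled


def run_form(
    items: List[FormItem],
    get_input: Callable[[FormItem], str],
    on_filled: Optional[Callable[[FormItem], None]] = None,
) -> List[str]:
    """Simplified select / collect / process loop over a voice form."""
    log: List[str] = []
    while True:
        # Select: choose the first form item that has not been filled.
        current = next((item for item in items if item.value is None), None)
        if current is None:
            break  # every form item is filled, so the form is complete
        # Collect: play the prompt, then gather a response or an event.
        log.append(f"prompt: {current.prompt}")
        response = get_input(current)
        if response == "help":
            # An event was thrown instead of a value; the item stays unfilled
            # and is selected again on the next pass through the loop.
            log.append(f"event: help for {current.name}")
            continue
        # Process: fill the item and run any action tied to the filled item.
        current.value = response
        log.append(f"filled: {current.name}={response}")
        if on_filled is not None:
            on_filled(current)
    return log


# Scripted user input standing in for speech recognition results.
scripted = iter(["help", "Boston", "Monday"])
form = [FormItem("city", "Which city?"), FormItem("day", "Which day?")]
trace = run_form(form, lambda item: next(scripted))
```

In this sketch the help event leaves the form item unfilled, so the same item is prompted again before the loop moves on to the next item.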
To synchronize the receipt of spoken information and visual elements, X+V provides a <sync> element. The <sync> element synchronizes data entered through multiple input modes. That is, the <sync> element synchronizes accepted speech commands received in the multimodal browser with visual elements displayed in the multimodal browser. <Sync> synchronizes the value property of an XHTML input control with a VoiceXML field in a one-to-one manner. <Sync> does not activate a voice handler and therefore does not allow for the identification and execution of additional functions in response to a particular speech command. There is therefore an ongoing need for improvement in synchronizing visual and speech events in a multimodal application that allows for execution of multiple application functions in response to a speech command received in a voice form or voice menu.
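For further explanation, the one-to-one behavior described above can be modeled with the following sketch. It is a toy model rather than X+V markup or any browser's implementation: the class SyncedPair and its methods are hypothetical names, and the visual input control and the voice field are each represented by a single attribute.

```python
from typing import Optional


class SyncedPair:
    """Toy model of one visual input control bound to one voice field."""

    def __init__(self) -> None:
        self.input_value: Optional[str] = None  # value of the visual input control
        self.field_value: Optional[str] = None  # value of the voice field

    def set_from_speech(self, recognized: str) -> None:
        # An accepted speech command fills the voice field and is mirrored
        # into the visual input control. No handler or other function runs.
        self.field_value = recognized
        self.input_value = recognized

    def set_from_keyboard(self, typed: str) -> None:
        # Typed input fills the visual control and is mirrored into the field.
        self.input_value = typed
        self.field_value = typed


pair = SyncedPair()
pair.set_from_speech("Boston")
```

The sketch copies a value between exactly two places and does nothing else, which illustrates why value mirroring alone cannot trigger multiple application functions in response to one speech command.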