1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for disambiguating a speech recognition grammar in a multimodal application.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become increasingly smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through multimodal access, that is, by interaction in non-voice modes as well as voice mode. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications are often formed by sets of markup documents served up by web servers for display on multimodal browsers. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output, where modes of the multimodal input and output include at least a speech mode. Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V is described in the W3C specification entitled “XHTML+Voice Profile 1.2” of Mar. 16, 2004, as follows:

    The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content.
X+V provides a markup language that enables users to interact with a multimodal application through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. Visual markup tells a multimodal browser what the user interface is to look like and how it is to behave when the user types, points, or clicks. Similarly, voice markup tells a multimodal browser what to do when the user speaks to it. For visual markup, the multimodal browser uses a graphics engine; for voice markup, the multimodal browser uses a speech engine. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events.
X+V includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
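The integration can be illustrated with a short X+V page. The following is a minimal sketch rather than a working excerpt of any particular application: the form identifiers, field names, and the small city grammar are illustrative assumptions, and a complete page would carry the appropriate X+V document type declaration.

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:vxml="http://www.w3.org/2001/vxml"
          xmlns:ev="http://www.w3.org/2001/xml-events">
     <head>
      <!-- voice markup: a VoiceXML dialog that prompts for a city
           name and recognizes against a small grammar -->
      <vxml:form id="voice_city">
       <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar><![CDATA[
         #JSGF V1.0;
         grammar city;
         public <city> = Seattle | New York | Chicago ;
        ]]></vxml:grammar>
        <vxml:filled>
         <!-- synchronization: copy the recognized value into the
              corresponding visual interface element -->
         <vxml:assign name="document.fid.city.value" expr="city"/>
        </vxml:filled>
       </vxml:field>
      </vxml:form>
     </head>
     <body>
      <form id="fid">
       <!-- visual markup: an XML Events handler attaches the voice
            dialog so that focusing the field activates it -->
       <input type="text" name="city" ev:event="focus"
              ev:handler="#voice_city"/>
      </form>
     </body>
    </html>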
In addition to X+V, multimodal applications also may be implemented with Speech Application Language Tags (‘SALT’). SALT is a markup language developed by the SALT Forum. Both X+V and SALT are markup languages for creating applications that use voice input/speech recognition and voice output/speech synthesis. Both SALT applications and X+V applications use underlying speech recognition and synthesis technologies or ‘speech engines’ to do the work of recognizing and generating human speech. As markup languages, both X+V and SALT provide markup-based programming environments for using speech engines in an application's user interface. Both languages have language elements, markup tags, that specify what the speech-recognition engine should listen for and what the synthesis engine should ‘say.’ Whereas X+V combines XHTML, VoiceXML, and the XML Events standard to create multimodal applications, SALT does not provide a standard visual markup language or eventing model. Rather, it is a low-level set of tags for specifying voice interaction that can be embedded into other environments. In addition to X+V and SALT, multimodal applications may be implemented in Java with a Java speech framework, in C++, for example, and with other technologies and in other environments as well.
Current lightweight voice solutions require a developer to build a grammar and lexicon to limit the potential number of words that an automated speech recognition (‘ASR’) engine must recognize, as a means for increasing accuracy. Pervasive devices have limited interaction and input modalities due to the form factor of the device, and kiosk devices have limited interaction and input modalities by design. In both cases, speaker-independent voice recognition is used to enhance the user experience and interaction with the device. The state of the art in speaker-independent recognition allows for some sophisticated voice applications to be written as long as there is a limited vocabulary associated with each potential voice command. For example, if the user is prompted to speak the name of a city, the system can, with a good level of confidence, recognize the name of the city spoken.
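Such a constrained grammar might look like the following sketch, written here in the Java Speech Grammar Format (‘JSGF’); the rule name and the particular city names are illustrative assumptions:

    #JSGF V1.0;
    grammar cities;
    // the ASR engine matches utterances against only these alternatives
    public <city> = Seattle | New York | Chicago | Boston ;

Because the active vocabulary is a handful of city names rather than an open dictation vocabulary, the engine can match the utterance with a good level of confidence.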
When using a multimodal browser on a multimodal device with a display screen to access dynamic search results on the web, a user often is presented with a list of results which may contain duplicate names. For example, when searching for book stores in Seattle, a user may be presented with multiple locations of chain book stores such as Borders, Barnes & Noble, and so on. On a device with a limited screen size, only a fraction of the results may be visible at a time. A search for book stores in New York City, for example, may yield a list containing links to several Barnes & Noble book stores.
Speech recognition grammars that voice enable the display of such search results may be dynamically generated by a web server or a browser. Grammars generated from the results being presented on the screen may produce ambiguous results when correlating the matched utterance to the data the user sees on the display. In the Barnes & Noble example, the utterance “barnes and noble” may match grammar elements that voice enable all the links to Barnes & Noble stores in the search results. If grammar elements are generated in the order the data appears on the display, a prior art speech engine will match the last in a set of duplicates because of the search algorithm in the automatic speech recognition engine, which may search from a leaf node up the branches in a grammar tree, for example. If the display has not been scrolled, then the last link to Barnes & Noble is not visible on the display, and the link activated in response to the utterance is not the one visible to the user, and therefore is unlikely to be the one that the user thought was invoked by the utterance. When this unintended, ambiguous link is invoked, the user finds, confusingly, that the browser is displaying information for some Barnes & Noble store other than the one the user intended.
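The ambiguity can be seen in a sketch of such a dynamically generated grammar, again in JSGF; the one-alternative-per-link generation scheme and the rule names are illustrative assumptions:

    #JSGF V1.0;
    grammar search_results;
    // one alternative is generated per result link, in display order
    public <link> = Borders             // first result, visible
                  | Barnes and Noble    // second result, visible
                  | Barnes and Noble    // third result, scrolled off screen
                  | Barnes and Noble ;  // fourth result, off screen

The utterance “barnes and noble” matches the second, third, and fourth alternatives equally well, but an engine that searches from a leaf node up the branches of the grammar tree returns only the last of the duplicates, so the link activated is the one for the fourth result even though the result the user sees on the display is the second.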