1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for providing oral modification of an automatic speech recognition (‘ASR’) lexicon of an ASR engine.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through other modes, such as multimodal access. Devices that support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications often run on servers that serve up multimodal web pages for display on a multimodal browser. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output. Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. X+V includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.
Current lightweight voice solutions require a developer to build a grammar and lexicon to limit the potential number of words that an ASR engine must recognize, as a means of increasing accuracy. This approach is inherently limiting because one or more elements of the grammar may not be properly accounted for in the lexicon.
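The gap between grammar and lexicon described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the grammar is reduced to a set of words and the lexicon to a mapping from each word to a list of phoneme strings. The words, the pronunciations, and the function name `missing_from_lexicon` are all hypothetical and do not come from any particular ASR engine.

```python
# Illustrative sketch only: the words, pronunciations, and data layout
# below are assumptions, not the format of any real ASR engine.

# Words that the (hypothetical) grammar allows the user to say.
grammar_words = {"play", "pause", "shuffle"}

# Hypothetical lexicon: word -> list of phoneme-string pronunciations.
lexicon = {
    "play": ["P L EY"],
    "pause": ["P AO Z"],
}


def missing_from_lexicon(grammar_words, lexicon):
    """Return, in sorted order, grammar words with no pronunciation in the lexicon."""
    return sorted(word for word in grammar_words if not lexicon.get(word))


# Words reported here are in the grammar but cannot be matched
# by an engine that relies on this lexicon for pronunciations.
uncovered = missing_from_lexicon(grammar_words, lexicon)
print(uncovered)
```

In this sketch, a word with no lexicon entry is treated the same as a word whose entry holds an empty pronunciation list, since in either case the engine has nothing to match against.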
The current state of the art allows for correction of improperly recognized words given a graphical user interface (‘GUI’), for example in fairly mature versions of IBM ViaVoice, Dragon NaturallySpeaking, and L&H VoiceXpress, starting around 1998. If a user dictated something like “Call Martha tomorrow” and the system recognized “Call Marsha tomorrow”, then the user could correct Marsha by double-clicking Marsha on the GUI and selecting an alternative from a pop-up list. The user could also select Marsha orally with “Select Marsha” or “Correct Marsha”. If Martha was out of grammar, the user could command the system to allow the user to voice spell ‘Martha’ and add it to the grammar, all of which in prior art systems required a complex sequence of GUI commands or voice commands.
It is desirable in voice-enabled systems for the oral interactions to approximate human conversation. Where the user is required to select the improperly recognized word through GUI operations, there is no analog for correcting the error solely by interacting orally. Where the user is provided voice commands for correction, the current state of the art presents an interaction model that is more segmented than would be expected of normal human conversation.