1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for enabling grammars in web page frames.
2. Description of Related Art
User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through other modes, such as multimodal access. Devices that support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.
Multimodal applications often run on servers that serve up multimodal web pages for display on a multimodal browser. A ‘multimodal browser,’ as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users through multimodal output. Multimodal browsers typically render web pages written in XHTML+Voice (‘X+V’). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML.
Current lightweight voice solutions require a developer to build a grammar and lexicon to limit the potential number of words that an automatic speech recognition (‘ASR’) engine must recognize, as a means for increasing accuracy. Pervasive devices typically have limited interaction and input modalities due to the form factor of the device, and kiosk devices have limited interaction and input modalities by design. In both cases, speaker-independent voice recognition is implemented to enhance the user experience and interaction with the device. The state of the art in speaker-independent recognition allows for some sophisticated voice applications to be written as long as there is a limited vocabulary associated with each potential voice command. For example, if the user is prompted to speak the name of a city, the system can, with a decent level of confidence, recognize the name of the city spoken.
Voice interaction features are integrated with X+V and can consequently be used directly within X+V content. X+V includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to X+V elements and respond to specific events. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses the events defined by XML Events (referred to in this document generally as ‘events’). The specifications for X+V may be had from the VoiceXML Forum. The specifications for both XHTML and XML Events may be had from the HTML Home Page of the World Wide Web Consortium.
The specifications for VoiceXML may be had from the Voice Browser Activity of the World Wide Web Consortium.
A multimodal application may span multiple XHTML web pages. One of these web pages may specify multiple frames where each frame contains its own XHTML page. For an overview of HTML frames, see the website of the World Wide Web Consortium.
Frames allow an author to present multiple views or subwindows that a browser displays simultaneously. One common use is to place the navigation of the application in a separate subwindow. The navigation subwindow does not change as content is updated in another subwindow. To specify multiple frames, one of the documents that make up the application is a top-level XHTML document, known as a ‘frameset document,’ that contains a <frameset> markup element. One or more <frame> elements are disposed as markup in the frameset document as children of <frameset>. Each frame has a name so that multiple XHTML documents can be placed within it as new content. Each frame can be targeted by its name in markup that identifies a document to display in a subwindow defined by a frame. <link> and anchor (<a>) elements within the XHTML document specify which frame will load the referenced XHTML document via a ‘target’ markup attribute. By default the current frame is the target if the ‘target’ attribute is missing. If a user activates a hyperlink in a frame by use of a mouseclick through a graphical user interface (‘GUI’), only the target frame is updated with new content.
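The frame targeting just described can be illustrated with a minimal sketch. The frameset markup, the frame names (‘nav’, ‘content’), the file names, and the helper functions load_frames and activate_link are all hypothetical and chosen only for illustration; the sketch models which frame receives new content and does not represent any particular browser's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical frameset document; frame names and file names are illustrative only.
FRAMESET_DOC = """
<html>
  <frameset cols="25%,75%">
    <frame name="nav" src="nav.html"/>
    <frame name="content" src="home.html"/>
  </frameset>
</html>
"""

# Hypothetical page displayed in the "nav" frame.
NAV_PAGE = """
<html>
  <body>
    <a href="news.html" target="content">News</a>
    <a href="help.html">Help</a>
  </body>
</html>
"""

def load_frames(frameset_markup):
    """Map each frame name to the document it currently displays."""
    root = ET.fromstring(frameset_markup)
    return {frame.get("name"): frame.get("src") for frame in root.iter("frame")}

def activate_link(frames, current_frame, link):
    """Load the link's document into its target frame.

    When the 'target' attribute is missing, the current frame is the target.
    Targets that name no existing frame are not modeled in this sketch.
    """
    target = link.get("target", current_frame)
    updated = dict(frames)
    updated[target] = link.get("href")
    return updated

frames = load_frames(FRAMESET_DOC)
links = list(ET.fromstring(NAV_PAGE).iter("a"))

# Link with target="content": only the content frame changes.
after_news = activate_link(frames, "nav", links[0])
# Link without a target: the frame holding the link ("nav") changes.
after_help = activate_link(frames, "nav", links[1])
```

In this sketch, activating the first link leaves the navigation frame untouched, while activating the second link replaces the navigation frame itself because no target is specified.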
In current art, however, only the frame currently in focus will have speech recognition grammars enabled. Because the user can see all frames displayed by the browser at one time, the user expects the grammars for all the frames to be enabled. The frames are enabled for hyperlinking through the GUI, but not by voice.
In addition, there is no targeting of a frame when voice is used to activate a hyperlink. The grammar that, when matched against a user utterance, activates a voice-enabled hyperlink may be derived from the link's attributes, such as a title attribute, a name attribute, or another attribute, or from text between a start tag and an end tag in markup of a link. But when the user says the hyperlink's title and the link is activated, the whole page, not a target frame, will be updated with new content. All of the application's frames, including its navigation frame, will be replaced by a single new page. The frame structure defined in the frameset document is destroyed, and the application becomes a single-frame application.
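The contrast between GUI activation and voice activation described above can be sketched as follows. This is an illustration under stated assumptions, not an actual browser or ASR engine: the link markup, the ‘_top’ key standing for the whole browser window, and the helper functions link_phrases, matches, gui_activate, and voice_activate_current_art are hypothetical, and a real speech grammar is considerably richer than the set of phrases used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical voice-enabled link; attribute values are illustrative only.
LINK = ET.fromstring(
    '<a href="news.html" target="content" title="Latest News">News</a>'
)

def link_phrases(link):
    """Collect candidate phrases from the link's title, name, and enclosed text."""
    candidates = [link.get("title"), link.get("name"), link.text]
    return {c.strip().lower() for c in candidates if c and c.strip()}

def matches(link, utterance):
    """Return True when the recognized utterance equals one of the link's phrases."""
    return utterance.strip().lower() in link_phrases(link)

def gui_activate(frames, current_frame, link):
    """GUI activation: only the target frame (or the current frame) is updated."""
    updated = dict(frames)
    updated[link.get("target", current_frame)] = link.get("href")
    return updated

def voice_activate_current_art(frames, link):
    """Voice activation as described above: the target is ignored and the
    whole page, modeled here as a single "_top" entry, is replaced."""
    return {"_top": link.get("href")}

frames = {"nav": "nav.html", "content": "home.html"}

if matches(LINK, "Latest News"):
    by_gui = gui_activate(frames, "nav", LINK)
    by_voice = voice_activate_current_art(frames, LINK)
```

Here by_gui keeps both frames and changes only the content frame, while by_voice holds a single entry, reflecting the loss of the frame structure.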