The invention disclosed herein relates generally to voice activating web pages. More particularly, the present invention provides systems and methods for voice activating multiple windows containing web pages and complex web pages.
Over the past decade Automated Speech Recognition (“ASR”) systems have progressed to the point where a high degree of recognition accuracy may be obtained by ASR systems installed on moderately priced personal computers and workstations. This has led to a rise in the number of ASR systems available for consumer and industry applications.
ASR systems rely on voice grammars to recognize vocal commands input via a microphone and to act on those commands. Voice grammars fall into two categories: rule-based grammars and free speech grammars. Rule-based grammars allow the recognition of a limited set of predefined phrases. Invoking a rule-based grammar causes an event or set of events to occur. A rule-based grammar is invoked when an utterance, input via a microphone, matches a speech template corresponding to a phrase stored within the set of predefined phrases. For example, the user may say “save file” while editing a document in a word processing program to invoke the save command.
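The rule-based grammar described above can be sketched as a mapping from predefined phrases to command handlers. The following Python sketch is illustrative only; the class and handler names are hypothetical and do not correspond to any actual ASR system.

```python
# Minimal sketch of a rule-based grammar: a fixed set of predefined
# phrases, each mapped to a handler that fires when a recognized
# utterance matches that phrase. All names here are hypothetical.

class RuleBasedGrammar:
    def __init__(self):
        self._rules = {}  # predefined phrase -> handler

    def add_rule(self, phrase, handler):
        self._rules[phrase.lower()] = handler

    def invoke(self, utterance):
        """Invoke the handler whose phrase matches the utterance, if any."""
        handler = self._rules.get(utterance.strip().lower())
        if handler is None:
            return None  # utterance is outside the predefined set
        return handler()

grammar = RuleBasedGrammar()
grammar.add_rule("save file", lambda: "document saved")
print(grammar.invoke("save file"))  # matching utterance fires the rule
print(grammar.invoke("open file"))  # unmatched utterance is ignored
```

The key property of the rule-based approach is visible here: only the small, fixed set of phrases registered in advance can ever be recognized.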
On the other hand, free speech grammars recognize large sets of words in a given domain, such as Business English. These grammars are generally used for dictation applications. Some examples of these systems are Dragon Naturally Speaking and IBM ViaVoice 7 Millennium. ASR systems have also incorporated text-to-speech (“TTS”) capabilities, which enable ASR systems to speak graphically rendered text using a synthesized voice. For example, an ASR system can read a highlighted paragraph within a word processor aloud through speakers.
ASR systems have been integrated with web browsers to create voice-enabled web browsers. Voice-enabled web browsers allow the user to navigate the Internet by using voice commands that invoke rule-based grammars. Some of the voice commands used by these browsers include utterances that cause the software to execute traditional web browser commands. For example, if the user says “home” into a microphone, a voice-enabled web browser executes the same routines that it would execute if the user clicked the browser's “home” button.
In addition, some voice-enabled web browsers create rule-based grammars based on web page content. As a web page is downloaded and displayed, some voice-enabled web browsers create rule-based grammars based on the links contained within the page. For example, if a web page displayed a link “company home,” such a voice-enabled web browser would create a rule-based grammar, effective while the web page is displayed, such that if a user uttered the phrase “company home” into a microphone, the browser would display the web page associated with that link. One shortcoming of this approach is that the rules generated from web page content remain fixed over long periods of time because web pages are not redesigned often. Additionally, the rule-based grammars are generated from web page content that is primarily intended for visual display. In effect, these systems limit the user to saying what appears on the screen.
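The link-derived grammar technique described above can be sketched as a pass over the page's anchor elements, turning each link's visible text into a spoken phrase mapped to the link's target. The sketch below uses only the Python standard library; the sample markup and class name are hypothetical.

```python
# Sketch: derive rule-based grammar phrases from the links in a web
# page, as some voice-enabled browsers do. The page markup below is a
# hypothetical example; only standard-library parsing is used.
from html.parser import HTMLParser

class LinkGrammarBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.rules = {}  # spoken phrase -> target URL

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            # Normalize the visible link text into a spoken phrase.
            phrase = " ".join("".join(self._text).split()).lower()
            if phrase:
                self.rules[phrase] = self._href
            self._href = None

page = '<p><a href="/index.html">Company Home</a></p>'
builder = LinkGrammarBuilder()
builder.feed(page)
print(builder.rules)  # {'company home': '/index.html'}
```

Note how the resulting grammar is entirely determined by the page's visible content, which is the limitation identified above: the user can only say what appears on the screen.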
Web pages can also incorporate audio elements, which cause sound to be output. Currently, audio elements can be incorporated into web pages in two ways. The first is to use audio wave file content to provide a human-sounding voice for the web page. Using audio wave files allows the web page designer to design the visual and audio portions of the web page independently, but this freedom and added functionality come at a high price: the bandwidth required to transfer binary sound files over the Internet to the end user is extremely large.
The second way to incorporate an audio element is to leverage the functionality of an ASR system. Voice-enabled web browsers may use the TTS functionality of an ASR system to have the computer “speak” the content of a web page. This approach causes the bandwidth needed to view the page with or without the audio element to be approximately the same, but it limits what the web browser can speak to the content of the web page.
Voice XML (VXML) affords a web page designer another option. VXML allows a user to navigate a web site solely through audio commands, typically issued over the phone. VXML requires that a TTS translator read a web page to a user by translating the visual web page into an audio expression of the page. The user navigates the web by speaking the links the user wants to follow. With this approach a user can navigate the Internet using only the user's voice, but the audio content is typically generated from web page content that is primarily designed for visual interpretation, and the visual interface is removed from the user's experience.
Thus, the inventors addressed the need to independently create an audio component of a web page that does not demand a large amount of transmission bandwidth and exists in conjunction with the visual component of a web page by inventing the system further described in Patent Cooperation Treaty International Application No. PCT/US01/45223, which is hereby incorporated herein by reference in its entirety. The '45223 application discloses systems and methods for, among other things, activating voice content in a single, simple visual web page.
The system of the '45223 application controls speech content within a web page via a proxy server that has access to the same computer device (or the sound output channels of that device) as the browser. The proxy server examines data for speech content while at the same time feeding all other requested data to the browser. In the case where a user clicks on a link, data requested by the browser, specified by a URL, is passed through the proxy server to the specified web server. The requested material from the web server is passed back to the browser.
In the case where a new URL is requested by the user via a speech event, however, the proxy server requests this data (e.g., by executing a specified program or other command on the web server) from the specified web server via the browser. Thus, the resultant data needs to be “pushed” back to the browser. This is accomplished via the multipart/x-mixed-replace MIME type further described in the '45223 application. This type causes the browser to hold open the connection between the browser and the proxy server, and to continue accepting data until a given signal or other token is sent or the connection is closed. For example, termination may occur because of a new “click” requested from the browser or because there is no speech content in the new page. The circumstances for termination are further described in the truth tables shown in FIG. 3B of the '45223 application and further described therein.
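The server-push mechanism described above can be illustrated by the wire format of a multipart/x-mixed-replace response: a header declaring the MIME type and boundary, one or more boundary-delimited parts (each of which replaces the previous one in the browser), and an optional closing boundary. The sketch below constructs such a response as plain bytes; the boundary string and part contents are hypothetical and are not taken from the '45223 application.

```python
# Sketch of the multipart/x-mixed-replace server-push technique: the
# proxy holds the connection open and streams replacement parts until
# it sends a terminating token or closes the connection. The boundary
# string and part contents below are hypothetical.
BOUNDARY = "proxy-push-boundary"

def push_response_header():
    """HTTP response header declaring the multipart replace stream."""
    return (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: multipart/x-mixed-replace; boundary={BOUNDARY}\r\n"
        "\r\n"
    )

def push_part(html):
    """One part of the stream; the browser replaces the prior part."""
    body = html.encode("utf-8")
    return (
        f"--{BOUNDARY}\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8") + body + b"\r\n"

def push_end():
    # Closing boundary; alternatively the connection is simply closed.
    return f"--{BOUNDARY}--\r\n".encode("utf-8")

stream = push_response_header().encode("utf-8")
stream += push_part("<html><body>Page pushed by speech event</body></html>")
stream += push_end()
print(stream.decode("utf-8")[:60])
```

Because the browser keeps the connection open for further parts, the proxy can deliver the result of a speech-initiated request at any later moment, which is the “push” behavior the mechanism exists to provide.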
The inventors have identified additional improvements, further described herein, including how to extend the system to work with web pages that contain complex, aggregate content or content from multiple pages operating simultaneously, for example in multiple instances or windows of a given browser, or in multiple frames within a given window.