Over the past decade, Automated Speech Recognition (ASR) systems have progressed to the point where a high degree of recognition accuracy may be obtained by ASR systems installed on moderately priced personal computers and workstations. This has led to a rise in the number of ASR systems available for consumer and industry applications.
ASR systems rely on voice grammars to recognize vocal commands input via a microphone and to act on those commands. Voice grammars fall into two categories: rule based grammars and free speech grammars. Rule based grammars allow the recognition of a limited set of predefined phrases. Each rule based grammar, if invoked, causes an event or set of events to occur. A rule based grammar is invoked if an utterance, input via a microphone, matches a speech template corresponding to a phrase stored within the set of predefined phrases. For example, the user may say “save file” while editing a document in a word processing program to invoke the save command. Free speech grammars, on the other hand, recognize large sets of words in a given domain such as Business English. These grammars are generally used for dictation applications; examples of such systems are Dragon NaturallySpeaking and IBM ViaVoice 7 Millennium. ASR systems have also incorporated text to speech (TTS) capabilities, which enable ASR systems to speak graphically rendered text using a synthesized voice. For example, an ASR system can read a highlighted paragraph within a word processor aloud through speakers.
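The rule based behavior described above can be illustrated with a minimal sketch. It assumes the speech engine has already converted the utterance to text, so matching against a speech template is reduced to a normalized string lookup. The class name `RuleGrammar`, its methods, and the action labels are hypothetical and are not drawn from any particular ASR product.

```python
from typing import Callable, Dict, Optional


class RuleGrammar:
    """Toy rule based grammar: a fixed set of phrases, each bound to an action.

    Assumption: the utterance arrives as text; a real ASR system would
    compare audio against speech templates instead.
    """

    def __init__(self) -> None:
        self._rules: Dict[str, Callable[[], str]] = {}

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase and collapse whitespace so "Save  File" matches "save file".
        return " ".join(text.lower().split())

    def add_rule(self, phrase: str, action: Callable[[], str]) -> None:
        self._rules[self._normalize(phrase)] = action

    def recognize(self, utterance: str) -> Optional[str]:
        # Invoke the bound action only when the utterance matches a stored phrase.
        action = self._rules.get(self._normalize(utterance))
        return action() if action is not None else None


grammar = RuleGrammar()
grammar.add_rule("save file", lambda: "SAVE")   # hypothetical editor command
grammar.add_rule("home", lambda: "GO_HOME")     # hypothetical browser command
```

An utterance outside the predefined set, such as a dictated sentence, returns `None` in this sketch, which mirrors the limitation that a rule based grammar recognizes only its stored phrases.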
ASR systems have been integrated with web browsers to create voice enabled web browsers. Voice enabled web browsers allow the user to navigate the Internet by using voice commands which invoke rule based grammars. Some of the voice commands used by these browsers include utterances that cause the software to execute traditional commands used by web browsers. For example, if the user says “home” into a microphone, a voice enabled web browser would execute the same routines that it would execute if the user clicked on its “home” button. In addition, some voice enabled web browsers create rule based grammars based on web page content. As a web page is downloaded and displayed, these browsers create rule based grammars from the links contained within the web page. For example, if a web page displayed a link “company home,” such a voice enabled web browser would create a rule based grammar, effective while the web page is displayed, such that if a user uttered the phrase “company home” into a microphone, the voice enabled web browser would display the web page associated with the link. One shortcoming of this approach is that the rules generated from web page content remain fixed over long periods of time because web pages are not redesigned often. Additionally, the rule based grammars are generated from web page content, which is primarily intended for visual display. In effect, these systems limit the user to saying what appears on the screen.
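The link derived grammars described above can be sketched as follows, using Python's standard `html.parser` module. The sketch assumes that the spoken phrase is simply the normalized visible text of each link; the names `LinkGrammarBuilder`, `grammar_from_page`, and the sample page are hypothetical illustrations rather than the method of any specific browser.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkGrammarBuilder(HTMLParser):
    """Collects the visible text of each <a href=...> link as a spoken phrase."""

    def __init__(self) -> None:
        super().__init__()
        self.grammar: Dict[str, str] = {}   # spoken phrase -> link target
        self._href: Optional[str] = None    # target of the link being read
        self._text: List[str] = []          # text fragments inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside a link with a target contributes to the grammar.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            phrase = " ".join("".join(self._text).lower().split())
            if phrase:
                self.grammar[phrase] = self._href
            self._href = None


def grammar_from_page(html: str) -> Dict[str, str]:
    builder = LinkGrammarBuilder()
    builder.feed(html)
    builder.close()
    return builder.grammar


# Hypothetical page content used only for illustration.
PAGE = ('<p>Welcome. <a href="/index.html">Company Home</a> | '
        '<a href="/jobs.html">Careers</a></p>')
page_grammar = grammar_from_page(PAGE)
```

Because the phrases come only from link text, the word “welcome” on the sample page produces no rule, which illustrates how such a grammar limits the user to saying what appears on the screen as a link.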
Web pages can also incorporate audio elements, which cause sound to be output. Currently, web pages can incorporate audio elements in two ways. The first way is to use audio wave file content to provide a human sounding voice to a web page. Using audio wave files allows the web page designer to design the visual and audio portions of the web page independently, but this freedom and added functionality come at a high price: the bandwidth required to transfer binary sound files over the Internet to the end user is large. The second way is to leverage the functionality of an ASR system. Voice enabled web browsers may utilize the TTS functionality of an ASR system in such a way as to have the computer “speak” the content of a web page. With this approach, the bandwidth needed to view the page with or without the audio element is approximately the same, but the subject matter of what the web browser can speak is limited to the content of the web page.
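The bandwidth difference between the two approaches can be illustrated with rough arithmetic. The figures below are assumptions chosen only for illustration: uncompressed mono PCM audio at 8,000 samples per second and 16 bits per sample, a two second recording, and a short sentence sent as UTF-8 text for a TTS engine to speak. None of these values comes from the systems discussed above.

```python
def pcm_bytes(seconds: int, sample_rate_hz: int,
              bits_per_sample: int, channels: int = 1) -> int:
    """Size of uncompressed PCM audio data in bytes, excluding any file header."""
    return seconds * sample_rate_hz * channels * (bits_per_sample // 8)


# Assumed parameters (illustrative only).
SENTENCE = "Welcome to the company home page."
ASSUMED_DURATION_S = 2          # assumed time needed to speak the sentence
ASSUMED_SAMPLE_RATE_HZ = 8000   # assumed telephone-quality sampling rate
ASSUMED_BITS_PER_SAMPLE = 16

audio_size = pcm_bytes(ASSUMED_DURATION_S, ASSUMED_SAMPLE_RATE_HZ,
                       ASSUMED_BITS_PER_SAMPLE)
text_size = len(SENTENCE.encode("utf-8"))

print(f"audio: {audio_size} bytes, text: {text_size} bytes")
```

Under these assumptions the recorded audio is roughly a thousand times larger than the text that a TTS engine would need in order to speak the same sentence. Compressed audio formats would narrow the gap but would not close it.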
Voice XML (VXML) affords a web page designer another option. VXML allows a user to navigate a web site solely through the use of audio commands, typically over the phone. VXML requires that a TTS translator read a web page to a user by translating the visual web page to an audio expression of the web page. The user navigates the web by speaking the links the user wants to follow. With this approach a user can navigate the Internet by using only the user's voice, but the audio content is typically generated from web page content that is primarily designed for visual interpretation, and the visual interface is removed from the user's experience.
Accordingly, there exists a continuing need to independently create an audio component of a web page that does not demand a large amount of transmission bandwidth and exists in conjunction with the visual component of a web page.