The present invention relates generally to the Internet and other computer networks, and more particularly to techniques for communicating information over such networks via an audio interface.
The continued growth of the Internet has made it a primary source of information on a wide variety of topics. Access to the Internet and other types of computer networks is typically accomplished via a computer equipped with a browser program. The browser program provides a graphical user interface which allows a user to request information from servers accessible over the network, and to view and otherwise process the information so obtained. Techniques for extending Internet access to users equipped with a telephone or other type of audio interface device have been developed, and are described in, for example, D. L. Atkins et al., “Integrated Web and Telephone A Language Interface to Networked Voice Response Units,” Workshop on Internet Programming Languages, ICCL '98, Loyola University, Chicago, Ill., May 1998, both of which are incorporated by reference herein.
Current approaches to web-based voice dialog generally fall into two categories. The first category includes those approaches that use HyperText Markup Language (HTML) and extensions such as Cascading Style Sheets (CSS) to redefine the meaning of HTML tags.
The second of the two categories noted above includes those approaches that utilize a new language specialized for voice interfaces, such as Voice eXtensible Markup Language (VoiceXML) from the VoiceXML Forum (which includes Lucent, AT&T and Motorola), Speech Markup Language (SpeechML) from IBM, or Talk Markup Language (TalkML) from Hewlett-Packard. These languages may be viewed as presentation mechanisms that address primarily the syntactic issues of the voice interface. The semantics of voice applications on the web are generally handled using custom solutions involving either client-side programming in languages such as Java and JavaScript or server-side methods such as Server-Side Include (SSI) and Common Gateway Interface (CGI) programming. In order to create a rich dialog interface to a computer application using these language-based approaches, an application developer generally must write explicit specifications of the sentences to be understood by the system, such that the actual spoken input can be transformed into the equivalent of a mouse-click or keyboard entry to a web form.
Examples of web-based voice dialog systems are described in U.S. patent application Ser. No. 09/168,405, filed Oct. 6, 1998 in the name of inventors M. K. Brown et al. and entitled “Web-Based Platform for Interactive Voice Response,” which is incorporated by reference herein. More specifically, this application discloses an Interactive Voice Response (IVR) platform which includes a speech synthesizer, a grammar generator and a speech recognizer. The speech synthesizer generates speech which characterizes the structure and content of a web page retrieved over the network. The speech is delivered to a user via a telephone or other type of audio interface device. The grammar generator utilizes textual information parsed from the retrieved web page to produce a grammar. The grammar is then supplied to the speech recognizer and used to interpret voice commands generated by the user. The grammar may also be utilized by the speech synthesizer to create phonetic information, such that similar phonemes are used in both the speech recognizer and the speech synthesizer.
The speech synthesizer, grammar generator and speech recognizer, as well as other elements of the IVR platform, may be used to implement a dialog system in which a dialog is conducted with the user in order to control the output of the web page information to the user. A given retrieved web page may include, for example, text to be read to the user by the speech synthesizer, a program script for executing operations on a host processor, and a hyperlink for each of a set of designated spoken responses which may be received from the user. The web page may also include one or more hyperlinks that are to be utilized when the speech recognizer rejects a given spoken user input as unrecognizable.
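By way of illustration only, the derivation of a grammar from textual information parsed from a retrieved web page might be sketched as follows. The sketch assumes, purely for the example, that the grammar is a simple mapping from the spoken form of each hyperlink's anchor text to its target; the class name `LinkGrammarBuilder` and the function `build_grammar` are hypothetical and do not appear in the referenced application.

```python
from html.parser import HTMLParser


class LinkGrammarBuilder(HTMLParser):
    """Hypothetical helper: collects anchor text and targets from a page."""

    def __init__(self):
        super().__init__()
        self.grammar = {}   # spoken phrase -> hyperlink target
        self._href = None   # target of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize case and whitespace so the phrase matches recognizer output.
            phrase = " ".join("".join(self._text).lower().split())
            if phrase:
                self.grammar[phrase] = self._href
            self._href = None


def build_grammar(html):
    """Return a phrase-to-target mapping for every hyperlink in the page."""
    builder = LinkGrammarBuilder()
    builder.feed(html)
    builder.close()
    return builder.grammar


page = '<p>Welcome.</p><a href="/weather">Local  Weather</a> <a href="/news">News</a>'
print(build_grammar(page))
```

In such a sketch, a recognized utterance would be looked up in the returned mapping to select the hyperlink to follow; an actual grammar generator would typically produce a richer structure than a flat phrase table.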
Despite the advantages provided by the existing approaches described above, a need remains for further improvements in web-based voice dialog interfaces. More specifically, a need exists for a technique which can provide many of the advantages of both categories of approaches, while avoiding the application development difficulties often associated with the specialized language based approaches.
The present invention provides an improved voice dialog interface for use in web-based applications implemented over the Internet or other computer network.
In accordance with the invention, a web-based voice dialog interface is configured to communicate information between a user at a client machine and one or more servers coupled to the client machine via the Internet or other computer network. The interface in an illustrative embodiment includes a web page interpreter for receiving information relating to one or more web pages. The web page interpreter generates a rendering of at least a portion of the information for presentation to a user in an audibly-perceptible format. The web page interpreter may make use of certain pre-specified voice-related tags, e.g., HTML extensions. A grammar processing device utilizes interpreted web page information received from the web page interpreter to generate syntax information and semantic information. A speech recognizer processes received user speech in accordance with the syntax information, and a natural language interpreter processes the resulting recognized speech in accordance with the semantic information to generate output for delivery to a web server in conjunction with a voice dialog which includes the user speech and the rendering of the web page(s). The output may be processed by a common gateway interface (CGI) formatter prior to delivery to a CGI associated with the web server.
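As a minimal sketch of the final step described above, the output of the natural language interpreter might be formatted for a CGI as a set of name/value pairs encoded in a request URL. The function name `format_cgi_request`, the slot names and the example URL are assumptions introduced for illustration; the summary does not specify the format used by the CGI formatter.

```python
from urllib.parse import urlencode


def format_cgi_request(cgi_url, semantic_slots):
    """Encode interpreted slot/value pairs as a CGI GET request URL.

    semantic_slots is a hypothetical dict produced by the natural
    language interpreter; slots left unfilled (None) are omitted.
    """
    # Sort by slot name so the resulting URL is deterministic.
    pairs = [(name, value)
             for name, value in sorted(semantic_slots.items())
             if value is not None]
    return cgi_url + "?" + urlencode(pairs)


url = format_cgi_request(
    "http://example.com/cgi-bin/order",
    {"item": "large pizza", "quantity": 2, "topping": None},
)
print(url)  # http://example.com/cgi-bin/order?item=large+pizza&quantity=2
```

Under these assumptions, the web server receives the same kind of request it would receive from a submitted web form, which is consistent with the stated aim of requiring no specialized natural language knowledge at the server.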
The grammar processing device may include a grammar compiler, and may implement a grammar generation process to generate a grammar specification language which is supplied as input to the grammar compiler. The grammar generation process may utilize a thesaurus to expand the grammar specification language.
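The thesaurus-based expansion may be illustrated by the following sketch, which assumes that the thesaurus is a simple word-to-synonyms table and that expansion substitutes synonyms word by word. The function `expand_phrase` and the sample thesaurus are hypothetical; the actual grammar specification language and expansion rules are not detailed in this summary.

```python
from itertools import product


def expand_phrase(phrase, thesaurus):
    """Return every variant of the phrase obtained by synonym substitution."""
    # For each word, the alternatives are the word itself plus its synonyms.
    options = [[word] + thesaurus.get(word, [])
               for word in phrase.lower().split()]
    # The Cartesian product enumerates one choice per word position.
    return [" ".join(choice) for choice in product(*options)]


# Hypothetical thesaurus entries, for illustration only.
THESAURUS = {"buy": ["purchase", "order"], "tickets": ["seats"]}

for variant in expand_phrase("buy tickets", THESAURUS):
    print(variant)
```

With this sample table, the single phrase "buy tickets" yields six variants, so a developer would write one phrase and the recognizer could accept several equivalent wordings.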
In accordance with another aspect of the invention, the web page interpreter may further generate a client library associated with interpretations of web pages previously performed on a common client machine. The client library will generally include a script language definition of semantic actions, and may be utilized by a web server in generating an appropriate response to a user speech portion of a dialog.
In accordance with a further aspect of the invention, dialog control may be handled by representing a given dialog turn in a single web page. In this case, a finite-state dialog controller may be implemented as a sequence of web pages each representing a dialog turn.
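A finite-state dialog controller of this kind might be sketched as follows, with each state corresponding to one web page and each transition to a hyperlink selected by the recognized user speech. The page names, prompts and the `reject` link used for unrecognized input are hypothetical values chosen for the example.

```python
# Each entry stands for one web page, i.e., one dialog turn (hypothetical data).
PAGES = {
    "start.html": {
        "prompt": "Do you want weather or news?",
        "links": {"weather": "weather.html", "news": "news.html"},
        "reject": "start.html",   # followed when the input is not recognized
    },
    "weather.html": {"prompt": "Today is sunny.", "links": {}, "reject": "weather.html"},
    "news.html": {"prompt": "Here are the headlines.", "links": {}, "reject": "news.html"},
}


def next_page(current, recognized):
    """Return the page reached from the current page for a recognized phrase."""
    page = PAGES[current]
    return page["links"].get(recognized, page["reject"])


def run_dialog(start, utterances):
    """Follow one transition per utterance and return the pages visited."""
    visited = [start]
    for utterance in utterances:
        visited.append(next_page(visited[-1], utterance))
    return visited


print(run_dialog("start.html", ["sports", "news"]))
# ['start.html', 'start.html', 'news.html']
```

In this sketch the dialog state is simply the name of the current page, so no separate controller state needs to be maintained beyond the page most recently retrieved.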
In accordance with yet another aspect of the invention, the processing operations of the web-based voice dialog interface are associated with an application developed using a dialog application development tool. The dialog application development tool may include an authoring tool which (i) utilizes a grammar specification language to generate output in a web page format for delivery to one or more clients, and (ii) parses code to generate a CGI output for delivery to the web server.
Advantageously, the techniques of the invention allow a voice dialog processing system to reduce client-server traffic and perform immediate execution of client-side operations. Other advantages include less computational burden on the web server, the elimination of any need for specialized natural language knowledge at the web server, a simplified interface, and unified control at both the client and the server.