Computer-based interactive speech applications are designed to provide automated interactive communication, for example, in telephone-based voice portals that answer incoming calls. A voice portal can take calls and perform various tasks based on the caller's speech. The tasks may include gathering information from callers, providing information to callers, and directing callers to appropriate parties.
Some speech applications are implemented using Voice Extensible Markup Language (VoiceXML 1.0 as defined by the World Wide Web Consortium) technology. Using VoiceXML technology, the flow of an automated dialog with a caller is controlled using one or more VoiceXML documents. Each document essentially specifies a part of the interaction between the computer and the caller in the form of a script. When a caller interacts with a system, a VoiceXML document is processed by a “browser” application to implement the specified dialog, for example, to elicit information from a user and perform tasks based on user responses.
A VoiceXML document defines the flow of a portion of the dialog, which may include forms and menus. A form defines an interaction for collecting values for a set of field items. For example, the field items may include credit card type, name of credit card holder, and expiration date of the card. Each field may specify a grammar that defines the allowable inputs for that field (e.g., a credit card number can include only digits and not letters). A menu defines choices selectable by a user and tasks to be performed when a choice is selected. For example, a menu may contain a list of the names of stores that can be selected by the user, and uniform resource locators (URLs) of additional VoiceXML documents maintained by those stores. In response to a user selection, the browser application loads the VoiceXML document of the selected store and performs additional dialog tailored to the services provided by that store.
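By way of illustration only, the following Python sketch embeds a small hypothetical VoiceXML document containing one form and one menu, and lists its field items and menu choices using the standard-library XML parser. The field names, grammar file names, and store URLs are assumptions made for this example and are not taken from any particular application.

```python
import xml.etree.ElementTree as ET

# Hypothetical VoiceXML document: one form with three fields and one menu.
# All names, grammar files, and URLs below are illustrative assumptions.
VXML = """<vxml version="1.0">
  <form id="payment">
    <field name="card_type"><grammar src="card_type.gram"/></field>
    <field name="card_holder"><grammar src="names.gram"/></field>
    <field name="expiration_date"><grammar src="date.gram"/></field>
  </form>
  <menu id="stores">
    <choice next="http://stores.example/books.vxml">books</choice>
    <choice next="http://stores.example/music.vxml">music</choice>
  </menu>
</vxml>"""


def summarize(vxml_text):
    """Return (field name, grammar source) pairs and (choice, URL) pairs."""
    root = ET.fromstring(vxml_text)
    fields = [(f.get("name"), f.find("grammar").get("src"))
              for f in root.iter("field")]
    choices = [(c.text.strip(), c.get("next")) for c in root.iter("choice")]
    return fields, choices


fields, choices = summarize(VXML)
```

In this sketch each field carries its own grammar reference, and each menu choice carries the URL of the document to be loaded when that choice is selected.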
The VoiceXML documents are typically stored remotely from the computer hosting the browser application, and are retrieved by the browser application as they are needed. During the flow of a dialog, references (e.g., in a menu or form) in one VoiceXML document direct the browser application to retrieve other related documents. Based on instructions in the VoiceXML documents, the browser application invokes text-to-speech procedures or plays prerecorded prompts to generate voice prompts, and invokes a speech recognition procedure to recognize speech uttered by the user.
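The retrieve-and-interpret cycle described above may be sketched, under simplifying assumptions, as a loop that fetches a document, renders its prompts, recognizes a response, and follows the selected reference. In the Python sketch below, the dictionary DOCUMENTS stands in for remote storage, and the fetch, speak, and recognize callables are hypothetical placeholders for document retrieval, prompt generation, and speech recognition; none of these names belongs to an actual browser implementation.

```python
import xml.etree.ElementTree as ET

# Stand-in for remote document storage: URL -> VoiceXML text.
# The URLs and document contents are illustrative assumptions.
DOCUMENTS = {
    "http://portal.example/main.vxml": """<vxml version="1.0">
  <menu>
    <prompt>Say books or music.</prompt>
    <choice next="http://portal.example/books.vxml">books</choice>
    <choice next="http://portal.example/music.vxml">music</choice>
  </menu>
</vxml>""",
    "http://portal.example/books.vxml": """<vxml version="1.0">
  <form><block><prompt>Welcome to the book store.</prompt></block></form>
</vxml>""",
}


def run_dialog(start_url, fetch, speak, recognize):
    """Retrieve and interpret documents until no further reference is followed."""
    url = start_url
    visited = []
    while url is not None:
        visited.append(url)
        root = ET.fromstring(fetch(url))   # retrieve the document on demand
        for prompt in root.iter("prompt"):
            speak(prompt.text.strip())     # text-to-speech or prerecorded prompt
        choices = {c.text.strip(): c.get("next") for c in root.iter("choice")}
        # Recognition is limited to the menu choices; the dialog ends when the
        # document has no menu or the response matches no choice.
        url = choices.get(recognize(sorted(choices))) if choices else None
    return visited


spoken = []
visited = run_dialog(
    "http://portal.example/main.vxml",
    fetch=DOCUMENTS.__getitem__,
    speak=spoken.append,
    recognize=lambda allowed: "books",     # simulated caller utterance
)
```

Passing the three procedures in as parameters mirrors the separation between the browser application and the procedures it invokes.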
Text-to-speech procedures and speech recognition procedures are not necessarily hosted on the same computer that hosts the browser interpreting the VoiceXML script. For example, dedicated servers may host programs implementing these procedures, and whenever the browser needs the services of one of these procedures, it communicates with an appropriate server to invoke the procedure. In some systems, there may be multiple servers, for example, multiple speech recognition servers.
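As one hypothetical illustration of a browser communicating with an appropriate server when several are available, the Python sketch below selects among servers in simple rotation. The rotation policy, the class and function names, and the host addresses are all assumptions made for this example; the arrangement described above does not prescribe how a server is chosen.

```python
import itertools


class ServerPool:
    """Round-robin selection among servers offering the same procedure."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one server address is required")
        self._cycle = itertools.cycle(addresses)

    def next_server(self):
        return next(self._cycle)


# Hypothetical host addresses; a real deployment would supply its own.
POOLS = {
    "recognition": ServerPool(["asr1.example:5000", "asr2.example:5000"]),
    "text_to_speech": ServerPool(["tts1.example:6000"]),
}


def route(service):
    """Return the address of the server to contact for the named service."""
    return POOLS[service].next_server()


picked = [route("recognition") for _ in range(3)]
```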
VoiceXML also provides a mechanism for recording audio and sending the recorded audio from the browser to a server. A <submit> tag in the VoiceXML document causes the browser to issue a Hypertext Transfer Protocol (HTTP) POST command that sends the audio recording to the server as a Multipurpose Internet Mail Extensions (MIME)-encoded message. An audio recording may also be sent from a server to a browser by using a <submit> tag and an HTTP POST command. One application of this mechanism is to store and retrieve voice messages on the server.
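For illustration, the Python sketch below shows one way the body of such an HTTP POST command might be assembled, assuming the recording is encoded as a single multipart/form-data part. The function name, the form field name, the file name, and the placeholder audio bytes are assumptions made for this example.

```python
import uuid


def build_multipart(field_name, filename, audio_bytes, content_type="audio/x-wav"):
    """Encode one audio recording as a multipart/form-data request body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return headers, head + audio_bytes + tail


# Placeholder bytes standing in for a recorded utterance.
headers, body = build_multipart("message", "message.wav", b"RIFF\x00\x00\x00\x00WAVE")
```

The resulting headers and body could then be handed to any HTTP client, for example as the headers and data arguments of urllib.request.Request with method="POST".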