1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to a system and method of automatically constructing a dialog system for a web site.
2. Background
Spoken dialog systems provide individuals and companies with a cost-effective means of communicating with customers. For example, a spoken dialog system can be deployed as part of a telephone service that enables users to call in and talk with the computer system to receive billing information or other telephone service-related information. In order for the computer system to understand the words spoken by the user, a process of generating data and training recognition grammars is necessary. The resulting grammars generated from the training process enable the spoken dialog system to accurately recognize words spoken within the “domain” that it expects. For example, the telephone service spoken dialog system will expect questions and inquiries about subject matter associated with the user's phone service. Developing such spoken dialog systems is a labor-intensive process that can take many human developers months to complete.
Many companies want a voice interface to their website. The prevalent method of creating such a spoken dialog service is a handcrafted process in which developers use data and human knowledge to manually create a task representation model, which is then used to generate the general dialog infrastructure. Current approaches to creating the dialog include programming it in VoiceXML and handcrafting a spoken dialog system.
The general process of creating a handcrafted spoken dialog service is illustrated in FIG. 1. The process requires a database of information and human task knowledge (102). For example, to provide a voice interface to a website, a person must review the text of the website and manually assign parameters to the text in order to train the various automatic speech recognition, natural language understanding, dialog management and text-to-speech modules in a spoken dialog system.
A typical spoken dialog system includes the general components or modules illustrated in FIG. 2. The spoken dialog system 200 may operate on a single computing device or on a distributed computer network. The system 200 receives speech sounds from a user 202 and operates to generate a response. The general components of such a system include an automatic speech recognition (“ASR”) module 204 that recognizes the words spoken by the user 202. A spoken language understanding (“SLU”) module 206 assigns a meaning to the words received from the ASR module 204. A dialog management (“DM”) module 208 manages the dialog by determining an appropriate response to the user's question. Based on the determined action, a spoken language generation (“SLG”) module 210 generates the appropriate words to be spoken by the system in response, and a text-to-speech (“TTS”) module 212 synthesizes the speech for the user 202. Data and rules 214 are used to process data in each module.
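The flow of data through the modules of FIG. 2 can be sketched as a chain of functions. The following Python sketch is illustrative only and is not the implementation of any particular system: every function body, intent label, and response template is a hypothetical stand-in, and a real system would use trained models in place of the keyword rules shown here.

```python
# Hypothetical stand-ins for the modules of FIG. 2. All names, rules and
# templates below are assumptions made for illustration.

def asr(audio: str) -> str:
    """ASR 204: recognize the spoken words (here the 'audio' is already text)."""
    return audio.lower().strip()

def slu(words: str) -> str:
    """SLU 206: assign a meaning (an intent label) to the recognized words."""
    if "bill" in words or "balance" in words:
        return "ask_billing"
    return "unknown"

def dm(intent: str) -> str:
    """DM 208: determine the action to take in response to the intent."""
    actions = {"ask_billing": "give_billing_info"}
    return actions.get(intent, "ask_rephrase")

def slg(action: str) -> str:
    """SLG 210: generate the words the system will speak for the action."""
    templates = {
        "give_billing_info": "Here is your billing information.",
        "ask_rephrase": "Sorry, could you rephrase that?",
    }
    return templates[action]

def tts(text: str) -> bytes:
    """TTS 212: synthesize speech (here, simply encode the text as bytes)."""
    return text.encode("utf-8")

def respond(audio: str) -> bytes:
    """Run one user turn through ASR -> SLU -> DM -> SLG -> TTS."""
    return tts(slg(dm(slu(asr(audio)))))
```

Calling `respond("What is my BILL?")` passes the utterance through each stage in order. In a handcrafted system, the rules inside each stage correspond to the data and rules 214 that must be written or trained for the domain.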
Returning to FIG. 1, the modules must be trained on the “domain” related to the subject matter of the website in order to provide a spoken dialog that is sufficiently error-free to be acceptable. The handcrafted process results in a task representation model (104) that is then used to generate the dialog infrastructure (106).
As mentioned above, another approach to providing a voice interface to a website is VoiceXML (Voice Extensible Markup Language). VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications. However, VoiceXML requires programming each user interaction. VoiceXML therefore suffers from the same difficulties as the standard method of generating a spoken dialog system: it is costly to program, and it is costly to keep the voice interface up to date as website content changes.
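To illustrate why each user interaction must be programmed, the following Python sketch embeds a minimal VoiceXML 2.0 document and lists the prompts a developer had to author by hand. The form, field names, prompt wording, and grammar file names (`account.grxml`, `topic.grxml`) are hypothetical and chosen only for illustration; the element names and namespace are those of VoiceXML 2.0.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace

# Hypothetical hand-authored dialog: one <field> per question the system asks.
DOC = """
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="billing">
    <field name="account">
      <prompt>Please say your account number.</prompt>
      <grammar src="account.grxml" type="application/srgs+xml"/>
    </field>
    <field name="topic">
      <prompt>Do you want your balance or your due date?</prompt>
      <grammar src="topic.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
"""

def list_fields(vxml_text: str) -> list[tuple[str, str]]:
    """Return (field name, prompt text) for every field in the document."""
    root = ET.fromstring(vxml_text.strip())
    ns = {"v": VXML_NS}
    fields = []
    for field in root.findall(".//v:field", ns):
        prompt = field.find("v:prompt", ns)
        fields.append((field.get("name"), prompt.text.strip()))
    return fields
```

Each entry returned by `list_fields(DOC)` corresponds to one interaction, with its prompt and grammar, that must be written manually and revised whenever the underlying website content changes.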