1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to a system and method of automating the development of web-based spoken dialog systems.
2. Discussion of Related Art
Spoken dialog systems provide individuals and companies with a cost-effective means of communicating with customers. For example, a spoken dialog system can be deployed as part of a telephone service that enables users to call in and talk with the computer system to receive billing information or other telephone service-related information. In order for the computer system to understand the words spoken by the user, a process of generating data and training recognition grammars is necessary. The resulting grammars generated from the training process enable the spoken dialog system to accurately recognize words spoken within the “domain” that it expects. For example, the telephone service spoken dialog system will expect questions and inquiries about subject matter associated with the user's phone service. Developing such spoken dialog systems is a labor-intensive process that can take many human developers months to complete.
Many companies desire a voice interface with the company web-site. The prevalent method of creating such a spoken dialog service requires a handcrafted process of using data as well as human knowledge to manually create a task representation model that is further used for the general dialog infrastructure. Several approaches are currently used to create the dialog such as using VoiceXML, described below, and handcrafting a spoken dialog system, discussed next.
The general process of creating a handcrafted spoken dialog service is illustrated in FIG. 1. The process requires a database of information and human task knowledge (102). For example, to provide a voice interface to a web-site, human interaction is required to review the text of the web-site and manually assign parameters to the text in order to train the various automatic speech recognition, natural language understanding, dialog management and text-to-speech modules in a spoken dialog system.
A typical spoken dialog system includes the general components or modules illustrated in FIG. 2. The spoken dialog system 110 may operate on a single computing device or on a distributed computer network. The system 110 receives speech sounds from a user 112 and operates to generate a response. The general components of such a system include an automatic speech recognition (“ASR”) module 114 that recognizes the words spoken by the user 112. A spoken language understanding (“SLU”) module 116 associates a meaning to the words received from the ASR 114. A dialog management (“DM”) module 118 manages the dialog by determining an appropriate response to the customer question. Based on the determined action, a language generation (“LG”) module 120 generates the appropriate words to be spoken by the system in response and a Text-to-Speech (“TTS”) module 122 synthesizes the speech for the user 112.
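The flow through the modules of FIG. 2 can be sketched as a simple chain of functions. This is only an illustrative sketch: the class name, the function names, and the toy stand-ins for each module are hypothetical, and "audio" is represented here as UTF-8 text bytes so the example can run without any speech libraries.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SpokenDialogSystem:
    """Hypothetical container wiring together the five modules of FIG. 2."""
    asr: Callable[[bytes], str]            # ASR 114: audio -> recognized words
    slu: Callable[[str], Dict[str, str]]   # SLU 116: words -> meaning
    dm: Callable[[Dict[str, str]], str]    # DM 118: meaning -> system action
    lg: Callable[[str], str]               # LG 120: action -> response text
    tts: Callable[[str], bytes]            # TTS 122: response text -> audio

    def respond(self, audio: bytes) -> bytes:
        words = self.asr(audio)
        meaning = self.slu(words)
        action = self.dm(meaning)
        text = self.lg(action)
        return self.tts(text)


# Toy stand-ins (assumptions for illustration, not real recognizers).
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8").lower()


def toy_slu(words: str) -> Dict[str, str]:
    return {"intent": "billing" if "bill" in words else "unknown"}


def toy_dm(meaning: Dict[str, str]) -> str:
    return "report_balance" if meaning["intent"] == "billing" else "ask_rephrase"


def toy_lg(action: str) -> str:
    templates = {
        "report_balance": "Here is your billing information.",
        "ask_rephrase": "Could you rephrase that?",
    }
    return templates[action]


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def build_toy_system() -> SpokenDialogSystem:
    return SpokenDialogSystem(toy_asr, toy_slu, toy_dm, toy_lg, toy_tts)
```

In a deployed system each stand-in would be replaced by a trained module, which is where the data collection and training effort discussed in this section arises.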
Returning to FIG. 1, the modules must be trained on the “domain” related to the subject matter of the web-site in order to provide a spoken dialog that is sufficiently error-free to be acceptable. The handcrafted process results in a task representation model (104) that is then used to generate the dialog infrastructure (106).
Once a design team completes the spoken dialog system for a particular web-site, the system is complete and “static.” That is, the system is up-to-date for the current status of the products, services, and information contained on the company web-site at the time the system is deployed. However, if a new product or service offering is added to the web-site, the company must update the spoken dialog system since the “domain” of information is now different. Humans must then again review the updated web-site and provide the further information and parameters to the spoken dialog system to keep it up to date. Keeping the spoken dialog system current can therefore quickly become expensive, well beyond the initial development phase.
The difficulty with the training component of deploying a spoken dialog system is that the cost and time required preclude some companies from participating in the service. The cost may keep smaller companies from seeking this money-saving service. Larger companies may be hindered from employing such a service because of the delay required to prepare the system.
As mentioned above, another attempt at providing a voice interface to a web-site is VoiceXML (Voice Extensible Markup Language). VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications. However, VoiceXML requires programming each user interaction. FIG. 3 illustrates a portion of the source code for a VoiceXML dialog. Using the source code of FIG. 3, the following dialog can occur:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said.
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues dialog . . . ).
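The prompt-and-reprompt behavior in the dialog above can be approximated with a short sketch. This is not the source code of FIG. 3 and not VoiceXML; it is a hypothetical Python simulation in which the option list, the prompt strings, and the function name are hard-coded assumptions, chosen to show why every interaction must be programmed by hand.

```python
from typing import Iterable, List, Optional, Tuple

# Hard-coded by the programmer; adding a new drink means editing both lines.
OPTIONS = ("coffee", "tea", "milk", "nothing")
PROMPT = "Would you like coffee, tea, milk, or nothing?"
NO_MATCH = "I did not understand what you said."


def run_drink_field(utterances: Iterable[str]) -> Tuple[Optional[str], List[str]]:
    """Prompt until an utterance matches one of OPTIONS.

    Returns the matched option (or None if the input runs out) together
    with a transcript of the exchange.
    """
    transcript: List[str] = []
    for utterance in utterances:
        transcript.append(f"C: {PROMPT}")
        transcript.append(f"H: {utterance}")
        choice = utterance.strip().lower().rstrip(".")
        if choice in OPTIONS:
            return choice, transcript
        # Anything outside the fixed option list is rejected and reprompted.
        transcript.append(f"C: {NO_MATCH}")
    return None, transcript
```

Calling `run_drink_field(["Orange juice.", "Tea"])` reproduces the exchange above: the first reply is rejected because it is outside the fixed option list, and the second is accepted.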
Such a VoiceXML dialog must be programmed by the web programmer and any update to the web-site, such as a new product offering, will also require reprogramming to synchronize and bring the spoken dialog interaction up to date. Therefore, the VoiceXML programming language suffers from the same difficulties as does the standard method of generating a spoken dialog system in that it is costly to program and costly to keep the voice interface up-to-date as web-site content changes.
Other task representation models include an object-based model (discussed in Abella, A. and Gorin, A. L., “Construct algebra: Analytical dialog management”, Proc. ACL, Washington, D.C., 20-26, Jun. 1999), a table-based model (discussed in Roberto Pieraccini, Esther Levin, and Wieland Eckert, “AMICA: the AT&T Mixed Initiative Conversational Architecture”, EuroSpeech97, Vol. 4, pp. 1875-1878 (1997)) and a script-based model (discussed in Xu, W. and Rudnicky, A., “Task-based dialog management using an agenda”, ANLP/NAACL 2000 Workshop on Conversational Systems, May 2000, pp. 42-4). Within these frameworks, application authors are required to carefully define the relationships that exist in the task knowledge and predict all possible dialog states. However, experience has shown that application logic usually goes beyond the scope that the generic dialog infrastructure and task representation model can cover. Even if a perfect generic dialog infrastructure and an associated efficient task representation model can be found, collecting task data, analyzing it and tailoring the data into these models still remain laborious and require tremendous expertise. This is also a crucial bottleneck that keeps speech, natural language and dialog technologies from contributing to a wide range of applications.
Another attempt at providing dialog access to web-site data comes from a company called Soliloquy. Soliloquy provides a product called a “dialog expert” that enables customers to ask about what information, products or services a company provides. The product provides a dialog exchange that helps the customer to receive information about the company. Soliloquy attempts to provide a natural language dialog on a web-site as if the customer were talking with a real person. While Soliloquy's dialog experts provide some interaction with the user, there are many limitations on their use.
First, Soliloquy's dialog expert uses a table search method. In this method, a table of potential answers to questions is created as well as keywords that may be used to trigger the answer to those questions. Such systems are handcrafted and, as such, they require time and money to generate the dialog expert.
Further, if a company using Soliloquy's product wanted to update the information related to the dialog expert, the tables that are searched when communicating with an end user would have to be updated and the dialog expert updated as well. This is a costly and lengthy process that can reduce the profitability of using the service if products continually are added and dropped.
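A table search of the general kind described above can be sketched as follows. This is a generic illustration under stated assumptions, not Soliloquy's actual implementation: the table entries, the keyword sets, the scoring by keyword overlap, and the function name are all invented placeholders.

```python
import re
from typing import List, Optional, Set, Tuple

# Handcrafted table: each row pairs trigger keywords with a canned answer.
# All entries are invented placeholders for illustration.
ANSWER_TABLE: List[Tuple[Set[str], str]] = [
    ({"hours", "open", "close"}, "We are open from 9 a.m. to 5 p.m."),
    ({"price", "cost", "much"}, "Pricing is listed on the products page."),
    ({"return", "refund"}, "Items can be returned within 30 days."),
]


def lookup_answer(question: str) -> Optional[str]:
    """Return the answer whose keywords overlap most with the question.

    Returns None when no keyword in the table appears in the question.
    """
    words = set(re.findall(r"[a-z']+", question.lower()))
    best_answer: Optional[str] = None
    best_hits = 0
    for keywords, answer in ANSWER_TABLE:
        hits = len(keywords & words)
        if hits > best_hits:
            best_answer, best_hits = answer, hits
    return best_answer
```

Because every row is written by hand, a question about a newly added product returns no answer until someone edits the table, which is the maintenance cost noted above.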
Another service at www.internetspeech.com provides some speech capabilities to web-sites. This service enables a user to call a telephone number and, through a voice dialog, navigate the Internet. After the user enters an account number and PIN, the dialog may proceed as follows:
Computer: Welcome to NetEcho. Please select from the following menu by saying “that one” or the category name: surf the net . . . web search . . . stock quote . . . my email . . . today's news . . . more options . . . enable link . . . .
User: That one (the user says this after the computer says “enable link” to select “enable link”).
Computer: Link mode is enabled. Main menu, please select from the following menu: . . . surf the net . . . popular web-sites . . . .
User: Surf the net.
Computer: To go to our popular web-site, say “popular web-site;” to go to bookmarked web-sites, say “bookmarked web-site.”
User: Popular web-site.
Computer: Please say your popular web-site.
User: LHS
Computer: I think you said LHS.com, correct?
User: Yes
Computer: Thank you, I am going to LHS.com Company highlights. LHS introduces voice express version 5. Link. Their Dictaphone demonstrates call center solutions at first conference. Link. Internetspeech.com selects Lernout and Hauspie Text-to-Speech for its audio Internet product. Link.
User: That one.
Computer: Internetspeech.com, an audio internet service provider, today announced a licensing agreement with Lernout and Hauspie, seller of speech language products, technologies and services, to integrate Realspeak in their Netecho product . . . .
User: Stop.
The above dialog enables the user to obtain access to web-site content via the telephone. As is clear from the dialog, however, the user still must navigate a menu system. The computer identifies links to the user by stating paragraphs from a web-site and then stating “link.” From this, the user may listen to headlines or statements associated with each link and then say “that one” to go to the linked information. While this method enables a user to get to web-site content, this process is cumbersome. For example, if a user desires to receive information that may be contained in the last paragraph of an article, the user must select links to get to an article and then listen to the entire article until getting to the desired information.
What is needed is a system and method of audibly navigating a web-site that enables a user to quickly receive web-site content. Further, what is needed is a system and method of quickly creating a spoken language dialog service for a web-site that is also easily updated and maintained with less human intervention.