1. Field of the Invention
The present invention relates to providing a spoken dialog system and more specifically to providing an automated data-collection component to the design process of building a spoken dialog system.
2. Introduction
The present invention relates to spoken dialog systems. FIG. 1 illustrates the general features of such a dialog system 100 that enables a person to interact with a computer system using natural language. Natural language spoken dialog system 100 may include an automatic speech recognition (ASR) module 102, a spoken language understanding (SLU) module 104, a dialog management (DM) module 106, a spoken language generation (SLG) module 108, and a text-to-speech (TTS) module 110.
ASR module 102 may analyze speech input and may provide a transcription of the speech input as output. SLU module 104 may receive the transcribed input and may use a natural language understanding model to analyze the group of words included in the transcribed input to derive a meaning from the input. DM module 106 may receive the meaning of the speech input as input and may determine an action, such as, for example, providing a spoken response, based on the input. SLG module 108 may generate a transcription of one or more words in response to the action provided by DM module 106. TTS module 110 may receive the transcription as input and may provide generated audible speech as output based on the transcribed speech.
Thus, the modules of system 100 may recognize speech input, such as speech utterances, may transcribe the speech input, may identify (or understand) the meaning of the transcribed speech, may determine an appropriate response to the speech input, may generate text of the appropriate response and, from that text, may generate audible “speech” that the user then hears. In this manner, the user can carry on a natural language dialog with system 100. Those of ordinary skill in the art will understand the programming languages and means for generating and training ASR module 102 or any of the other modules in the spoken dialog system. Further, the modules of system 100 may operate independently of a full dialog system. For example, a computing device such as a smartphone (or any processing device having a phone capability) may have an ASR module wherein a user may say “call mom” and the smartphone may act on the instruction without a “spoken dialog.”
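By way of illustration only, the flow of data through modules 102, 104, 106, 108 and 110 described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the class name DialogSystem, the toy_* stand-in functions and the prompt wording are hypothetical, and each stand-in replaces what would in practice be a trained model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogSystem:
    """Chains the five modules; each field is a placeholder callable."""
    asr: Callable[[bytes], str]   # speech input -> transcription
    slu: Callable[[str], str]     # transcription -> meaning
    dm: Callable[[str], str]      # meaning -> action
    slg: Callable[[str], str]     # action -> response text
    tts: Callable[[str], bytes]   # response text -> audible speech

    def respond(self, speech_input: bytes) -> bytes:
        transcription = self.asr(speech_input)
        meaning = self.slu(transcription)
        action = self.dm(meaning)
        response_text = self.slg(action)
        return self.tts(response_text)


# Hypothetical stand-ins: "audio" is simply UTF-8 text in this sketch.
def toy_asr(speech_input: bytes) -> str:
    return speech_input.decode("utf-8")


def toy_slu(transcription: str) -> str:
    if "balance" in transcription.lower():
        return "Request(Account_Balance)"
    return "Other"


def toy_dm(meaning: str) -> str:
    if meaning == "Request(Account_Balance)":
        return "prompt_account_number"
    return "reprompt"


def toy_slg(action: str) -> str:
    # Prompt wording is assumed for illustration.
    prompts = {
        "prompt_account_number": "Please say your account number.",
        "reprompt": "How may I help you?",
    }
    return prompts[action]


def toy_tts(response_text: str) -> bytes:
    return response_text.encode("utf-8")


system = DialogSystem(toy_asr, toy_slu, toy_dm, toy_slg, toy_tts)
```

Because each module is passed in as a separate callable, any one of them (for example, the ASR stand-in) may also be used on its own, consistent with the modules operating independently of a full dialog system.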
Natural language spoken dialog systems 100 aim to identify user intents and to take actions accordingly to satisfy the users' requests. One purpose of such systems is to automate call routing, where the task is to classify the intent of the user, expressed in natural language, into one or more predefined call-types. Then, according to the call-type assigned to the utterance, contextual information and other service parameters, the DM module 106 would decide on the next prompt or direct the call to a specific destination such as an automated Interactive Voice Response (IVR) system or a live operator. As a call classification example, consider the utterance “I would like to know my account balance,” in a customer care banking application. Assuming that the utterance is recognized correctly, the corresponding intent or call-type would be Request(Account_Balance). The action would be to prompt for the account number and provide the account balance or to route the call to the Billing Department.
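The account balance example above can be sketched, purely for illustration, as a classification step followed by a routing decision. The sketch assumes a keyword lookup in place of a trained SLU classifier; the function names, the "Other" fallback call-type, the self_service_available flag (standing in for the "other service parameters") and the fallback to a live operator are all hypothetical.

```python
# Hypothetical call-type inventory; a deployed system would use a trained
# classifier rather than keyword matching.
CALL_TYPE_KEYWORDS = {
    "Request(Account_Balance)": ("account balance",),
}


def classify_call_types(utterance: str) -> list[str]:
    """Return every predefined call-type whose key phrase occurs in the utterance."""
    text = utterance.lower()
    matched = [
        call_type
        for call_type, phrases in CALL_TYPE_KEYWORDS.items()
        if any(phrase in text for phrase in phrases)
    ]
    return matched or ["Other"]  # assumed fallback when nothing matches


def decide_next_step(call_types: list[str], self_service_available: bool = True) -> str:
    """Choose the next prompt or a routing destination from the call-types."""
    if "Request(Account_Balance)" in call_types:
        if self_service_available:
            return "prompt: account number"
        return "route: Billing Department"
    return "route: live operator"  # assumed default destination
```

For instance, classify_call_types("I would like to know my account balance") yields the single call-type Request(Account_Balance), for which decide_next_step returns the account number prompt by default.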
Typically, some initial task data are needed for designing the system and determining the nature of the call-types. These data can then be used in training the ASR module 102 and the SLU module 104 classifier models to bootstrap the initial version of the system. Since human-human interactions are very different from human-machine interactions in terms of style, language and linguistic behavior, initial data are collected via hidden human agent systems. Some have referred to these hidden human agents as “wizard-of-oz” agents. In such systems, the users interact only with a hidden human agent who simulates the behavior of the system in such a way that the caller believes he or she is interacting with the real system. The amount of data required to properly capture the caller's naturally expressed intentions varies and depends on the application domain. Best practice in the natural language service field suggests that ten or fifteen thousand utterances are needed to bootstrap a system with reasonable ASR and SLU coverage. In these real-world service scenarios, hidden human agent systems tend not to scale in terms of the cost and time required to complete the initial data collection.
For routing applications, where the user intentions are typically expressed in the first few turns of the dialog, a simpler approach, which may be called a “ghost wizard,” has been used in some natural language data collections without requiring a human ‘behind the curtains’. In that case, the initial system greets users and records one or two user responses. Although a ghost wizard approach scales better for large data collections since it does not require a live person, the ghost wizard does not handle generic discourse illocutionary acts, such as vague questions, greetings, thanks and agreements, that have little or no relevance for the actual service task. Also, in cases where the user has made a specific request in the first turn, the ghost wizard may annoy the user.
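For illustration only, the ghost wizard behavior described above (greet the user, then record one or two responses with no human agent involved) can be sketched as follows. The prompt wording, the function name ghost_wizard_session and the listen callback are hypothetical; the sketch assumes that the callback plays a prompt and returns the caller's utterance, or None if the caller has hung up.

```python
from typing import Callable, List, Optional

GREETING = "How may I help you?"      # assumed prompt wording
FOLLOW_UP = "Please tell me more."    # assumed prompt wording


def ghost_wizard_session(
    listen: Callable[[str], Optional[str]],
    max_turns: int = 2,
) -> List[str]:
    """Play fixed prompts and record up to max_turns caller utterances."""
    recorded: List[str] = []
    prompt = GREETING
    for _ in range(max_turns):
        utterance = listen(prompt)
        if utterance is None:  # caller hung up
            break
        recorded.append(utterance)
        # The next prompt is the same whatever the caller said: the ghost
        # wizard records responses but never interprets them.
        prompt = FOLLOW_UP
    return recorded
```

Because the follow-up prompt is fixed, a greeting such as "hello" and a specific request receive the same reply, which illustrates the two limitations noted above.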
What is needed in the art is an improved system and method of providing data collection when designing and building a spoken dialog system. Such a system should enable scaling and improved collection of data.