1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to a system and method of using human-human labeled utterance data for training spoken language understanding systems.
2. Introduction
Spoken dialog systems require various components or modules to intelligently receive human speech, understand the speech and the intent of the speaker, generate an appropriate response text, and synthesize an audible response. The natural language understanding (NLU) module within the spoken dialog system receives text from an automatic speech recognition module and determines the intent, or meaning, of the utterance. At the heart of the NLU module is a semantic classifier. This semantic classifier is trained off-line, using labeled utterances, to make such a determination. Training utterances may be obtained from several different sources. For example, a company that is developing an NLU system may have recordings of communications between its call center and customers. If the call center is staffed by humans, then these would be human-human utterances. Human-machine dialogs, by contrast, typically refer to dialogs between a computer system and a human, such as a customer talking to an automated dialog system.
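By way of illustration only, training a semantic classifier off-line from labeled utterances may be sketched as follows. The sketch assumes a simple bag-of-words naive Bayes model with add-one smoothing. The present description does not specify the classifier type, and the call-type labels, example utterances, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled utterances: (transcribed text, call-type label).
TRAINING_DATA = [
    ("i want to check my account balance", "AccountBalance"),
    ("what is my balance", "AccountBalance"),
    ("i need to pay my bill", "PayBill"),
    ("can i pay the bill by phone", "PayBill"),
    ("let me talk to a representative", "Representative"),
    ("i want to speak to an agent", "Representative"),
]


def tokenize(utterance):
    """Split an utterance into lowercase word features."""
    return utterance.lower().split()


def train(labeled_utterances):
    """Count words per call type (the off-line training step)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, call_type in labeled_utterances:
        class_counts[call_type] += 1
        for word in tokenize(text):
            word_counts[call_type][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, utterance):
    """Return the call type with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for call_type, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denom = sum(word_counts[call_type].values()) + len(vocab)
        for word in tokenize(utterance):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[call_type][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = call_type, score
    return best_label


model = train(TRAINING_DATA)
print(classify(model, "pay my bill please"))
```

In this sketch every word of an utterance is a feature, so a long utterance that touches on several call types contributes many features to the counts of whichever single label it carries.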
Before the deployment of a new NLU system, the human-machine dialogs necessary for training a semantic classifier may not be available. Human-human utterances, on the other hand, are much more commonly available, since companies typically already have such recordings and they cost far less to obtain. Because human-human dialogs do not represent the actual human-machine dialogs, training the semantic classifier directly on human-human utterances does not yield a good model for human-machine interaction. The call-type distribution, length, perplexity and other characteristics of human-human utterances differ substantially from those of human-machine utterances. For example, some call types that are very frequent in human-machine dialogs, such as requesting a customer service representative, are missing from human-human dialogs. Human-human utterances are on average three times longer than human-machine utterances and include multiple sentences and sentential clauses. Classifier performance is generally worse on utterances meant for human interaction. Long, incoherent utterances, which typically contain more than one semantic class, confuse the learning algorithm because they contain many features, most of which are useless for the task at hand. The classifier therefore must learn not only which features are important, but also which features are associated with which class. As can be appreciated, when training a semantic classification model for an NLU module, human-human interactions, although generally available, are not always helpful. At the same time, because usable training data is lacking, training the NLU module is costly and requires experts to perform the task. Accordingly, what is needed in the art is a more efficient way to train NLU systems using existing utterances.