1. Field of the Invention
The present invention relates to spoken language understanding in human-computer dialogs and more specifically to a system and method of improving spoken language understanding in view of grammatically incorrect utterances and unpredictable errors in the input to speech recognition modules.
2. Discussion of Related Art
The present invention relates to spoken dialog systems. Such systems typically contain well-known modules for engaging in a human-computer dialog. The modules include an automatic speech recognition module, a spoken language understanding module, a dialog management module, and a text-to-speech module. Each module processes its input and transmits its output to the next module, so that the system as a whole recognizes speech from a person, understands the meaning of the speech, formulates a response, and generates synthetic speech to “respond” to the person.
FIG. 1 shows the architecture of a typical spoken dialog system 100. In this architecture, speech is recognized by the speech recognition module 102, and an information extractor 104 processes the recognized text and identifies the named entities in the input, e.g., phone numbers, times, and monetary amounts. After substituting a suitable symbol for each named entity, the information extractor 104 passes the recognized text on to the spoken language understanding unit (SLU) 106. The SLU 106 processes this input and generates a semantic representation, i.e., it transforms the input into another language that can be understood by a computer program, usually called a dialog manager (DM) 108. The DM 108 is typically equipped with an interpreter 110 and a problem solver 112 to determine and generate a response to the user. The information generated by the DM 108 is transmitted to a text-to-speech (TTS) module 114, which generates synthetic speech to provide the response of the system to the user 116. Information regarding the general operation of each of these components is well known to those of skill in the art and therefore only a brief introduction is provided herein.
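The substitution step performed by the information extractor 104 can be illustrated with a minimal Python sketch. The regular expressions, the placeholder symbols (`<PHONE>`, `<MONEY>`, `<TIME>`), and the function name `extract_entities` are assumptions made for illustration only; the description above names the entity types but does not specify how they are detected or which symbols are substituted.

```python
import re

# Hypothetical patterns and placeholder symbols. The entity types come from
# the description (phone numbers, times, monetary amounts); the detection
# rules and symbol names are illustrative assumptions.
ENTITY_PATTERNS = [
    ("<PHONE>", re.compile(r"\b\d{3}[-\s]\d{3}[-\s]\d{4}\b")),
    ("<MONEY>", re.compile(r"\$\d+(?:\.\d{2})?")),
    ("<TIME>", re.compile(r"\b\d{1,2}:\d{2}\b")),
]


def extract_entities(recognized_text):
    """Replace each named entity with a symbol.

    Returns the rewritten text together with a list of
    (symbol, original value) pairs, grouped by entity type.
    """
    found = []
    text = recognized_text
    for symbol, pattern in ENTITY_PATTERNS:
        # Record the matched values before replacing them with the symbol.
        for match in pattern.finditer(text):
            found.append((symbol, match.group(0)))
        text = pattern.sub(symbol, text)
    return text, found


if __name__ == "__main__":
    rewritten, entities = extract_entities(
        "call me at 555-123-4567 about the $20.00 charge at 10:30"
    )
    print(rewritten)
    print(entities)
```

In this sketch the rewritten text, rather than the raw recognizer output, is what would be handed to the SLU 106, while the recorded values remain available to later modules.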
The present disclosure relates to the spoken language understanding module. This module receives output from the automatic speech recognition module in the form of a stream of text that represents, to the best of the system's ability, what the user has said. The next step in the dialog process is to “understand” what the user has said, which is the task of the spoken language understanding unit. Recognizing speech spoken by a person and understanding that speech through natural language understanding is difficult, and several factors add to the complexity. First, human interactions through speech seldom contain grammatically correct utterances. Therefore, the text output transmitted to the spoken language understanding module from the recognition module will not always contain coherent sentences or statements. Second, speech recognition software introduces unpredictable errors in the input. For these reasons, semantic analysis based on syntactic structures of the language is bound to fail.
One known attempt to achieve spoken language understanding is to apply a classifier that classifies the input directly into one of the limited number of actions the dialog system can take. Such techniques work well when there is a small number of classes to deal with, e.g., in call routing systems. However, these approaches do not scale well to tasks that require a very large number of classes, e.g., problem-solving tasks, because it is humanly impossible to consistently label the very large amount of data that would be needed to train such a classifier.
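The direct-classification approach can be illustrated with a minimal Python sketch for a call-routing-sized task. The description does not name a particular classifier; a bag-of-words naive Bayes model with add-one smoothing is used here purely as an assumed stand-in, and the action labels, training utterances, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labeled utterances: each is tagged with the action the
# dialog system should take. Every additional action needs its own
# consistently labeled examples, which is where the labeling cost grows.
EXAMPLES = [
    ("i want to check my bill", "billing"),
    ("question about a charge on my bill", "billing"),
    ("my phone is not working", "repair"),
    ("the line is dead and not working", "repair"),
]


def train(labeled_utterances):
    """Count words per action and utterances per action."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, action in labeled_utterances:
        class_counts[action] += 1
        word_counts[action].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def classify(text, model):
    """Return the action with the highest naive Bayes log-score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_action, best_score = None, -math.inf
    for action in class_counts:
        score = math.log(class_counts[action] / total)
        denom = sum(word_counts[action].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[action][word] + 1) / denom)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


MODEL = train(EXAMPLES)

if __name__ == "__main__":
    print(classify("there is a wrong charge on my bill", MODEL))
    print(classify("my line is not working", MODEL))
```

With two actions, four labeled utterances are enough for this toy model to separate the classes. The same scheme applied to a problem-solving task would need labeled examples for every one of a very large number of classes.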
What is needed is an improved method of processing the data that increases the accuracy of the spoken language understanding module and that scales to enable a general application of the spoken language understanding module beyond a specific domain.