1. Field of the Invention
The present invention relates to spoken dialog systems and, more specifically, to a system and method of using semantic and syntactic graphs for utterance classification.
2. Introduction
Goal-oriented spoken dialog systems aim to identify the intent of a human caller, expressed in natural language, and to take actions accordingly to satisfy the caller's requests. The intent of each speaker is identified using a natural language understanding component. For customer care applications, this step can be seen as a multi-label, multi-class call classification problem. An example customer care application may relate to a bank having a call-in dialog service that enables a bank customer to perform transactions by voice over the phone. As an example, consider the utterance, “I would like to know my account balance,” from a financial domain customer care application. Assuming that the utterance is recognized correctly by the automatic speech recognizer (ASR), the corresponding intent (call-type) would be “Request (Balance)” and the action would be either telling the user the balance after prompting for the account number or routing the call to the billing department.
Typically, these application-specific call-types are pre-designed, and large numbers of utterances manually labeled with call-types are used for training call classification systems. For classification, word n-grams are generally used as features. In the “How May I Help You?” (HMIHY) call routing system, selected word n-grams, namely “salient phrases,” that are salient to certain call-types play an important role. For instance, for the above example, the salient phrase “account balance” is strongly associated with the call-type “Request (Balance).” Instead of using salient phrases, one can leave the decision of determining useful features (word n-grams) to a classification algorithm. An alternative would be to use a vector space model for classification, where call-types and utterances are represented as vectors of word n-grams.
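The vector space alternative mentioned above can be illustrated with a minimal sketch. It is not the HMIHY system or any particular prior-art classifier; it assumes raw n-gram counts as vector components and cosine similarity as the matching score. The function names, the “Request (Payment)” call-type, and its two training utterances are hypothetical and are introduced only for illustration.

```python
from collections import Counter
from math import sqrt


def word_ngrams(utterance, max_n=2):
    """Return all word n-grams of order 1..max_n from a lowercased utterance."""
    words = utterance.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def to_vector(utterances, max_n=2):
    """Represent a set of utterances as a sparse vector of n-gram counts."""
    counts = Counter()
    for utterance in utterances:
        counts.update(word_ngrams(utterance, max_n))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(utterance, call_type_vectors, max_n=2):
    """Return the call-type whose vector is closest to the utterance vector."""
    vec = to_vector([utterance], max_n)
    return max(call_type_vectors,
               key=lambda call_type: cosine(vec, call_type_vectors[call_type]))


# Labeled training utterances. The "Request (Payment)" call-type and its
# utterances are hypothetical and are not taken from the text.
training = {
    "Request (Balance)": ["i would like to know my account balance",
                          "account balance"],
    "Request (Payment)": ["i want to pay my bill",
                          "make a payment"],
}

call_type_vectors = {ct: to_vector(utts) for ct, utts in training.items()}

print(classify("what is my account balance", call_type_vectors))
```

In this sketch the bigram “account balance” contributes to the score in the same way a salient phrase would, but no phrase is selected in advance; every n-gram observed in training is a vector component.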
Call classification is similar to text categorization, with three differences: the utterances are much shorter than typical documents used for text categorization (such as broadcast news or newspaper articles); because it deals with spontaneous speech, the utterances frequently include disfluencies or are ungrammatical; and ASR output is very noisy, with typically one out of every four words misrecognized.
Even though the shortness of the utterances may seem to suggest that call classification is an easy task, unfortunately, this is not the case. Call classification error rates typically range between 15% and 30%, depending on the application. This is mainly due to data sparseness, which stems from the nature of the input. Even for simple call-types like “Request (Balance),” there are many ways of uttering the same intent. Some examples include: “I would like to know my account balance,” “How much do I owe you,” “How much is my bill,” “What is my current bill,” “I'd like the balance on my account,” “account balance,” and “You can help me by telling me what my phone bill is.” Current classification approaches continue to perform intent classification using only the words within the utterance.
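The sparseness can be made concrete by measuring how few word n-grams the example utterances above share. The following is a minimal sketch; the whitespace tokenization, the Jaccard overlap measure, and the function names are assumptions chosen for illustration and are not part of any described system.

```python
from itertools import combinations


def ngram_set(utterance, n):
    """Return the set of word n-grams of order n, ignoring case, commas and periods."""
    words = utterance.lower().replace(",", "").replace(".", "").split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The seven "Request (Balance)" utterances quoted in the text.
utterances = [
    "I would like to know my account balance",
    "How much do I owe you",
    "How much is my bill",
    "What is my current bill",
    "I'd like the balance on my account",
    "account balance",
    "You can help me by telling me what my phone bill is",
]

# Print the unigram and bigram overlap for every pair of utterances.
for first, second in combinations(utterances, 2):
    unigram_overlap = jaccard(ngram_set(first, 1), ngram_set(second, 1))
    bigram_overlap = jaccard(ngram_set(first, 2), ngram_set(second, 2))
    print(f"{unigram_overlap:.2f} {bigram_overlap:.2f} | {first} | {second}")
```

For instance, “How much do I owe you” and “account balance” have no word in common, and “I would like to know my account balance” and “I'd like the balance on my account” share only the bigram “my account,” so the salient phrase “account balance” is absent from the paraphrase even though the intent is the same.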
Given this data sparseness, current classification approaches require extensive amounts of labeled data in order to train a classification system with reasonable performance. What is needed in the art is an improved system and method for spoken language understanding and classification.