Spoken language understanding systems have been deployed in numerous applications which require some sort of interaction between humans and machines. Most of the time, the interaction is controlled by the machine which asks questions of the users and then attempts to identify the intended meaning from their answers (expressed in natural language) and take actions in response to these extracted meanings.
One important class of applications employs Natural Language Understanding (NLU) technology for a type of semantic classification known as “call routing,” whose goal is to semantically classify a telephone query from a customer to route it to the appropriate set of service agents based on a brief spoken description of the customer's reason for the call. Call routing systems reduce queue time and call duration, thereby saving money and improving customer satisfaction by promptly connecting the customer to the right service representative in large call centers.
Call routing applications classify spoken inputs into a small set of categories for a particular application. Spoken inputs such as “I have a problem with my bill,” “Check my balance,” and “Did you get my payment?” might all be mapped to a “Billing” category. Since people express these requests in many different ways, call routers are typically implemented as statistical classifiers trained on a labeled corpus, that is, a set of spoken requests and their classifications.
Determining a semantic classification for a human utterance in a call routing system is typically a two-step process as illustrated by FIG. 1. First, input speech from the caller is translated into a text string by an Automated Speech Recognition (ASR) Module 101. Second, the ASR text is passed to an NLU semantic classification component known as a Statistical Router 102. The Statistical Router 102 models the NLU task as a statistical classification problem in which the ASR text corresponding to an utterance is assigned to one or more of a set of predefined user intents, referred to as “call routes.” Various specific classifiers have been compared in the literature with similar performance (1-2% differences in classification accuracy), including, for example, Boosting, Maximum Entropy (ME), and Support Vector Machines (SVM). For example, Statistical Router 102 may use binary unigram features and a standard back-propagation neural network as a classifier.
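By way of illustration only, the binary unigram features mentioned above can be sketched as follows. The function names `build_vocabulary` and `binary_unigram_features`, the lowercase whitespace tokenization, and the sample ASR strings are assumptions made for this sketch and are not taken from any particular router implementation.

```python
def build_vocabulary(utterances):
    """Map each distinct lowercase token to a feature index (sorted order)."""
    tokens = sorted({tok for utt in utterances for tok in utt.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def binary_unigram_features(asr_text, vocab):
    """Return a 0/1 vector: 1 if the word occurs in the text, else 0.

    Repeated words still yield 1 (binary, not counts), and words outside
    the vocabulary are ignored.
    """
    vector = [0] * len(vocab)
    for tok in asr_text.lower().split():
        idx = vocab.get(tok)
        if idx is not None:
            vector[idx] = 1
    return vector


# Hypothetical ASR outputs used only to build a small vocabulary.
asr_outputs = ["check my balance", "my bill my bill is wrong"]
vocab = build_vocabulary(asr_outputs)
features = binary_unigram_features("my bill my bill is wrong", vocab)
```

A vector of this kind would then serve as the input layer of whichever classifier the router uses.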
Typically, to create a new call routing application, a new training corpus must initially be developed based on the specific needs of the new application. FIG. 2 shows this process generally. A training corpus 201 contains sample training utterances 202 which are labeled with associated router classification tags 203. A feature set is selected from the training corpus 201 (e.g., the words in the sample training utterances 202) and, together with a classification model 205 (e.g., a neural network), is used to build and train a call routing classifier 204 for the application. This is an expensive process because a large labeled training corpus 201 must be collected and developed for each new application. After the call routing classifier 204 has been trained on the training corpus 201, it can be deployed in the application to process live, unlabeled incoming utterances from real users of the on-line application.
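The training process described above can be sketched, under stated assumptions, as follows. For brevity, a single-layer softmax network trained by stochastic gradient descent stands in for the back-propagation neural network. The tiny corpus, the route names, and the hyperparameters (`epochs`, `lr`) are hypothetical and chosen only to keep the example self-contained.

```python
import math

# Hypothetical labeled corpus: (training utterance, router classification tag).
TRAINING_CORPUS = [
    ("i have a problem with my bill", "Billing"),
    ("check my balance", "Billing"),
    ("did you get my payment", "Billing"),
    ("my internet is not working", "TechSupport"),
    ("the modem keeps dropping the connection", "TechSupport"),
    ("i cannot connect to the internet", "TechSupport"),
]


def tokenize(text):
    return text.lower().split()


def featurize(text, vocab):
    """Binary unigram feature vector over the training vocabulary."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_probabilities(x, weights, biases):
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train(corpus, epochs=200, lr=0.5):
    """Fit one weight row per call route with cross-entropy gradient steps."""
    words = sorted({tok for text, _ in corpus for tok in tokenize(text)})
    vocab = {tok: i for i, tok in enumerate(words)}
    routes = sorted({label for _, label in corpus})
    weights = [[0.0] * len(vocab) for _ in routes]
    biases = [0.0] * len(routes)
    for _ in range(epochs):
        for text, label in corpus:
            x = featurize(text, vocab)
            probs = route_probabilities(x, weights, biases)
            for k, route in enumerate(routes):
                # Gradient of cross-entropy w.r.t. the score of route k.
                err = probs[k] - (1.0 if route == label else 0.0)
                biases[k] -= lr * err
                for j, xi in enumerate(x):
                    if xi:
                        weights[k][j] -= lr * err
    return vocab, routes, weights, biases


def classify(text, model):
    """Return the most probable call route and its probability."""
    vocab, routes, weights, biases = model
    probs = route_probabilities(featurize(text, vocab), weights, biases)
    best = max(range(len(routes)), key=probs.__getitem__)
    return routes[best], probs[best]


model = train(TRAINING_CORPUS)
route, confidence = classify("problem with my payment", model)
```

Even in this toy form, the sketch shows why the process is expensive: every new application needs its own labeled corpus and its own trained weights before `classify` can be applied to live utterances.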
Different applications have different call routing classifiers based on their own specific needs. There is usually no simple many-to-one or one-to-many mapping from the routers of one application to those of another. In the machine learning community, the most common way of reusing knowledge is to induce a bias for the model concerned based on the existing data, with the assumption that the “inductive bias” would also work for the new data. This assumption often does not hold when the existing and new data come from different applications and domains.
A framework taking the joint outputs of different classifiers and mapping them to the desired output was described by K. D. Bollacker and J. Ghosh, A Scalable Method For Classifier Knowledge Reuse, in Proceedings of the 1997 International Conference on Neural Networks, pp. 1474-79, June 1997, which is hereby incorporated by reference. However, such a method is very difficult to scale because the number of joint outputs grows exponentially as more classifiers are added.
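The scaling problem noted above can be illustrated with simple arithmetic. The sketch assumes, for illustration only, that each reused classifier emits exactly one label and that a joint output is the tuple of those labels. The function name `joint_output_count` and the figure of ten call routes per classifier are hypothetical.

```python
from math import prod


def joint_output_count(outputs_per_classifier):
    """Number of distinct label tuples when each classifier emits one label."""
    return prod(outputs_per_classifier)


# Hypothetical setting: every reused classifier has 10 call routes.
for n_classifiers in (1, 2, 4, 8):
    print(n_classifiers, joint_output_count([10] * n_classifiers))
```

With ten routes per classifier, four classifiers already yield 10,000 joint outputs to be mapped, and eight yield 100,000,000.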
Karahan et al., Combining Classifiers for Spoken Language Understanding, Proceedings of ASRU-2003, 8th Biannual IEEE workshop on Automatic Speech Recognition and Understanding (ASRU '03), U.S. Virgin Islands, Nov. 30-Dec. 3, 2003, the contents of which are incorporated by reference, described combining the scores of different classifiers in a final classifier that also combines low-level features, where the classifiers share a single common set of meanings. This means that the sharing classifiers are trained on subsets of the same tagged training set, or with data sets that have the same set of tagged meanings. In essence, there is an injection of hard knowledge from one classifier to another in that all the classifiers are required to be trained with the same set of call routes.