Multi-category data classification is commonly performed by humans. For example, when assigning a customer question to an expert, a customer service operator classifies the question into one of many categories, each supported by an expert. Generally, the information included in the customer question is insufficient to identify the relevant category, so the customer service operator must draw on background knowledge to classify the question correctly. For example, classifying the question “My focus is not starting” requires background knowledge that the term “focus” refers to a type of car. Hence, to accurately classify certain questions or other data, knowledge from external sources is sometimes needed to account for deficiencies in the information provided by the question or other data.
External knowledge used to classify data can be obtained from various sources having many forms, such as structured datasets, unstructured datasets, labeled datasets or unlabeled datasets. An example of a labeled dataset is the “YAHOO® ANSWERS” online data repository, which includes question and answer pairs organized by categories and sub-categories. Another example of an external dataset is “WIKIPEDIA,” which includes multiple articles organized by title and loosely organized by category. As the number of external knowledge sources, or “auxiliary datasets,” has grown rapidly, research in fields such as semi-supervised learning, multi-task learning, transfer learning and domain adaptation has also increased to better identify methods for utilizing the growing number of auxiliary datasets.
Conventionally, a “bag of words” approach has been used to incorporate knowledge from auxiliary datasets into data classification; however, this approach incorporates nothing from the external datasets into classification beyond additional words. Hence, this “bag of words” approach often misclassifies words that have multiple meanings spanning different categories. Other conventional methods for incorporating data from auxiliary datasets into classification are also limited because different auxiliary datasets have different properties. These different properties prevent application of conventional methods of knowledge transfer, such as transfer learning, multi-task learning or self-taught learning, to multiple types of auxiliary datasets. Thus, conventional knowledge transfer methods merely allow data from a single auxiliary dataset to be incorporated into data classification, limiting the amount of external data capable of being used during classification.
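The limitation of the “bag of words” approach can be illustrated with a minimal sketch. The category names, the auxiliary text and the question below are invented for illustration, and the overlap score is only one simple way of comparing word counts; none of them is taken from any particular conventional system. Because the approach counts words without regard to meaning, the word “focus” contributes equally to a car category and a camera category.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its words, discarding order and context."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical auxiliary text for two categories (assumption, illustration only).
AUXILIARY_TEXT = {
    "cars": "the focus engine is not starting after the battery died",
    "cameras": "the lens focus is not sharp and autofocus is slow",
}


def overlap_scores(question, auxiliary_text):
    """Score each category by the word occurrences it shares with the question."""
    question_bag = bag_of_words(question)
    return {
        category: sum((question_bag & bag_of_words(text)).values())
        for category, text in auxiliary_text.items()
    }


# "focus" is counted the same way for both categories, so the scores tie.
scores = overlap_scores("My focus is not working", AUXILIARY_TEXT)
print(scores)
```

In this sketch both categories receive the same score, so word counts alone give no basis for choosing between the two meanings of “focus.”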
Thus, what is needed is a system and method for incorporating knowledge from multiple heterogeneous auxiliary datasets into classification of input data.