The present invention relates to information categorization. More particularly, the present invention relates to multi-class, multi-label information categorization.
Information categorization is the process of classifying information samples into categories or classes. By way of example, text categorization is the process of classifying a text document into a category, such as “politics,” “business” or “sports,” based on the document's content. When used in connection with a speech recognition device, information categorization can be used, for example, by a telephone network provider to automatically determine the purpose of a telephone call received from a customer. If the customer says, “I would like to charge this call to my credit card,” the system could automatically recognize that this is a calling-card request and process the call accordingly. Note that the information is categorized “automatically” in that human input is not required to make the decision. Although this example involves a speech-categorization problem, a text-based system can be used if the customer's spoken message is passed through a speech recognizer.
It is known that an information categorization algorithm can “learn,” using information samples, to perform text-categorization tasks, such as the ones described above. For example, a document might be classified as either “relevant” or “not relevant” with respect to a pre-determined topic. Many sources of textual data, such as Internet news feeds, electronic mail and digital libraries, include different topics, or classes, and therefore pose a “multi-class” categorization problem.
Moreover, in multi-class problems, a document may be relevant to several different classes. For example, a news article may be relevant to “politics” and “business.” Telephone call-types are also not mutually exclusive (i.e., a call can be both “collect” and “person-to-person”).
One approach to multi-class, multi-label information categorization is to break the task into disjoint binary categorization problems, one for each class. To classify a new information sample, such as a document, all the binary classifiers are applied and the predictions are combined into a single decision. The end result can be, for example, a list of the classes to which the document probably belongs, or a ranking of possible classes. Such an approach, however, can ignore any correlation that might exist between different classes, because each binary classifier is trained without regard to the others. As a result, the information categorization is less effective and/or efficient than may be desired.
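By way of illustration only, the binary decomposition described above might be sketched in Python as follows. The per-class classifier here is a deliberately simple word-count scorer chosen for brevity; it is an assumption of this sketch, as are the function names `train_binary`, `train_one_per_class` and `classify`, none of which come from the art being described. Note that each class is trained in isolation, which is why correlations between classes go unused.

```python
from collections import Counter


def train_binary(docs, is_positive):
    """Toy binary classifier (hypothetical): each word scores +1 for every
    positive document containing it and -1 for every negative one."""
    scores = Counter()
    for words, positive in zip(docs, is_positive):
        for word in set(words):
            scores[word] += 1 if positive else -1
    return scores


def train_one_per_class(docs, label_sets, classes):
    """Train one independent binary problem per class."""
    return {
        c: train_binary(docs, [c in labels for labels in label_sets])
        for c in classes
    }


def classify(models, words):
    """Apply every binary classifier and combine the results into
    (a) a set of predicted classes and (b) a ranking of all classes."""
    totals = {
        c: sum(model[w] for w in set(words)) for c, model in models.items()
    }
    predicted = {c for c, total in totals.items() if total > 0}
    ranking = sorted(totals, key=totals.get, reverse=True)
    return predicted, ranking
```

Trained on a handful of labeled documents, `classify` returns both forms of output mentioned above: the set of classes whose binary classifier voted "yes" and an ordering of all classes by score.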
In view of the foregoing, it can be appreciated that a substantial need exists for an information categorization method and apparatus that is directed to the multi-class, multi-label problem and addresses the problems discussed above.
The disadvantages of the art are alleviated to a great extent by a method and apparatus for multi-class, multi-label information categorization. A weight is assigned to each information sample in a training set, the training set containing a plurality of information samples, such as text documents, and associated labels. A base hypothesis is determined to predict which labels are associated with a given information sample. The base hypothesis may predict whether or not each label is associated with the information sample, or may predict the likelihood that each label is associated with the information sample. In the case of a document, the base hypothesis may evaluate words in each document to determine one or more words that predict the associated labels.
When a base hypothesis is determined, the weight assigned to each information sample in the training set is modified based on the base hypothesis predictions. For example, the relative weight assigned to an information sample may be decreased if the labels associated with that information sample are correctly predicted by the base hypothesis. These actions are repeated to generate a number of base hypotheses which are combined to create a combined hypothesis. An un-categorized information sample can then be categorized with one or more labels in accordance with the combined hypothesis. Such categorization may include predicting which labels are associated with each information sample or ranking possible labels associated with each information sample.
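The weighting, base-hypothesis and combination steps summarized above may be illustrated by the following minimal Python sketch. It is offered under several assumptions that are not taken from the foregoing text: weights are kept per (sample, label) pair rather than per sample, in the style of AdaBoost.MH-type boosting; each base hypothesis tests for the presence of a single word and outputs a real-valued confidence per label; and the smoothing term `eps` and all function names are hypothetical.

```python
import math


def fit_stump(docs, Y, D, vocab, labels, eps):
    """Choose the word test with the smallest normalizer Z under weights D.
    Branch 0 means the word is absent, branch 1 means it is present."""
    best = None
    for word in vocab:
        pos = [{l: 0.0 for l in labels} for _ in (0, 1)]  # label present
        neg = [{l: 0.0 for l in labels} for _ in (0, 1)]  # label absent
        for i, words in enumerate(docs):
            b = int(word in words)
            for l in labels:
                (pos if Y[i][l] > 0 else neg)[b][l] += D[i][l]
        z = 2 * sum(math.sqrt(pos[b][l] * neg[b][l])
                    for b in (0, 1) for l in labels)
        if best is None or z < best[0]:
            # Confidence per branch and label; the sign is the prediction.
            conf = [{l: 0.5 * math.log((pos[b][l] + eps) / (neg[b][l] + eps))
                     for l in labels} for b in (0, 1)]
            best = (z, word, conf)
    return best[1], best[2]


def boost(docs, label_sets, labels, rounds=4):
    """Repeatedly fit a base hypothesis and re-weight the training set."""
    docs = [set(d) for d in docs]
    vocab = sorted(set().union(*docs))
    Y = [{l: (1 if l in s else -1) for l in labels} for s in label_sets]
    n, k = len(docs), len(labels)
    D = [{l: 1.0 / (n * k) for l in labels} for _ in docs]  # uniform start
    eps = 1.0 / (n * k)  # smoothing (assumed value)
    ensemble = []
    for _ in range(rounds):
        word, conf = fit_stump(docs, Y, D, vocab, labels, eps)
        ensemble.append((word, conf))
        # Correctly predicted pairs lose weight; mispredicted ones gain it.
        for i, words in enumerate(docs):
            c = conf[int(word in words)]
            for l in labels:
                D[i][l] *= math.exp(-Y[i][l] * c[l])
        total = sum(w for row in D for w in row.values())
        for row in D:
            for l in labels:
                row[l] /= total
    return ensemble


def categorize(ensemble, words, labels):
    """Combined hypothesis: sum the base confidences for each label."""
    words = set(words)
    score = {l: sum(conf[int(w in words)][l] for w, conf in ensemble)
             for l in labels}
    predicted = {l for l in labels if score[l] > 0}
    ranking = sorted(labels, key=lambda l: score[l], reverse=True)
    return predicted, ranking
```

In this sketch, `categorize` returns both forms of output mentioned above: the set of labels with a positive combined score and a ranking of all labels by that score. Because every base hypothesis emits a confidence for all labels at once, a single word test can carry evidence about several labels, unlike fully independent per-class classifiers.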
With these and other advantages and features of the invention that will become hereinafter apparent, the nature of the invention may be more clearly understood by reference to the following detailed description of the invention, the appended claims and to the several drawings attached herein.