In an electronic repository, imprecisely classified documents are lost documents, a drain on productivity. There are no universally accepted standards for classifying or categorizing documents. A class or category is a group, set, or kind sharing common attributes or a division within a system of classification. Categories vary from one industry to the next and from one organization to another. There are two types of categorization: flat, in which categories are independent of each other, and hierarchical, in which relations between the categories themselves are exploited by the system (e.g., “molecular biology” is a sub-category or sub-class of “biology”, but is also related to the category “chemistry”).
Classification and categorization schemes typically involve assigning labels to an object (where an object may be a document, or arbitrary co-occurrence data in a document or a vector in an arbitrary vector space and where a label is a descriptive or identifying word or phrase). We address the problem of assigning multiple labels to an object, where each label is taken among multiple (i.e., more than two) classes or categories. Although it may seem at first glance that this problem is similar to multi-class, single-label classification, it is both much less studied and quite different in nature. The problem of assigning multiple labels to a single object may be described in terms of document categorization, although it applies naturally to arbitrary objects (e.g., images, sensor signals, etc.).
Single-label classification also goes by the name of discrimination, and may be seen as a way to find the class that is best suited to a document. In a way, the essence (and limitation) of single-label classification is well represented by the semantics of the word “discriminate,” that is “to recognize a distinction between things”. On the other hand, multi-label classification is more concerned with identifying likeness between the document and (potentially) several classes. In the context of newswire stories, for example, labels are often overlapping, or may have a hierarchical structure. A story on Apple's iPod, for example, may be relevant to “computer hardware”, its sub-category “audio peripheral” as well as the “MP3 player” category. Accordingly, multi-label classification is more relevant to identifying likeness than distinction.
Current classification technology focuses on discrimination methods, for example: linear discriminants such as linear least squares, the Fisher linear discriminant, or Support Vector Machines (SVM); decision trees; K-nearest neighbors (KNN); neural networks, including multi-layer perceptrons (MLP) and radial basis function (RBF) networks; and probabilistic generative models based, e.g., on mixtures (typically Gaussian mixtures). In addition, some techniques have been proposed to address document categorization more specifically, such as Rocchio's, Naïve Bayes, or related probabilistic methods, as described, e.g., by Gaussier et al., “A hierarchical model for clustering and categorising documents”, in F. Crestani, M. Girolami and C. J. van Rijsbergen (eds), Advances in Information Retrieval—Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Lecture Notes in Computer Science 2291, Springer, pp. 229–247, 2002.
None of these classification techniques addresses the problem of assigning multiple labels to a single document or object, but virtually all of them can be adapted to do so, for example by using one of the following two techniques. The first consists of building a binary classifier (e.g., using SVM) for each class and then applying these classifiers independently, so that a document may receive any number of labels. The second applies to probabilistic methods that typically produce a posterior class probability P(c|d). Rather than assigning document d to the class c that has maximum probability, the alternative is to choose a threshold and assign the document to all classes whose posterior probability exceeds it.
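The two workarounds described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the function names, the posterior values, the threshold of 0.15, and the keyword tests that stand in for trained binary classifiers such as SVMs. It is not the method of any cited reference.

```python
from typing import Callable, Dict, List


def single_label(posterior: Dict[str, float]) -> str:
    """Conventional discrimination: return the one class with maximum P(c|d)."""
    return max(posterior, key=posterior.get)


def multi_label_by_threshold(posterior: Dict[str, float],
                             threshold: float) -> List[str]:
    """Second technique: keep every class whose posterior exceeds the threshold."""
    return sorted(c for c, p in posterior.items() if p > threshold)


def multi_label_by_binary_classifiers(
        doc: str, classifiers: Dict[str, Callable[[str], bool]]) -> List[str]:
    """First technique: run one independent yes/no classifier per class."""
    return sorted(c for c, accepts in classifiers.items() if accepts(doc))


# Hypothetical posteriors P(c|d) for a single newswire story d.
posterior = {
    "computer hardware": 0.45,
    "audio peripheral": 0.30,
    "MP3 player": 0.20,
    "politics": 0.05,
}

# Toy keyword tests standing in for trained per-class binary classifiers.
classifiers = {
    "computer hardware": lambda d: "device" in d.lower(),
    "MP3 player": lambda d: "mp3" in d.lower() or "ipod" in d.lower(),
    "politics": lambda d: "election" in d.lower(),
}

story = "Apple unveils a new iPod device"

best_class = single_label(posterior)
thresholded = multi_label_by_threshold(posterior, threshold=0.15)
independent = multi_label_by_binary_classifiers(story, classifiers)
```

In this sketch, `single_label` returns only one class, whereas the other two functions may return several. Note that both workarounds treat each class in isolation: neither the threshold nor the independent binary decisions take relations between categories into account.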
The inventors' co-pending application D/A0A25 addresses the problem of clustering documents using probabilistic models. Clustering and categorization can be seen as two sides of the same coin. They differ in that categorization is a supervised task, i.e., labels identifying categories are provided for a set of documents (the training set), whereas in clustering the aim is to organize unlabelled documents into clusters automatically, in an unsupervised way. The strength of the D/A0A25 model lies in its capacity to deal with hierarchies of clusters, based on soft assignments, while maintaining a distinction between document and word structures.
What is needed is a method that allows the assignment of objects to multiple categories or classes such that the number of categories may be larger than two (multi-class) and such that each object may be assigned to more than one category (multi-label).