Classification is the process by which every item in a set of items is assigned a unique class label from a predefined set of class labels. Items could be any real-life entities such as documents, people, products, etc., which can be modeled as having a fixed set of attributes or features. In this document, this fixed set of features is referred to as the dictionary. The labels could be any meaningful abstraction for the entity being classified. For example, {rich, poor} could be the set of class labels for the entity ‘person’.
Algorithms for supervised classification (as defined in the book ‘Machine Learning’, Tom Mitchell, 1997, McGraw Hill, pp. 54, 182-183, 191-198) have been used in a variety of fields where the similarity between the items to be classified can be inferred from a classified example set. These classification algorithms learn to map the features of the already given examples to the corresponding classes and classify new items based on the learned mapping.
The naïve Bayesian approach is a widely used supervised classification algorithm. It assumes that the features which represent the items occur independently of each other. Two different naïve Bayesian models used in practice are the multi-variate Bernoulli model and the multinomial model. The multi-variate Bernoulli model uses a binary vector representation for an item, where a “1” denotes the presence of a feature and a “0” its absence. The multinomial model uses the frequency of occurrence of a feature in a class for probability calculations. For the classification of a new item, both models calculate the posterior probability that a class would have generated the given item.
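The difference between the two models can be illustrated with a minimal sketch. The function names (`train`, `log_scores`, `classify`), the toy training data and the use of add-one (Laplace) smoothing are assumptions made for illustration only and are not taken from the cited references. The scores computed are unnormalized log posteriors (log prior plus log likelihood), which is sufficient for choosing the most probable class.

```python
import math
from collections import Counter

def train(examples, dictionary, model="multinomial"):
    """examples: list of (feature_list, class_label) pairs (hypothetical format)."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = {c: Counter() for c in class_counts}
    for features, label in examples:
        known = [f for f in features if f in dictionary]
        if model == "bernoulli":
            known = set(known)  # Bernoulli: record presence only, not frequency
        feature_counts[label].update(known)
    return class_counts, feature_counts

def log_scores(item, dictionary, class_counts, feature_counts, model="multinomial"):
    """Unnormalized log posterior of each class for the given item."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior of the class
        if model == "multinomial":
            # Each occurrence of a feature contributes one factor.
            denom = sum(feature_counts[c].values()) + len(dictionary)
            for f in item:
                if f in dictionary:
                    score += math.log((feature_counts[c][f] + 1) / denom)
        else:
            # Every dictionary feature contributes: p if present, 1 - p if absent.
            present = set(item)
            for f in dictionary:
                p = (feature_counts[c][f] + 1) / (n_c + 2)
                score += math.log(p if f in present else 1 - p)
        scores[c] = score
    return scores

def classify(item, dictionary, examples, model="multinomial"):
    class_counts, feature_counts = train(examples, dictionary, model)
    scores = log_scores(item, dictionary, class_counts, feature_counts, model)
    return max(scores, key=scores.get)

# Toy data, invented for this sketch.
dictionary = {"goal", "match", "team", "vote", "election", "party"}
examples = [
    (["goal", "match", "goal"], "sports"),
    (["match", "team"], "sports"),
    (["vote", "election"], "politics"),
    (["election", "vote", "party"], "politics"),
]
print(classify(["goal", "team"], dictionary, examples, "multinomial"))
print(classify(["goal", "team"], dictionary, examples, "bernoulli"))
```

Note that in the Bernoulli variant the absence of a dictionary feature also influences the score, whereas the multinomial variant considers only the features that occur in the item.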
A study showing that the multinomial model typically outperforms the multi-variate Bernoulli model is presented by Andrew McCallum & Kamal Nigam in “A Comparison of Event Models for Naive Bayes Text Classification” in AAAI/ICML-98, Workshop on Learning for Text Categorization, Technical Report WS-98-05, AAAI Press, 1998.
Entropy is a measure of the state of randomness of distribution within a system, and has been used to model data items outside of the field of thermodynamics. See, for example, C. E. Shannon, “A mathematical theory of communication”, Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948. (At the time of writing, a reprint version of this paper is available from the Website at ‘cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.ps.gz’). Background information can also be found in David Feldman, “A Brief Introduction to: Information Theory, Excess Entropy and Computational Mechanics”, April 1998, University of California, Davis Calif., U.S.A.
Entropy has also been used for classification. Any change in the state of a system has a resulting effect on its entropy, and it has been suggested that entropy calculations can be used to model the distribution of a variable in the field of text classification. The underlying principle of these probability distribution estimation techniques is that, in the absence of any external knowledge, one should prefer a uniform distribution, which corresponds to maximum entropy.
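As a brief illustration of why a uniform distribution corresponds to maximum entropy, the following sketch computes the Shannon entropy, H = -Σ p·log2(p), of three distributions over four outcomes. The function name `shannon_entropy` and the example distributions are assumptions chosen for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution; outcomes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # no outcome preferred
skewed = [0.7, 0.1, 0.1, 0.1]        # one outcome favoured
certain = [1.0, 0.0, 0.0, 0.0]       # outcome fully determined

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, shannon_entropy(dist))
```

The uniform distribution yields the largest value (log2 of the number of outcomes, here 2 bits), the fully determined distribution yields zero, and any other distribution over the same outcomes falls in between.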
The technique described in Kamal Nigam, John Lafferty, Andrew McCallum, “Using Maximum Entropy for Text Classification”, IJCAI-99, Workshop on Machine Learning for Information Filtering, 1999, uses a labeled training set of documents to establish a set of constraints for the model. These constraints characterize the class-specific expectations for the distribution. Using the concept of maximum entropy and these constraints for a given document, the technique of Nigam et al. estimates the conditional probability distribution of the classes using iterative scaling algorithms and classifies the document accordingly. The experiments on several text data sets done by Nigam, Lafferty and McCallum show that the performance of maximum entropy is sometimes better but also sometimes worse than naive Bayesian classification. This technique is also sensitive to feature selection and can perform badly when features are poorly selected.
Many of the classification methods disclosed in published literature are for assigning class labels to a set of data items, and are not specifically designed for populating an existing concept hierarchy. Concept hierarchies can be visualized as tree structures in which the child-to-parent relationship is fixed and well defined. For example, “cars” and “trucks” are children of “vehicle”. Generally “IS-A” or “A-Kind-Of” relationships are maintained, where the child is a kind of the parent. Each node in a concept hierarchy has a label whose prefix is the label of its parent. Sibling classes are the set of classes at a single level of the hierarchy which have a common parent (i.e. are immediate descendants of a common ancestor node at the next level in the hierarchy). Concept hierarchies have a special node (root node) which is the ancestor of every other node in the hierarchy. In this document, data items within the hierarchy are described using the example of documents within a hierarchical set of document classes, and so any reference to ‘documents’ hereafter can be generalized to any data items including any type of media.
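The prefix-based labeling of a concept hierarchy can be sketched as follows. The “/” separator, the helper names (`parent`, `siblings`, `ancestors`) and the example labels are assumptions made for illustration; the text above requires only that the label of a node has the label of its parent as a prefix.

```python
def parent(label, sep="/"):
    """Label of the parent node, or None if the label denotes the root."""
    if sep not in label:
        return None
    return label.rsplit(sep, 1)[0]

def siblings(label, all_labels, sep="/"):
    """Other classes that share the same parent as the given class."""
    p = parent(label, sep)
    return sorted(l for l in all_labels if l != label and parent(l, sep) == p)

def ancestors(label, sep="/"):
    """Labels on the path from the parent of the node up to the root."""
    path = []
    p = parent(label, sep)
    while p is not None:
        path.append(p)
        p = parent(p, sep)
    return path

# Hypothetical hierarchy: "vehicle" is the root node.
labels = ["vehicle", "vehicle/car", "vehicle/truck", "vehicle/car/sedan"]

print(parent("vehicle/car"))             # the parent label is a prefix of the child label
print(siblings("vehicle/car", labels))   # classes sharing the parent "vehicle"
print(ancestors("vehicle/car/sedan"))    # path ends at the root node
```

Because the parent label is recoverable from the child label alone, the tree structure does not need to be stored separately in this representation.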
U.S. Pat. No. 6,233,575 (issued to Agrawal et al. on May 15, 2001) describes a system and process for organizing a large text database into a hierarchy of topics and for maintaining this organization as documents are added and deleted and as the topic hierarchy changes. Given sample documents belonging to various nodes in the topic hierarchy, the tokens (terms, phrases, dates, or other usable features in the document) that are most useful at each internal decision node for the purpose of routing new documents to the children of that node are automatically detected. Using feature terms, statistical models are constructed for each topic node. The models are used in an estimation technique to assign topic paths to new unlabeled documents. U.S. Pat. No. 6,233,575 does not classify documents using entropy.
Populating hierarchical taxonomies has become an important problem in maintaining product catalogues, knowledge bases, etc. Currently, most concept hierarchies are still manually maintained.
There is a need in the art for a solution for populating a hierarchically organized set of classified data items with new data items, which at least mitigates one or more problems inherent in known classification methods. There is also a need for a solution for determining the degree of confidence in the classification of data items within a set of hierarchically organized classified data items.