This invention relates to methods and apparatus for classifying an instance (i.e., a data item or a record) automatically into one or more classes that are selected from a set of potential classes.
The volume of machine-readable data that currently is available, for example on the Internet, is growing at a rapid rate. In order to realize the potentially huge benefits of computer access to this data, the data must be classified into categories (or classes). Traditionally, such data has been classified by humans. As the amount of data has increased, however, manual data interpretation has become increasingly impractical. Recently, classifiers have been implemented to classify data automatically into one or more potential classes.
A classifier provides a function that maps (or classifies) an instance into one of several predefined potential classes. In particular, a classifier predicts one attribute of a set of data given one or more attributes. The attribute being predicted is called the label, and the attributes used for prediction are called descriptive attributes. A classifier typically is constructed by an inducer, which is an algorithm that builds the classifier from a training set. The training set consists of records containing attributes, one of which is the class label. After a classifier has been built, its structure may be used to classify unlabeled records as belonging to one or more of the potential classes.
Many different classifiers have been proposed.
For example, a Decision Tree classifier is a well-known classifier. A Decision Tree classifier typically is built by a recursive partitioning inducing algorithm. A univariate (single attribute) split is chosen for the root of the decision tree using some criterion (e.g., mutual information, gain-ratio, or gini index). The data is then divided according to the test criterion, and the process repeats recursively for each child. After a full tree is built, a pruning step is executed to reduce the size of the tree. Generally, Decision Tree classifiers are preferred for serial classification tasks (i.e., once the value of a key feature is known, dependencies and distributions change). In addition, Decision Tree classifiers are preferred for classification tasks in which complexity may be reduced by segmenting data into sub-populations. Decision Tree classifiers also are preferred for classification tasks in which some features are more important than others. For example, in a mushroom dataset (a commonly used benchmark dataset), the odor attribute alone correctly predicts whether a mushroom is edible or poisonous with an accuracy of about 98%.
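The choice of a univariate split for the root can be illustrated with a minimal sketch that uses the Gini index as the criterion. The function names, the `label` key, and the four toy records are assumptions made for illustration only; the records are invented and are not the benchmark mushroom dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(records, attribute, label_key="label"):
    """Weighted Gini impurity after partitioning the records on one attribute."""
    partitions = {}
    for record in records:
        partitions.setdefault(record[attribute], []).append(record[label_key])
    n = len(records)
    return sum(len(part) / n * gini(part) for part in partitions.values())


def best_univariate_split(records, attributes, label_key="label"):
    """Pick the single attribute whose split leaves the lowest impurity."""
    return min(attributes, key=lambda a: split_impurity(records, a, label_key))


# Invented toy records (hypothetical, for illustration only).
records = [
    {"odor": "foul", "cap": "flat", "label": "poisonous"},
    {"odor": "foul", "cap": "round", "label": "poisonous"},
    {"odor": "none", "cap": "flat", "label": "edible"},
    {"odor": "none", "cap": "round", "label": "edible"},
]
root_attribute = best_univariate_split(records, ["odor", "cap"])
```

A full inducer would repeat this selection recursively on each partition and then prune the resulting tree; both steps are omitted here.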
Another well-known classifier is the Naïve Bayes classifier. The Naïve Bayes classifier uses Bayes rule to compute the probability of each class given an instance, assuming attributes are conditionally independent for a given class label. The Naïve Bayes classifier requires estimation of the conditional probabilities for each attribute value given the class label. Naïve Bayes classifiers are very robust to irrelevant attributes, and classification takes into account evidence from many attributes to make the final prediction, a property that is useful in many cases for which there is no “main effect.” Naïve Bayes classifiers also are preferred when the attributes are conditionally independent.
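The probability computation described above can be sketched as follows for categorical attributes. The add-one (Laplace) smoothing of the conditional probability estimates is an assumption added here to avoid zero probabilities; the function names and the toy records are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records, label_key="label"):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[label_key] for r in records)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for r in records:
        for attr, val in r.items():
            if attr == label_key:
                continue
            value_counts[(r[label_key], attr)][val] += 1
            values_seen[attr].add(val)
    return class_counts, value_counts, values_seen


def class_probabilities(model, instance):
    """Apply Bayes rule under the conditional independence assumption."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    log_scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for attr, val in instance.items():
            k = len(values_seen[attr])
            # Assumed add-one smoothing of P(value | class).
            score += math.log((value_counts[(c, attr)][val] + 1) / (n_c + k))
        log_scores[c] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# Invented toy records (hypothetical, for illustration only).
records = [
    {"odor": "foul", "cap": "flat", "label": "poisonous"},
    {"odor": "foul", "cap": "round", "label": "poisonous"},
    {"odor": "none", "cap": "flat", "label": "edible"},
    {"odor": "none", "cap": "round", "label": "edible"},
]
model = train_naive_bayes(records)
probs = class_probabilities(model, {"odor": "foul", "cap": "flat"})
```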
A neural network is another well-known classifier. A neural network is a multilayer, hierarchical arrangement of identical processing elements (or neurons). Each neuron may have one or more inputs but only one output. Each neuron input is weighted by a coefficient. The output of a neuron typically is a function of the sum of its weighted inputs and a bias value. This function, also referred to as an activation function, is typically a sigmoid function. In the hierarchical arrangement of neurons, the output of a neuron in one layer may be distributed as an input to one or more neurons in a next layer. A typical neural network may include three distinct layers: an input layer, an intermediate neuron layer, and an output neuron layer. The neural network is initialized and trained on known inputs having known output values (or classifications). Once the neural network is trained, it may be used to classify unknown inputs in accordance with the weights and biases determined during training.
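The neuron computation described above (a sigmoid applied to the weighted input sum plus a bias) can be sketched as a forward pass through one intermediate layer and one output layer. The function names and the example weights are hypothetical, and training is not shown.

```python
import math


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias):
    """Single output: activation of the weighted input sum plus the bias."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def layer_output(inputs, weight_rows, biases):
    """Every neuron in a layer receives the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> intermediate layer -> output layer."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)


# Hypothetical weights: two inputs, two intermediate neurons, one output neuron.
scores = forward(
    inputs=[1.0, 0.0],
    hidden_w=[[0.5, -0.5], [-1.0, 1.0]],
    hidden_b=[0.0, 0.0],
    output_w=[[1.0, -1.0]],
    output_b=[0.0],
)
```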
Still other classifiers have been proposed.
The invention features a novel multi-class classification approach that enables instances to be classified with high accuracy, even when the number of classes (or categories) is very large. In particular, under some conditions, the classification error rate of the invention depends substantially less on the number of potential classes than does the error rate of other, known classification approaches.
In one aspect, the invention features a method of classifying an instance into one or more classes that are selected from a set of potential classes. In accordance with this inventive method, a subset of two or more classes to which the instance is determined to most likely belong is selected from the set of potential classes. A second-stage classifier, which is referred to herein as a “scrutiny classifier,” is generated from a set of training records corresponding to a class set inclusive of the selected subset of classes, and is applied to the instance to identify at least one class to which the instance most likely belongs.
As used herein, the terms “instance,” “record,” and “data item” are intended to be synonymous.
Embodiments of the invention may include one or more of the following features.
In one embodiment, the scrutiny classifier is generated by a decision tree inducing algorithm (e.g., a C4.5 type decision tree inducing algorithm). In other embodiments, the scrutiny classifier may be generated by a different kind of inducing algorithm (e.g., a Naïve Bayes inducing algorithm or a neural network inducing algorithm).
The scrutiny classifier preferably is generated from the set of training records. In some embodiments, the scrutiny classifier is generated on-the-fly from a set of training records corresponding to the selected subset of classes. In other embodiments, the scrutiny classifier is generated beforehand in anticipation of the instance to be classified. In these other embodiments, the scrutiny classifier may be generated based upon an occurrence probability estimate for the inclusive class set. The scrutiny classifier may be generated from training records corresponding to an inclusive class set encompassing the selected subset of classes.
A classifier that is generated from a set of training records corresponding to two or more classes identified by the scrutiny classifier may be applied to the instance to identify at least one class to which the instance is determined to most likely belong.
The initial subset of classes may be selected based upon assignment to each of the potential classes a probability estimate of the instance belonging to the class. In some embodiments, the selected subset of classes may consist of a preselected number of potential classes having highest assigned probability estimates. In other embodiments, the selected subset of classes may consist of a number of potential classes having highest assigned probability estimates and a cumulative assigned probability estimate exceeding a preselected threshold.
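The two selection rules described above can be sketched as follows. The function names, the `min_classes` parameter (which keeps the subset at two or more classes), and the example probability estimates are assumptions made for illustration.

```python
def top_k_classes(probabilities, k):
    """Select a preselected number of classes with the highest estimates."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k]


def classes_above_cumulative_threshold(probabilities, threshold, min_classes=2):
    """Add classes in descending order of estimate until the cumulative
    estimate exceeds the threshold (and at least min_classes are selected)."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    selected, cumulative = [], 0.0
    for cls in ranked:
        selected.append(cls)
        cumulative += probabilities[cls]
        if cumulative > threshold and len(selected) >= min_classes:
            break
    return selected


# Hypothetical probability estimates assigned to four potential classes.
estimates = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
fixed_size_subset = top_k_classes(estimates, 2)
threshold_subset = classes_above_cumulative_threshold(estimates, 0.9)
```

With the threshold rule, the size of the subset adapts to how confident the estimates are: a sharply peaked distribution yields a small subset, while a flat one yields a larger subset.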
The probability estimates may be assigned to each potential class by applying to the instance a first-stage classifier, which is referred to herein as a “ballpark classifier.” The ballpark classifier may be generated from a set of training records corresponding to the entire set of potential classes. The ballpark classifier may be generated, for example, by a Naïve Bayes inducing algorithm, a decision tree inducing algorithm, a neural network inducing algorithm, or other inducing algorithm.
In some embodiments, the subset of classes may be selected based at least in part upon a prescribed misclassification cost.
In another aspect of the invention, a classification system includes a ballpark classifier and a scrutiny classifier. The ballpark classifier is configured to select from the set of potential classes a subset of two or more classes to which the instance is determined to most likely belong. The scrutiny classifier is configured to identify from the selected subset of classes at least one class to which the instance most likely belongs.
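The interaction of the two stages can be sketched as follows, with the scrutiny classifier generated on-the-fly from the training records of the selected subset. This is an illustrative sketch under stated assumptions, not the implementation of the invention: a small Naïve Bayes inducer with add-one smoothing stands in for both stages (a decision tree inducer could be substituted for the scrutiny stage), the subset is the `k` classes with the highest estimates, and all names and toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def induce_naive_bayes(records):
    """Stand-in inducer: builds a classifier from (attributes, label) pairs
    and returns a function mapping an instance to class probabilities."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for attrs, label in records:
        for attr, val in attrs.items():
            value_counts[(label, attr)][val] += 1
            values_seen[attr].add(val)
    total = sum(class_counts.values())

    def predict(instance):
        log_scores = {}
        for c, n_c in class_counts.items():
            score = math.log(n_c / total)
            for attr, val in instance.items():
                k = len(values_seen[attr]) or 1
                score += math.log((value_counts[(c, attr)][val] + 1) / (n_c + k))
            log_scores[c] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(weights.values())
        return {c: w / z for c, w in weights.items()}

    return predict


def classify_two_stage(instance, training_records, induce, k=2):
    # Stage 1: ballpark classifier built over the entire set of classes.
    ballpark = induce(training_records)
    estimates = ballpark(instance)
    subset = sorted(estimates, key=estimates.get, reverse=True)[:k]
    # Stage 2: scrutiny classifier built only from records of the subset.
    scrutiny = induce([(a, l) for a, l in training_records if l in subset])
    refined = scrutiny(instance)
    return max(refined, key=refined.get), subset


# Invented toy training records (hypothetical).
training = [
    ({"shape": "round", "color": "red"}, "apple"),
    ({"shape": "round", "color": "green"}, "apple"),
    ({"shape": "round", "color": "orange"}, "orange"),
    ({"shape": "round", "color": "orange"}, "orange"),
    ({"shape": "long", "color": "yellow"}, "banana"),
    ({"shape": "long", "color": "green"}, "banana"),
]
label, subset = classify_two_stage(
    {"shape": "round", "color": "red"}, training, induce_naive_bayes
)
```

Because the scrutiny classifier sees only the records of the selected classes, it can concentrate on the attributes that distinguish those few candidates from one another.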
In another aspect, the invention features a computer program residing on a computer-readable medium for causing a processor executing the computer program to classify an instance into one or more classes that are selected from a set of potential classes. The computer program comprises instructions to: select from the set of potential classes a subset of two or more classes to which the instance is determined to most likely belong; and apply to the instance a scrutiny classifier that is generated from a set of training records corresponding to a class set inclusive of the selected subset of classes to identify at least one class to which the instance most likely belongs.
Other features and advantages of the invention will become apparent from the following description, including the drawings and the claims.