1. Field of the Invention
The present invention relates generally to data mining and more specifically to a classifier and inducer used for data mining.
2. Related Art
Many data mining tasks require classification of data into classes. Typically, a classifier classifies the data into the classes. For example, loan applications can be classified into either "approve" or "disapprove" classes. The classifier provides a function that maps (classifies) a data item (instance or record; records and instances are used interchangeably hereinafter) into one of several predefined classes. More specifically, the classifier predicts one attribute of a set of data given one or more attributes. For example, in a database of iris flowers, a classifier can be built to predict the type of iris (iris-setosa, iris-versicolor or iris-virginica) given the petal length, sepal length and sepal width. The attribute being predicted (in this case, the type of iris) is called the label, and the attributes used for prediction are called the descriptive attributes.
A classifier is generally constructed by an inducer. The inducer is an algorithm that builds the classifier from a training set. The training set consists of records with labels. FIG. 1 shows how an inducer constructs a classifier.
Specifically, FIG. 1 includes a training set 110, an inducer 120 and a classifier 130. The inducer 120 receives the training set 110 and constructs the classifier 130.
Once the classifier is built, its structure can be used to classify unlabeled records as shown in FIG. 2. Specifically, FIG. 2 includes records without labels (unlabeled records) 210, a classifier 220 and labels 230. The classifier 220 receives the unlabeled records 210 and classifies the unlabeled records 210.
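The relationship among the training set, the inducer, and the classifier shown in FIGS. 1 and 2 can be sketched in Python as follows. This is only an illustrative sketch: the function name `induce_majority_classifier`, the record layout, and the choice of a trivial majority-label inducer are assumptions made for the example and are not taken from the figures.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]                 # descriptive attributes of one record
Classifier = Callable[[Record], str]    # maps an unlabeled record to a label


def induce_majority_classifier(training_set: List[Tuple[Record, str]]) -> Classifier:
    """Hypothetical inducer: builds a classifier that always predicts
    the most frequent label seen in the training set."""
    label_counts = Counter(label for _, label in training_set)
    majority_label = label_counts.most_common(1)[0][0]

    def classifier(record: Record) -> str:
        # This trivial classifier ignores the descriptive attributes.
        return majority_label

    return classifier


# Training set: records with labels (made-up values for illustration).
training_set = [
    ({"petal_length": "short"}, "iris-setosa"),
    ({"petal_length": "long"}, "iris-virginica"),
    ({"petal_length": "long"}, "iris-virginica"),
]

# The inducer receives the training set and constructs the classifier.
classifier = induce_majority_classifier(training_set)

# The classifier receives unlabeled records and produces labels.
unlabeled_records = [{"petal_length": "short"}, {"petal_length": "long"}]
labels = [classifier(record) for record in unlabeled_records]
```

A practical inducer would use the descriptive attributes; the point here is only the division of roles between inducer and classifier.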
Inducers require a training set, which is a database table containing attributes, one of which is designated as the class label. The label attribute type must be discrete (e.g., binned values, character string values, or integers). FIG. 3 shows several records from a sample training set.
Once a classifier is built, it can classify new records as belonging to one of the classes. These new records must be in a table that has the same attributes as the training set; however, the table need not contain the label attribute. For example, if a classifier for predicting iris_type is built, the classifier is applied to records containing only the descriptive attributes. The classifier then provides a new column with the predicted iris type.
In a marketing campaign, for example, a training set can be generated by running the campaign in one city and generating label values according to the responses in that city. A classifier can then be induced, and campaign mail sent only to those people in a larger population, such as the entire U.S., whom the classifier labels as likely to respond. Such targeted mailing can yield substantial cost savings.
A well known classifier is the Decision-Tree classifier. The Decision-Tree classifier assigns each record to a class. The Decision-Tree classifier is induced (generated) automatically from data. The data, which is made up of records and a label associated with each record, is called the training set.
Decision-Trees are commonly built by recursive partitioning. A univariate (single attribute) split is chosen for the root of the tree using some criterion (e.g., mutual information, gain-ratio, gini index). The data is then divided according to the test, and the process repeats recursively for each child. After a full tree is built, a pruning step is executed which reduces the tree size.
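The recursive partitioning described above can be sketched as follows for discrete attributes, using mutual information (information gain) as the split criterion. This is a minimal sketch under stated assumptions: the function names, the dictionary-based tree layout, and the toy records are hypothetical, multiway splits on attribute values are assumed, and the pruning step is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(records, labels, attribute):
    """Mutual information between one attribute and the label."""
    partitions = {}
    for record, label in zip(records, labels):
        partitions.setdefault(record[attribute], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def build_tree(records, labels, attributes):
    """Choose the best univariate split, divide the data, and recurse."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf node: a single label
    best = max(attributes, key=lambda a: information_gain(records, labels, a))
    if information_gain(records, labels, best) <= 0:
        return majority  # no attribute is informative; stop splitting
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in set(record[best] for record in records):
        subset = [(r, l) for r, l in zip(records, labels) if r[best] == value]
        children[value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return {"attribute": best, "children": children, "default": majority}


def classify(tree, record):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(record[tree["attribute"]], tree["default"])
    return tree


# Made-up toy data for illustration only.
records = [
    {"odor": "foul", "cap": "flat"},
    {"odor": "foul", "cap": "round"},
    {"odor": "none", "cap": "flat"},
    {"odor": "none", "cap": "round"},
]
labels = ["poisonous", "poisonous", "edible", "edible"]
tree = build_tree(records, labels, ["odor", "cap"])
prediction = classify(tree, {"odor": "foul", "cap": "flat"})
```

In this toy data the odor attribute separates the classes perfectly, so it is chosen for the root and no further splits are needed.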
Generally, Decision-Trees are preferred where serial tasks are involved, i.e., once the value of a key feature is known, dependencies and distributions change. Decision-Trees are also preferred where segmenting data into sub-populations gives easier subproblems, and where there are key features, i.e., some features are more important than others. For example, in a mushroom dataset (a commonly used benchmark dataset), the odor attribute alone correctly predicts whether a mushroom is edible or poisonous with about 98% accuracy.
Although Decision-Tree classifiers are fast and comprehensible, current induction methods based on recursive partitioning suffer from a fragmentation problem. As each split is made, the data is split based on the test and after several levels, there is usually very little data on which to base decisions.
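The fragmentation effect can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every split divides the data evenly among a fixed number of children; real trees are rarely balanced, and the function name and figures are hypothetical.

```python
def average_records_per_node(num_records, branching_factor, depth):
    """Average number of training records reaching a node at the given
    depth, assuming every split divides the data evenly (an idealization)."""
    return num_records / (branching_factor ** depth)


# With 1024 training records and binary splits, the data available
# per node halves at every level of the tree.
records_by_depth = {
    depth: average_records_per_node(1024, 2, depth) for depth in (0, 5, 10)
}
```

Under these assumptions, a node ten levels deep would base its decision on a single record on average.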
Another well known classifier is the Naive-Bayes classifier. The Naive-Bayes classifier uses Bayes rule to compute the probability of each class given an instance, assuming attributes are conditionally independent given a label.
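In symbols, Bayes rule combined with the conditional independence assumption gives the following proportionality. The notation is introduced here for illustration only: C denotes the label, A_1 through A_n denote the descriptive attributes, and a_1 through a_n denote their values in the instance being classified.

```latex
P(C = c \mid A_1 = a_1, \ldots, A_n = a_n)
  \;\propto\; P(C = c) \prod_{i=1}^{n} P(A_i = a_i \mid C = c)
```

The predicted class is the value of c for which the right-hand side is largest.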
The Naive-Bayes classifier requires estimation of the conditional probabilities for each attribute value given the label. For discrete data, because only a few parameters need to be estimated, the estimates tend to stabilize quickly and more data does not change the model much. With continuous attributes, discretization is likely to form more intervals as more data is available, thus increasing the representation power. However, even with continuous data, the discretization is usually global and cannot take into account attribute interactions.
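The estimation of conditional probabilities for discrete attributes can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `induce_naive_bayes` and the toy loan records are hypothetical, and add-one (Laplace) smoothing is an assumption introduced here so that attribute values unseen for a label do not yield a zero probability.

```python
import math
from collections import Counter, defaultdict


def induce_naive_bayes(records, labels):
    """Hypothetical inducer: counts label and attribute-value frequencies
    and returns a classifier that applies Bayes rule with the
    conditional independence assumption."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute, label) -> value counts
    attribute_values = defaultdict(set)   # attribute -> distinct values seen
    for record, label in zip(records, labels):
        for attribute, value in record.items():
            value_counts[(attribute, label)][value] += 1
            attribute_values[attribute].add(value)

    def classify(record):
        best_label, best_score = None, -math.inf
        for label, count in label_counts.items():
            # Log prior plus the sum of log conditional probabilities.
            score = math.log(count / len(labels))
            for attribute, value in record.items():
                # Add-one smoothing (an assumption of this sketch).
                numerator = value_counts[(attribute, label)][value] + 1
                denominator = count + len(attribute_values[attribute])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify


# Made-up toy data for illustration only.
records = [
    {"income": "high", "debt": "low"},
    {"income": "high", "debt": "high"},
    {"income": "low", "debt": "high"},
    {"income": "low", "debt": "low"},
]
labels = ["approve", "approve", "disapprove", "disapprove"]
classifier = induce_naive_bayes(records, labels)
prediction = classifier({"income": "high", "debt": "low"})
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.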
Generally, Naive-Bayes classifiers are preferred when there are many irrelevant features. The Naive-Bayes classifiers are very robust to irrelevant attributes, and classification takes into account evidence from many attributes to make the final prediction, a property that is useful in many cases where there is no "main effect." Also, the Naive-Bayes classifiers are optimal when the assumption that attributes are conditionally independent holds, e.g., in medical practice. On the downside, the Naive-Bayes classifiers require making strong independence assumptions. When these assumptions are violated, the achievable accuracy may asymptote early and will not improve much as the database size increases.
FIG. 4 shows learning curves for the Naive-Bayes and Decision-Tree classifiers (a C4.5 type of decision tree inducer was used) on large datasets from the UC Irvine repository (Murphy & Aha 1996). The learning curves show how the accuracy changes as more instances (training data) are shown to the inducer. The accuracy is computed based on the data not used for training, so it represents the true generalization accuracy. Each point was computed as an average of 20 runs of the algorithm, and 20 intervals were used. The error bars show 95% confidence intervals on the accuracy based on the left-out sample. The top three graphs show datasets where the Naive-Bayes outperformed the Decision-Tree, and the lower six graphs show datasets where the Decision-Tree outperformed the Naive-Bayes. In most cases, it is clear that even with much more data, the learning curves will not cross. While it is well known that no algorithm can outperform all others in all cases, in practice, some algorithms are more successful than others.
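A learning curve of the kind described for FIG. 4 could be computed along the following lines. This is a hedged sketch and not the procedure actually used to produce the figure: the function names, the random shuffling scheme, the trivial majority-label inducer, and the toy data are all assumptions, and the confidence intervals are omitted. Each training size must be smaller than the dataset so that some data is left out for testing.

```python
import random
from collections import Counter


def induce_majority(records, labels):
    """Placeholder inducer (hypothetical): always predicts the majority label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda record: majority


def learning_curve(records, labels, inducer, train_sizes, num_runs=20, seed=0):
    """For each training size, average the accuracy on the left-out
    records over several random train/test divisions."""
    rng = random.Random(seed)
    indices = list(range(len(records)))
    curve = []
    for size in train_sizes:
        accuracies = []
        for _ in range(num_runs):
            rng.shuffle(indices)
            train, test = indices[:size], indices[size:]
            classifier = inducer(
                [records[i] for i in train], [labels[i] for i in train]
            )
            # Accuracy is measured only on data not used for training.
            correct = sum(classifier(records[i]) == labels[i] for i in test)
            accuracies.append(correct / len(test))
        curve.append((size, sum(accuracies) / num_runs))
    return curve


# Made-up toy data for illustration only.
records = [{"id": str(i)} for i in range(10)]
labels = ["a"] * 7 + ["b"] * 3
curve = learning_curve(records, labels, induce_majority, train_sizes=[2, 4, 6])
```

Any inducer with the same call signature can be substituted for `induce_majority` to compare how different classifiers improve as the training set grows.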