1. Technical Field
The present invention relates generally to data mining and, in particular, to a method for building space-splitting decision trees. The method may be employed in decision-tree based classification where the training set has a very biased data distribution.
2. Description of Related Art
Classification represents a very important data mining problem. Algorithms that build scalable classifiers were first described by J. C. Shafer et al., in “SPRINT: A Scalable Parallel Classifier for Data Mining”, Proceedings of the 22nd Very Large Database (VLDB) Conference, September 1996. The input database, also called the training set, consists of a number of records with a fixed number of attributes. Attributes whose underlying domain is totally ordered are called ordered attributes, whereas attributes whose underlying domain is not ordered are called categorical attributes. There is one distinguished categorical attribute which specifies the class label of each record. The objective of classification is to generate a concise and accurate description or model for each class label in the training set. The model so generated is used to classify future data sets for which the class labels are unknown.
According to the prior art, such as the aforementioned SPRINT algorithm, scalable methods have been created for efficiently learning a classifier using a large training set. However, these methods offer poor predictive accuracy when the training set has a very biased data distribution. For example, when only 1% of the records belong to the rare class, most learning algorithms, such as decision trees, will generate a trivial model that always predicts the majority class and reports an accuracy rate of 99%. However, users are often interested only in data cases of the biased (rare) class. Decision trees have been shown to be unsuitable for such tasks. Up-sampling the biased class can avoid building trivial models, but the benefits of up-sampling are limited. A decision tree that assigns an unknown instance a definite class label (“positive” or “negative”), or a probability based merely on the data distribution in the leaf nodes, usually has a low predictive accuracy for instances of the biased target class. Furthermore, the model can hardly discriminate among the majority data on the basis of their closeness to the target class. This causes problems in applications such as target marketing: a marketing department wants to send out promotions to 20% of the people in its database, while only 1% of the people in the database are recorded buyers. The remaining 19% have to be selected based on their closeness to the buyer class.
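The accuracy figure above can be illustrated with a minimal sketch. The data set is hypothetical (1,000 records, of which 1% are buyers, matching the 1% figure in the marketing example), and the "model" is simply the trivial majority-class predictor described in the text, not any particular prior art implementation.

```python
# Hypothetical training labels: 1 = buyer (rare class), 0 = non-buyer.
labels = [1] * 10 + [0] * 990

# Trivial model: always predict the majority class (non-buyer).
predictions = [0] * len(labels)

# Overall accuracy looks excellent because 99% of records are non-buyers.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the rare class: fraction of actual buyers the model identifies.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

Because every record receives the same prediction, such a model also provides no basis for ranking the non-buyers by their closeness to the buyer class.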
Additionally, prior art decision trees are usually very large, which hampers the interpretability of the model. Other methods, such as those described by Brodley et al., in “Multivariate Decision Trees”, Machine Learning, Vol. 19, pp. 45-77, 1995, try to build compact decision trees using multivariate splitting conditions. However, these methods are not applicable to large training sets because performing a multivariate partition often consumes far more computation time and memory, which may be prohibitive for large data sets.
Another shortcoming of the decision tree is that its induction, being a “greedy” process, sometimes fails to discover all the patterns in the data.
Foreground    Background    Clear?
white         black         yes
white         black         yes
white         white         no
silver        white         no
silver        white         no
silver        white         no
silver        white         no
light gray    black         yes
light gray    white         no
black         light grey    yes
Consider the training set above, where “Clear?” is the class label. A prior art decision tree such as that referenced above always makes a split on the “Background” attribute, and partitions the training set into three leaf nodes, each made up of data cases from a single class. However, this seemingly perfect tree totally ignores the “Foreground” attribute, which also provides valuable information about the class distribution. For instance, a silver foreground always implies that the instance is not clear. Given an instance (silver, light grey), the decision tree predicts Yes because its background is “light grey”, although the instance is much closer to the pattern (silver, white), which forms a No pattern with strong support.
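The observations in this example can be checked with a short sketch. The helper names `class_counts` and `predict_by_background` are hypothetical, and the single-split prediction rule is only an assumed stand-in for the prior art tree described above. Attribute values are spelled as in the training set.

```python
from collections import Counter, defaultdict

# Training set from the example: (foreground, background, clear)
rows = [
    ("white", "black", "yes"),
    ("white", "black", "yes"),
    ("white", "white", "no"),
    ("silver", "white", "no"),
    ("silver", "white", "no"),
    ("silver", "white", "no"),
    ("silver", "white", "no"),
    ("light gray", "black", "yes"),
    ("light gray", "white", "no"),
    ("black", "light grey", "yes"),
]

def class_counts(rows, column):
    """Count class labels for each distinct value of the given column."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[column]][row[2]] += 1
    return {value: dict(c) for value, c in counts.items()}

by_foreground = class_counts(rows, 0)
by_background = class_counts(rows, 1)

def predict_by_background(instance):
    """Assumed single-split tree: return the majority class of the background leaf."""
    counts = by_background[instance[1]]
    return max(counts, key=counts.get)

# Each background value yields a pure leaf, so the split looks perfect.
print(by_background)
# Yet the foreground also carries signal: silver is always "no".
print(by_foreground["silver"])                          # {'no': 4}
# The background-only tree still predicts "yes" for a silver foreground.
print(predict_by_background(("silver", "light grey")))  # yes
```

The "light grey" leaf that decides this prediction is supported by a single record, whereas the silver foreground pattern is supported by four.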
Thus, there is a need for a method for building a compact decision tree model on a data set. The method must be scalable to deal with large amounts of data frequently found in market analysis.
There is also a need for a method which provides a meaningful model when the underlying data set is very biased and the data records of the rare class are more important to the users. Furthermore, there is a need for a method which can efficiently score the data based on the closeness of the data to the target class.