The data classification problem is well known in the literature for its numerous applications in customer segmentation, customer targeting, and marketing. A supervised data classification problem is defined as follows. A set of records is given, each of which contains a set of feature variables and another predefined variable which is referred to as the class variable. This set of records is referred to as the training data. This training data is used to construct models in which the class variable may be predicted in a record in which the feature variables are known but the class variable is unknown. In order to perform this prediction, a model is constructed in which the feature variables are related to the class variable.
Numerous techniques are known which relate the feature variables to the class variable. In order to perform the classification, numerous techniques are known for creating the relationship between the feature variables and the class variables. The techniques for building the relationships include, for example, DNF (Disjunctive Normal Form) Rules, decision trees, and Bayes classifiers, among others, see, e.g., Agrawal R., Ghosh S., Inielinski T., Iyer B., and Swami A., “An Interval Classifier for Database Mining Applications,” Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada 1992; Apte C, Hong S. J., Lepre J., Prasad S., and Rosen B, “RAMP: Rules Abstraction for Modeling and Prediction,” IBM Research Report RC 20271, June 1995; Quinlan J. R., “Induction of Decision Trees,” Machine Learning, Volume 1, Number 1, 1986; Shafer J., Agrawal R., and Mehta M., “SPRINT: A Scaleable Parallel Classifier for Data Mining,” Proceedings of the 22nd VLDB Conference, Bombay, India, 1996; Mehta M., Agrawal R., and Rissanen J., “SLIQ: A Fast Scaleable Classifier for Data Mining,” Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France, March 1996, the disclosures of which are incorporated by reference herein.
In decision trees, which is the classification model technique to which the present invention is directed, a goal is to construct a hierarchical decomposition of the data. This hierarchical decomposition is used to create partitions of the data in which one of the class variables dominates. Tn order to predict the class variable for a record in which only the feature variables are known, the hierarchical partition in which the record lies is found. The dominant partition is used in order to predict the class variable. Decision trees are further discussed, for example, in J. C. Shafer, R. Agrawal, M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proc. of the 22th Int'l Conference on Very Large Databases, Mumbai (Bombay), India, September 1996; M. Mehta, R. Agrawal and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining,” Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996; and M. Mehta, J. Rissanen, and R. Agrawal, “DML-based Decision Tree Pruning,” Proc. of the 1 st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August, 1995, the disclosures of which are incorporated by reference herein.
In order to explain this further, the following example is presented. Consider a set of records comprising the following two feature variables: (i) age, and (ii) whether or not the person is a smoker; and the class variable comprising an indication of whether or not the person suffers from lung cancer. The distribution of records is such that smokers of higher age are more likely to suffer from lung cancer. A decision tree may partition the data in the following way. In the first phase, there are two branches formed; one corresponding to people who smoke, and another corresponding to people who do not. The branch corresponding to the people who smoke is more likely to have a higher percentage of lung cancer patients. At the second level, the tree is divided into two branches; one corresponding to the people with a higher age, and another corresponding to the people with a lower age. Thus, there are a total of four nodes at the lower level of the tree, and each of these nodes has a much lower number of records, but with different proportions of the class variable. Thus, for a given record, in which the class variable is not known, the class variable can be better predicted by finding the percentage of the class variables in the node in which the current record fits. Thus, this above-described method is one in which the decision trees are created by splits (i.e., division of the tree into branches) using a single variable.
However, it would be highly advantageous to have methodologies in which the splits are created by linear combinations of variables which discriminate between the different nodes effectively.