Data mining is an emerging application of computer databases that involves the development of tools for analyzing large databases to extract useful information from them. As an example of data mining, customer purchasing patterns may be derived from a large customer transaction database by analyzing its transaction records. Such purchasing habits can provide valuable marketing information to retailers in displaying their merchandise or controlling the store inventory. Other applications of data mining include fraud detection, store location search, and medical diagnosis.
Classification of data records according to certain classes of the records is an important part of data mining. In classification, a set of example records, referred to as a training set or input data, is provided from which a record classifier will be built. Each record of the training set consists of several attributes where the attributes can be either numeric or categorical. Numeric (or continuous) attributes are those from an ordered domain, such as employee age or employee salary. Categorical attributes are those from an unordered domain such as marital status or gender. One of these attributes, called the classifying attribute, indicates the class to which the record belongs. The objective of classification is to build a model of the classifying attribute, or classifier, based upon the other attributes. Once the classifier is built, it can be used to determine the classes of future records.
Classification models have been studied extensively in the fields of statistics, neural networks, and machine learning. They are described, for example, in "Computer Systems that Learn: Classification and Prediction Methods from Statistics," S. M. Weiss and C. A. Kulikowski, 1991. Prior art classification methods, however, lack scalability and usually break down in cases of large training datasets. They commonly require the training set to be small enough to fit in the memory of the computer performing the classification. This restriction arises partly because the prior art methods were designed for applications with relatively few training examples, rather than for data mining applications. Early classifiers thus do not work well in data mining applications.
In the paper "An Interval Classifier For Database Mining Applications," Proc. of the Very Large Database Conference, August 1992, Agrawal et al. described a classifier specially designed for database applications. However, the focus there was on a classifier that can use database indices to improve retrieval efficiency, and not on the size of the training set. The described classifier is therefore not suitable for most data mining applications, where the training sets are large.
Another desirable property of a classifier is a short training time, i.e., the time required to generate the classifier from a set of training records. Some prior art methods address both the execution time and memory constraint problems by partitioning the data into subsets that fit in the system memory and developing classifiers for the subsets in parallel. The outputs of these classifiers are then combined using various algorithms to obtain the final classification. Although this approach reduces running time significantly, studies have shown that the multiple classifiers do not achieve the same level of accuracy as a single classifier built using all the data. See, for example, "Experiments on Multistrategy Learning by Meta-Learning," by P. K. Chan and S. J. Stolfo, Proc. Second Intl. Conf. on Information and Knowledge Management, pp. 314-323, 1993.
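The partition-and-combine approach described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the base learner (a nearest-class-mean rule on a single numeric attribute), the round-robin partitioning, and the majority-vote combiner are hypothetical stand-ins, not the algorithms of the cited study. The subset classifiers are also trained one after another here rather than in parallel.

```python
from collections import Counter
from statistics import mean

def partition(data, k):
    """Split the training records into k subsets (round-robin, an assumption)."""
    return [data[i::k] for i in range(k)]

def train_subset_classifier(subset):
    """Hypothetical base learner: remember the mean age of each class and
    assign a new record to the class whose mean age is closest."""
    ages_by_class = {}
    for age, label in subset:
        ages_by_class.setdefault(label, []).append(age)
    centers = {label: mean(ages) for label, ages in ages_by_class.items()}
    return lambda age: min(centers, key=lambda label: abs(centers[label] - age))

def combine_by_vote(classifiers, age):
    """One simple way to combine outputs: majority vote over the classifiers."""
    votes = Counter(clf(age) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Made-up (age, risk) training records, not taken from any figure.
training = [(18, "High"), (20, "High"), (22, "High"),
            (35, "Low"), (40, "Low"), (50, "Low")]

subsets = partition(training, 3)
classifiers = [train_subset_classifier(s) for s in subsets]
prediction = combine_by_vote(classifiers, 28)
```

Each subset classifier sees only a fraction of the records, so the individual classifiers can disagree on borderline cases such as age 28. This is one intuition for the accuracy loss noted above relative to a single classifier built from all the data.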
Other prior art methods classify data in batches. Such incremental learning methods have the disadvantage that the cumulative cost of classifying data incrementally can sometimes exceed the cost of classifying the entire training set at once. See, for example, "Megainduction: Machine Learning on Very Large Databases," Ph.D. Thesis by J. Catlett, Univ. of Sydney, 1991.
Still other prior art classification methods, including those discussed above, achieve short training times by creating the classifiers based on decision trees. A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from the same class. The tree generally has a root node, interior nodes, and multiple leaf nodes, where each leaf node is associated with the records belonging to a record class. Each non-leaf node of the tree contains a split point, which is a test on one or more attributes that determines how the data records are partitioned at that node. Decision trees are compact, easy to understand, and easy to convert to classification rules or to Structured Query Language (SQL) statements for accessing databases.
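The recursive partitioning described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: it handles numeric attributes only, it selects split points with the Gini index (the text above does not name a split criterion), and the function names and the tuple representation of a node are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure partition)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(records, labels, attrs):
    """Return the (attribute, threshold) test that most reduces the weighted
    impurity, or None if no test improves on the unsplit partition."""
    best, best_score = None, gini(labels)
    n = len(records)
    for attr in attrs:
        values = sorted({r[attr] for r in records})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # candidate split between adjacent values
            left = [y for r, y in zip(records, labels) if r[attr] < threshold]
            right = [y for r, y in zip(records, labels) if r[attr] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (attr, threshold), score
    return best

def build_tree(records, labels, attrs):
    """Recursively partition until a partition is pure or cannot be improved.
    A leaf is a class label; a non-leaf node is (attr, threshold, left, right)."""
    split = None if len(set(labels)) == 1 else best_split(records, labels, attrs)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # dominant class of the leaf
    attr, threshold = split
    left = [(r, y) for r, y in zip(records, labels) if r[attr] < threshold]
    right = [(r, y) for r, y in zip(records, labels) if r[attr] >= threshold]
    return (attr, threshold,
            build_tree([r for r, _ in left], [y for _, y in left], attrs),
            build_tree([r for r, _ in right], [y for _, y in right], attrs))

# Made-up records for illustration only.
records = [{"age": 18}, {"age": 22}, {"age": 30}, {"age": 40}]
labels = ["High", "High", "Low", "Low"]
tree = build_tree(records, labels, ["age"])
```

Note that this sketch re-scans the whole partition for every candidate threshold and keeps all records in memory, which is exactly the kind of cost that becomes prohibitive on large training sets.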
For example, FIG. 1 shows a training set where each record represents a car insurance applicant and includes three attributes: Age, Car Type, and Risk level. FIG. 2 shows a prior art decision tree classifier created from the training records of FIG. 1. Nodes 2 and 3 are two split points that partition the records based on the split tests (Age < 25) and (Car Type in {Sports}), respectively. The records of applicants whose age is less than 25 years belong to the High Risk class associated with node 4. The records of those who are 25 years or older but have a sports car belong to the High Risk class associated with node 5. Other applicants fall into the Low Risk class of node 6. The decision tree can then be used to screen future applicants by classifying them into the High or Low Risk categories.
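To make the example concrete, the tree described above can be written out as a small data structure with a routine that walks it. This is a sketch, not the representation used by any particular classifier: the class names, field names, and the sample applicants are hypothetical, and only the two split tests and the node numbers come from the description of FIG. 2.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    node_id: int
    risk: str  # class associated with this leaf

@dataclass
class Split:
    node_id: int
    test: Callable[[dict], bool]      # split test applied to a record
    if_true: Union["Split", Leaf]     # branch taken when the test holds
    if_false: Union["Split", Leaf]    # branch taken otherwise

def classify(node, record):
    """Follow split tests from the root until a leaf is reached."""
    while isinstance(node, Split):
        node = node.if_true if node.test(record) else node.if_false
    return node.risk

# Tree as described for FIG. 2: node 2 tests (Age < 25), node 3 tests
# (Car Type in {Sports}); nodes 4, 5, and 6 are the leaves.
tree = Split(
    2, lambda r: r["Age"] < 25,
    Leaf(4, "High"),
    Split(3, lambda r: r["Car Type"] in {"Sports"},
          Leaf(5, "High"),
          Leaf(6, "Low")),
)

# Hypothetical applicants (not the records of FIG. 1).
young_driver = {"Age": 23, "Car Type": "Family"}
sports_owner = {"Age": 43, "Car Type": "Sports"}
other = {"Age": 43, "Car Type": "Truck"}
```

Calling `classify(tree, young_driver)` follows the true branch at node 2 and returns the High Risk class of node 4, while `classify(tree, other)` fails both tests and ends at the Low Risk leaf, node 6.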
As another example of decision-tree classifiers, an efficient method for constructing a scalable, fast, and accurate decision-tree classifier is described in the assignee's application "Method and System For Generating a Decision-Tree Classifier For Data Records," Ser. No. 08/564,694 (hereinafter '694 application), now U.S. Pat. No. 5,787,274. The method described there effectively handles disk-resident data that is too large to fit in the system memory by presorting the records, building the tree branches in parallel, and pruning the tree using the Minimum Description Length (MDL) principle. Further, it forms a single decision tree using the entire training set, instead of combining multiple classifiers or partitioning the data. For more details on MDL pruning, see, for example, "MDL-based Decision Tree Pruning," Intl. Conf. on Knowledge Discovery in Databases and Data Mining, pp. 216-221, 1995.
Nevertheless, the method described in the '694 application still has some drawbacks. First, it requires some data per record to stay memory-resident at all times, e.g., a class list containing the attribute values and node IDs. Since the size of this data structure grows in direct proportion to the number of input records, it places a limit on the amount of data that can be classified. Second, in a parallel processing environment such as a multi-processor system, the method does not take advantage of the parallelism of the system to build the decision tree classifier more efficiently across the processors. Such parallel generation of the classifier would lead to both shorter training times and reduced system memory requirements.
Therefore, there remains a need for an efficient method by which the processors of a multi-processor system generate a decision tree classifier in parallel, such that the method is fast and scalable on large training sets and the resulting classifier is compact.