1. Field of the Invention.
This invention relates in general to computer-implemented classifiers, and in particular to a scalable set-oriented classifier.
2. Description of Related Art.
Information technology has developed rapidly over the last three decades. Many companies are known to have accumulated large volumes of critical business data on magnetic media. It is widely believed that implicit in this business data are patterns that are valuable but not easy to discern.
Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. The extracted information can be used to form a prediction or classification model or to identify relations between rows of a database table.
The classification problem is one in which a large data set (i.e., a training set), consisting of many examples, must be classified. Each example is a row of a table with multiple attributes. Each example is also tagged with a class label identifying the class to which the example belongs. The objective of classification is to develop a classifier based on the examples in the training set. The classifier contains a description (model) for each class. The model is used to classify future data for which the class labels are unknown. See L. Breiman et al., Classification and Regression Trees, Wadsworth, Belmont, 1984, hereinafter [Breiman]; J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993, hereinafter [Quinlan]; S. K. Murthy, On Growing Better Classification Trees from Data, Ph.D. thesis, The Johns Hopkins University, 1995, hereinafter [Murthy]; J. Catlett, Megainduction: Machine Learning on Very Large Databases, Ph.D. thesis, University of Sydney, 1991, hereinafter [Catlett]; each of which is incorporated by reference herein.
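By way of illustration only, the terms used above (training set, example, attribute, class label, model) can be sketched in a few lines of Python. The table contents, the attribute names "age" and "car", the label "risk", and the functions train and classify are all hypothetical and chosen for this sketch. The one-attribute majority rule is a deliberately simple stand-in and is not the classifier of the invention or of any cited reference.

```python
from collections import Counter, defaultdict

# Hypothetical training set: each example is a row of attribute values
# ("age", "car") tagged with a class label ("risk").
TRAINING_SET = [
    {"age": "young", "car": "sports", "risk": "high"},
    {"age": "young", "car": "family", "risk": "high"},
    {"age": "old", "car": "sports", "risk": "high"},
    {"age": "old", "car": "family", "risk": "low"},
    {"age": "old", "car": "truck", "risk": "low"},
]


def train(examples, attribute, label):
    """Build a model mapping each value of `attribute` to its most common class.

    Also returns the overall majority class, used as a fallback for
    attribute values that never appeared in the training set.
    """
    counts = defaultdict(Counter)
    for row in examples:
        counts[row[attribute]][row[label]] += 1
    model = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    default = Counter(row[label] for row in examples).most_common(1)[0][0]
    return model, default


def classify(model, default, row, attribute):
    """Predict the class label of a row whose label is unknown."""
    return model.get(row[attribute], default)


model, default = train(TRAINING_SET, "age", "risk")
prediction = classify(model, default, {"age": "old", "car": "family"}, "age")
```

In this sketch the model is built once from the labeled rows and then applied to a row whose label is unknown, which mirrors the train-then-classify split described above. A practical classifier would consider all attributes and would have to cope with training sets far larger than this one.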
Improving the quality of results and achieving scalability for large data sets are the two problems to be solved in classification. Quality is known to be domain specific (e.g., insurance fraud and target marketing). However, there is a need for a generic solution to the problem of scalability.
Although disk and CPU prices are plunging, the volume of data available for analysis is immense. It cannot be assumed that the data is memory resident. Hence, the practical research thrust is for classifiers that are scalable and accurate. A classifier should scale well; that is, it should work well even if the training set is large and overflows main memory. In data mining applications, it is common to have training sets with several million examples. It is observed in M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996, hereinafter [Mehta], which is incorporated by reference herein, that none of the previously known classification methods scale well. Moreover, data access has traditionally followed a "row at a time" paradigm, and scalability has been addressed individually for each operating system, hardware platform, and architecture. See [Mehta]; J. C. Shafer, R. Agrawal, M. Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. of the 22nd International Conference on Very Large Databases, Mumbai (Bombay), India, September 1996, hereinafter [Shafer]; IBM Intelligent Miner User's Guide, version 1, Document No. SH12-6213-00, IBM Germany, July 1996, hereinafter [IM User's Guide]; each of which is incorporated by reference herein.