Data mining is an emerging application of computer databases that involves the development of tools for analyzing large databases to extract useful information from them. As an example of data mining, customer purchasing patterns may be derived from a large customer transaction database by analyzing its transaction records. Such purchasing patterns can provide valuable marketing information to retailers, for instance in deciding how to display merchandise or control store inventory. Other applications of data mining include fraud detection, store location selection, and medical diagnosis.
Classification of data records into classes is an important part of data mining. In classification, a set of example records, referred to as a training set or input data, is provided from which a record classifier will be built. Each record of the training set consists of several attributes, which may be either numeric or categorical. Numeric (or continuous) attributes are those from an ordered domain, such as employee age or employee salary. Categorical attributes are those from an unordered domain, such as marital status or gender. One of these attributes, called the classifying attribute, indicates the class to which the record belongs. The objective of classification is to build a model of the classifying attribute, or classifier, based upon the other attributes. Once the classifier is built, it can be used to determine the class of future records.
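The structure of such a training set can be sketched as follows. This is a minimal illustration only; the field names (age, car_type, risk) are hypothetical and chosen to parallel the example of FIG. 1, and are not part of any described embodiment.

```python
# A sketch of a training set: each record has numeric and categorical
# attributes plus a classifying attribute (the class label to model).
from dataclasses import dataclass

@dataclass
class Record:
    age: int        # numeric (continuous) attribute from an ordered domain
    car_type: str   # categorical attribute from an unordered domain
    risk: str       # classifying attribute: the class of the record

training_set = [
    Record(age=23, car_type="Sports", risk="High"),
    Record(age=43, car_type="Sports", risk="High"),
    Record(age=68, car_type="Family", risk="Low"),
]
```

A classifier built from such records would predict the `risk` value of a future record from its `age` and `car_type` attributes.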
Some prior art classification methods achieve short training times by creating the classifiers based on decision trees. A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of records from the same class. The tree generally has a root node, interior nodes, and multiple leaf nodes, where each leaf node is associated with the records belonging to a record class. Each non-leaf node of the tree contains a split point, which is a test on one or more attributes that determines how the data records are partitioned at that node. Decision trees are compact, easy to understand, and readily converted to classification rules or to Structured Query Language (SQL) statements for accessing databases.
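The recursive-partitioning idea described above can be sketched as follows. This is a simplified illustration, not the method of any cited application: it assumes records are (numeric value, class label) pairs on a single attribute and that candidate split points are supplied in advance, whereas a real tree builder would evaluate splits with a goodness measure such as the gini index.

```python
# A minimal sketch of recursive partitioning: split the records at a
# candidate point, recurse on each partition, and stop when a partition
# is entirely one class (or no splits remain), yielding a leaf node.
def build_tree(records, candidate_splits):
    labels = {label for _, label in records}
    if len(labels) == 1 or not candidate_splits:
        # Leaf node: the partition is entirely (or dominantly) one class.
        majority = max(labels,
                       key=lambda c: sum(1 for _, l in records if l == c))
        return {"class": majority}
    split, *rest = candidate_splits
    left = [r for r in records if r[0] < split]    # test: value < split
    right = [r for r in records if r[0] >= split]
    if not left or not right:                      # split separates nothing
        return build_tree(records, rest)
    return {"split": split,
            "left": build_tree(left, rest),
            "right": build_tree(right, rest)}

def classify_value(tree, value):
    # Walk from the root, applying each split test, until a leaf is reached.
    while "class" not in tree:
        tree = tree["left"] if value < tree["split"] else tree["right"]
    return tree["class"]

# Example: age/risk pairs partitioned at the candidate split point 25.
records = [(20, "High"), (22, "High"), (30, "Low"), (40, "Low")]
tree = build_tree(records, [25])
```

Here `classify_value(tree, 21)` follows the left branch of the split at 25 and returns the class of that leaf.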
For example, FIG. 1 shows a training set where each record represents a car insurance applicant and includes three attributes: Age, Car Type, and Risk level. FIG. 2 shows a prior art decision-tree classifier created from the training records of FIG. 1. Nodes 202 and 203 are two split points that partition the records based on the split tests (Age < 25) and (Car Type in {Sports}), respectively. The records of applicants whose age is less than 25 years belong to the High Risk class associated with node 204. The records of those 25 years or older who own a sports car belong to the High Risk class associated with node 205. Other applicants fall into the Low Risk class of node 206. The decision tree can then be used to screen future applicants by classifying them into the High or Low Risk categories.
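Applying the decision tree of FIG. 2 to a new applicant can be sketched as a pair of nested split tests. This is an illustrative rendering of the figure only; the function name is hypothetical.

```python
# A sketch of the FIG. 2 classifier: each non-leaf node applies its
# split test, and each leaf returns the associated risk class.
def classify_applicant(age: int, car_type: str) -> str:
    if age < 25:                  # split test at node 202
        return "High Risk"        # leaf node 204
    if car_type in {"Sports"}:    # split test at node 203
        return "High Risk"        # leaf node 205
    return "Low Risk"             # leaf node 206
```

For instance, `classify_applicant(23, "Family")` yields "High Risk" via node 204, while `classify_applicant(40, "Family")` falls through both tests to the Low Risk leaf at node 206.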
Two of the patent applications filed by Agrawal et al. remove all system memory size limitations in generating decision-tree classifiers, while also addressing efficiency and scalability to large training sets that do not fit in system memory. One of these applications (Ser. No. 641,404) addresses the use of a multi-processor system. However, the amount of time spent on I/O from disk to system memory may still be excessive.
"Fast Serial and Parallel Classification of Very Large Data Bases," Proc. of the Very Large Database Conference, 1996, by Shafer et al. is referenced herein as SPRINT-paper. This paper describes work related to the two co-pending patent applications filed by Agrawal in 1996 (cited in the Related Applications section).
Therefore, there remains a need to reduce these I/O requirements in order to significantly improve the performance of classifier generation.