Data mining is an emerging application of computer databases that involves the development of tools for analyzing large databases to extract useful information from them. As an example of data mining, customer purchasing patterns may be derived from a large customer transaction database by analyzing its transaction records. Such purchasing habits can provide valuable marketing information to retailers in displaying their merchandise or controlling the store inventory. Other applications of data mining include fraud detection, store location search, and medical diagnosis.
Classification of data records according to certain classes of the records is an important part of data mining. In classification, a set of example records, referred to as a training set or input data, is provided from which a record classifier will be built. Each record of the training set consists of several attributes where the attributes can be either numeric or categorical. Numeric (or continuous) attributes are those from an ordered domain, such as employee age or employee salary. Categorical attributes are those from an unordered domain such as marital status or gender. One of these attributes, called the classifying attribute, indicates the class to which the record belongs. The objective of classification is to build a model of the classifying attribute, or classifier, based upon the other attributes. Once the classifier is built, it can be used to determine the classes of future records.
A decision-tree classifier is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from the same class. The tree generally has a root node, interior nodes, and multiple leaf nodes where each leaf node is associated with the records belonging to a record class. Each non-leaf node of the tree contains a split point which is a test on one or more attributes to determine how the data records are partitioned at that node. Decision trees are compact, easy to understand and to be converted to classification rules, or to Structured Query Language (SQL) statements for accessing databases.
For example, FIG. 1 shows a training set where each record represents a car insurance applicant and includes three attributes: Age, Car Type, and Risk level. FIG. 2 is a prior art decision-tree classifier created from the training records of FIG. 1. Nodes 20 and 22 are two split points that partition the records based on the split tests (Age&lt;27) and (Car Type in {Sports}), respectively. The records of applicants whose age is less than 27 years belong to the High Risk class associated with node 24. The records of those older than 27 years but have a sport car belong to the High Risk class associated with node 26. Other applicants fall into the Low risk class of node 28. The decision tree then can be used to screen future applicants by classifying them into the High or Low Risk categories.
Prior to interest in classification for database-centric data mining, it was tacitly assumed that the training sets could fit in memory. Recent work has targeted the massive training sets usual in data mining. Developing classification models using larger training sets can enable the development of higher accuracy models. Examples of recent classification systems that can handle large disk-resident datasets include SLIQ and SPRINT. SLIQ is described in the assignee's U.S. patent application for "Method and System for Generating a Decision-Tree Classifier for Data Records," Ser. No. 08/564,694. It is a scalable decision-tree classifier in which the branches of the tree are built in parallel, breadth-first. However, SLIQ is limited by the amount of data that can be classified and, in a parallel computer such as a multi-processor system, it does not take advantage of the system's parallelism to build the decision tree across the processors.
A serial version of SPRINT was described in the assignee's U.S. patent application for "Method and System for Generating a Decision-Tree Classifier Independent of System Memory Size," Ser. No. 08/646,893. Serial SPRINT also constructs the tree in a breadth-first order, but uses a one-time presorting technique to reduce the cost of numeric attribute evaluation. It initially creates a disk-based attribute list for each attribute in the data, where each entry of the list consists of an attribute value, a class label, and a tuple identifier (tid) of the corresponding data tuple. Initial lists for numeric attributes are sorted by attribute value, while those for categorical attributes stay in unsorted order. All attribute lists are initially associated with the root of the classification tree. As the tree is grown and split into subtrees, the attribute lists are also split. By preserving the order of records in the partitioned lists, no resorting is required. FIG. 3 shows an example of the initial sorted attribute lists associated with the root node 0 of the tree for the training data in FIG. 1 and the resulting partitioned lists for the two children (nodes 1 and 2).
To evaluate split points for the lists, serial SPRINT uses histograms to tabulate the class distributions of the records in an attribute list. Splits for a numeric attribute A are of the form {value(A)&lt;x} where x is a value in the domain of A. Splits for a categorical attribute A are of the form {value(A) in X} where X is a subset of domain(A). The split test is chosen to "best" divide the training records associated with a node. Having found the winning split test for a node, the node's attribute lists are split into two to form the attribute lists for the child nodes (nodes 1-2 in FIG. 3). The attribute list for the winning attribute (Age in our example at the root node 0) is partitioned simply by scanning the list and applying the split test to each record. For the remaining "losing" attributes (Car Type in our example) more work is needed. While dividing the winning attribute, a probe structure (bit mask or hash table) on the tids is created, noting the child where a particular record belongs. While splitting other attribute lists, this structure is consulted for each record to determine the child where this record should be placed. If the probe structure is too big to fit in memory, the splitting takes multiple steps. In each step only a portion of the attribute lists are partitioned.
The serial SPRINT method was oriented to a single processor environment and does not take advantage of the resources in a shared-memory multiprocessor system in building the classifier.
A parallel version of SPRINT is described in the assignee's U.S. patent application for "Method and System for Generating a Decision-Tree Classifier In a Multi-Processor System," Ser. No. 08/641,404 (hereinafter parallel SPRINT application). Like other previous parallel classification methods, parallel SPRINT is primarily oriented to distributed-memory (or shared-nothing) parallel machines. In such a machine, each processor has its own memory and local disks, and communicates with other processors only via passing messages. These methods require a significant amount of communication between the processors in order to evaluate a condition for splitting a tree node, thus limiting the number of possible tests that can be considered.
First, the training examples are distributed equally among all the processors. Each processor generates its own attribute list partitions in parallel by projecting each attribute from training examples it was assigned. FIG. 4 shows a typical partitioning of two attribute lists (for node 0 in FIG. 3) over two processors using parallel SPRINT. Each processor is responsible for splitting its own attribute-list partitions in the split phase. However, since different attribute lists belonging to a processor do not correspond to the same set of records, each processor needs to build a probe structure covering all the records in an attribute list. A processor accomplishes this by exchanging its own tids of the winning attribute list and their corresponding left/right labels with that of all other processors. For example, FIGS. 5 and 6 illustrate an example partitioning of two attribute lists over two processors corresponding to tree nodes 1 and 2, respectively (from FIG. 3) using parallel SPRINT.
The parallel SPRINT method would perform poorly on a shared-memory multiprocessor system because its distributed data structures must be updated for every node, causing much communication among the processors. Each processor would be responsible for processing a roughly 1/P fraction of each attribute list, where Pis the number of processors in the system, and all processors synchronously construct one global decision tree, one node at a time. This, in effect, is like simulating a distributed-memory system on an SMP machine, with additional software for memory and disk management. An arbitrary partitioning of the SMP memory to run a distributed-memory algorithm would result in very inefficient use of the SMP memory. Several data structures must be unnecessarily replicated (e.g., probe structure). The communication software of the distributed-memory system will also need to be mapped into SMP communication primitives. Thus, this approach brings much complexity in programming a distributed-memory system without benefiting from any of its scalability advantages.
In parallel SPRINT, the processors simultaneously grow the same node of a tree by working in parallel on disjoint portions of the attribute list of a numeric attribute to find the best split point in their respective portions. While these portions start off being of equal size, as the tree grows and attribute lists are split, the sizes can start varying substantially resulting in a load imbalance. Performing dynamic load balancing to correct this imbalance is very complex and will introduce significant overhead.
Therefore, there remains a need for an efficient method for generating a decision tree classifier in parallel by the processors of a shared-memory multiprocessor system that is fast, compact, and scalable on large training sets.