1. Field of Endeavor
The present invention relates to classification systems and more particularly to decision trees.
2. State of Technology
U.S. Pat. No. 5,787,425 for an object-oriented data mining framework mechanism by Joseph Phillip Bigus, patented Jul. 28, 1998 provides the following description, “The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely sophisticated devices, capable of storing and processing vast amounts of data. As the amount of data stored on computer systems has increased, the ability to interpret and understand the information implicit in that data has diminished. In the past, data was stored in flat files, then hierarchical and network data base systems, and now in relational or object oriented databases. The primary method for analyzing that data has been to form well structured queries, for example using SQL (Structured Query Language), and then to perform simple aggregations or hypothesis testing against that data. Recently, a new technique called data mining has been developed, which allows a user to search large databases and to discover hidden patterns in that data. Data mining is thus the efficient discovery of valuable, non-obvious information from a large collection of data and centers on the automated discovery of new facts and underlying relationships in the data. The term “data mining” comes from the idea that the raw material is the business data, and the data mining algorithm is the excavator, shifting through the vast quantities of raw data looking for the valuable nuggets of business information. Because data can be stored in such a wide variety of formats and because the data values can have such a wide variety of meanings, data mining applications have in the past been written to perform specific data mining operations, and there has been little or no reuse of code between application programs. Thus, each data mining application is written from scratch, making the development process long and expensive. 
Although the nuggets of business information that a data mining application discovers can be quite valuable, they are of little use if they are expensive and untimely discovered. Returning to the mining analogy, even if gold is selling for $900 per ounce, nobody is interested in operating a gold mine if it takes two years and $901 per ounce to get it out of the ground.”
The paper “Approximate Splitting for Ensembles of Trees Using Histograms,” by Chandrika Kamath, Erick Cantu-Paz, and David Littau, presented at the 2nd SIAM International Conference on Data Mining, Crystal City, Va., Apr. 11-13, 2002, indicates that decision tree ensembles are popular classification methods and that there are numerous algorithms for introducing randomization into a tree classifier using a given set of data. The randomization makes each tree in the ensemble different, and the results of the trees can be combined using voting to create more accurate classifiers. There are several different ways of introducing randomization in the generation of ensembles of decision trees. The most popular approaches, such as boosting and bagging, use sampling to introduce randomization. The Applicants' invention uses histograms to introduce randomization in the classifier. The idea of using histograms to approximate the split at each node of the tree has long been used as a way of reducing the time needed to create a tree from a very large training set. Instead of sorting all the available data instances at each node and considering potential split points between all the attribute values, the histogram approach creates a histogram and uses the bin boundaries as potential split points. Since there are fewer bin boundaries than data instances, the approach using histograms is faster than the approach using sorting. The best bin boundary, according to some splitting criterion, is chosen as the split point at that node of the decision tree. In the present invention, this use of histograms is extended further, and randomization is introduced at each node of the tree by considering an interval around the best bin boundary and randomly selecting a point in this interval as the split point. 
This randomization makes each tree in the ensemble different, and the results of the trees can be combined using voting to create more accurate classifiers. The resulting ensemble is competitive in accuracy and can be superior in computational cost to traditional approaches for creating ensembles based on boosting and bagging. The paper “Approximate Splitting for Ensembles of Trees Using Histograms,” by Chandrika Kamath, Erick Cantu-Paz, and David Littau, presented at the 2nd SIAM International Conference on Data Mining, Crystal City, Va., Apr. 11-13, 2002, is incorporated herein by this reference.
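By way of illustration only, the histogram-based split selection and the voting step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the referenced paper or of the invention. The function names `randomized_histogram_split` and `majority_vote`, the use of equal-width bins, the choice of Gini impurity as the splitting criterion, and the interval of half a bin width on each side of the best boundary are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def gini(counts, total):
    """Gini impurity of a mapping from class label to count."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def randomized_histogram_split(values, labels, n_bins=10, rng=random):
    """Return a split point drawn at random near the best bin boundary.

    Assumptions (not from the source): equal-width bins, Gini impurity,
    and an interval of half a bin width on each side of the boundary.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return None  # attribute is constant at this node; nothing to split
    width = (hi - lo) / n_bins

    # Build the histogram in one pass: per-bin class counts, no sorting.
    bins = [Counter() for _ in range(n_bins)]
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx][y] += 1

    total = Counter(labels)
    n = len(values)
    left = Counter()
    best_score, best_boundary = float("inf"), None
    # Only the interior bin boundaries are evaluated as candidate splits.
    for i in range(n_bins - 1):
        boundary = lo + (i + 1) * width
        left += bins[i]
        right = total - left
        n_left = sum(left.values())
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best_score:
            best_score, best_boundary = score, boundary

    # Randomization: pick a point uniformly in an interval around the best boundary.
    half = width / 2
    return rng.uniform(best_boundary - half, best_boundary + half)


def majority_vote(predictions):
    """Combine the class predictions of the trees in an ensemble by voting."""
    return Counter(predictions).most_common(1)[0][0]
```

Because the candidate splits are limited to the interior bin boundaries, the cost of choosing a split grows with the number of bins rather than with the number of distinct attribute values. Calling the function with differently seeded random number generators yields different split points at the same node, which is what makes the trees in the ensemble differ.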