1. Field of Endeavor
The present invention relates to classification systems and more particularly to decision trees.
2. State of Technology
U.S. Pat. No. 5,787,425 for an object-oriented data mining framework mechanism by Joseph Phillip Bigus, patented Jul. 28, 1998 provides the following description, “The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely sophisticated devices, capable of storing and processing vast amounts of data. As the amount of data stored on computer systems has increased, the ability to interpret and understand the information implicit in that data has diminished. In the past, data was stored in flat files, then hierarchical and network data base systems, and now in relational or object oriented databases. The primary method for analyzing that data has been to form well structured queries, for example using SQL (Structured Query Language), and then to perform simple aggregations or hypothesis testing against that data. Recently, a new technique called data mining has been developed, which allows a user to search large databases and to discover hidden patterns in that data. Data mining is thus the efficient discovery of valuable, non-obvious information from a large collection of data and centers on the automated discovery of new facts and underlying relationships in the data. The term “data mining” comes from the idea that the raw material is the business data, and the data mining algorithm is the excavator, shifting through the vast quantities of raw data looking for the valuable nuggets of business information. Because data can be stored in such a wide variety of formats and because the data values can have such a wide variety of meanings, data mining applications have in the past been written to perform specific data mining operations, and there has been little or no reuse of code between application programs. Thus, each data mining application is written from scratch, making the development process long and expensive. 
Although the nuggets of business information that a data mining application discovers can be quite valuable, they are of little use if they are expensive and untimely discovered. Returning to the mining analogy, even if gold is selling for $900 per ounce, nobody is interested in operating a gold mine if it takes two years and $901 per ounce to get it out of the ground.”
The paper “Creating Ensembles of Decision Trees through Sampling,” by Chandrika Kamath and Erick Cantu-Paz, presented at the 33rd Symposium on the Interface: Computing Science and Statistics, Costa Mesa, Jun. 13-16, 2001, indicates that decision tree ensembles are popular classification methods, and that there are numerous algorithms for introducing randomization into a tree classifier built from a given set of data. The randomization makes each tree in the ensemble different, and the results of the trees can be combined by voting to create a more accurate classifier. Sampling is one way of introducing randomization into the classifier. The traditional methods of creating ensembles of decision trees, such as bagging and boosting, do the sampling at the beginning of the creation of the tree. Thus, each tree in the ensemble is created using slightly different input data. In the present invention, the randomization is done at each node of the tree by using a sample of the instances at the node to make the decision at the node. The resulting ensemble is competitive in accuracy with boosting and bagging and can be superior to them in computational cost. The paper “Creating Ensembles of Decision Trees through Sampling,” by Chandrika Kamath and Erick Cantu-Paz, presented at the 33rd Symposium on the Interface: Computing Science and Statistics, Costa Mesa, Jun. 13-16, 2001, is incorporated herein by this reference.
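The per-node sampling described above can be illustrated with a minimal sketch. It is offered for illustration only and is not the implementation of the cited paper or of the present invention. The split criterion (Gini impurity), the midpoint thresholds, the default sample fraction of 0.5, the stopping rules, and all function and parameter names are assumptions chosen for the example. Each tree sees all of the training data, but the split at each node is chosen using only a random sample of the instances that reach that node; the trees are then combined by majority vote.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels (assumed criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def choose_split(X, y, idx, rng, sample_frac):
    """Choose (feature, threshold) using only a random sample of the node's instances."""
    k = min(len(idx), max(2, int(sample_frac * len(idx))))
    sample = rng.sample(idx, k)
    best = None  # (weighted impurity, feature, threshold)
    for f in range(len(X[0])):
        values = sorted({X[i][f] for i in sample})
        # Candidate thresholds are midpoints between adjacent sampled values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in sample if X[i][f] <= t]
            right = [y[i] for i in sample if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(sample)
            if best is None or score < best[0]:
                best = (score, f, t)
    return None if best is None else (best[1], best[2])


def build_tree(X, y, idx, rng, sample_frac=0.5, min_size=4, depth=0, max_depth=8):
    """Grow one tree; the split is decided on a sample, then applied to all instances."""
    labels = [y[i] for i in idx]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(idx) < min_size or depth >= max_depth:
        return majority  # leaf: predict the majority class
    split = choose_split(X, y, idx, rng, sample_frac)
    if split is None:  # the sample offered no usable threshold
        return majority
    f, t = split
    left = [i for i in idx if X[i][f] <= t]
    right = [i for i in idx if X[i][f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree(X, y, left, rng, sample_frac, min_size, depth + 1, max_depth),
        "right": build_tree(X, y, right, rng, sample_frac, min_size, depth + 1, max_depth),
    }


def predict_tree(node, x):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(node, dict):
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node


def build_ensemble(X, y, n_trees=10, sample_frac=0.5, seed=0):
    """Every tree uses the full data; per-node sampling makes the trees differ."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    return [build_tree(X, y, idx, rng, sample_frac) for _ in range(n_trees)]


def predict_ensemble(trees, x):
    """Combine the individual tree predictions by majority vote."""
    return Counter(predict_tree(t, x) for t in trees).most_common(1)[0][0]


# Toy usage: one numeric feature, class 0 below 10 and class 1 from 10 upward.
X = [[float(i)] for i in range(20)]
y = [0 if i < 10 else 1 for i in range(20)]
trees = build_ensemble(X, y, n_trees=10, sample_frac=0.5, seed=0)
print(predict_ensemble(trees, [0.0]), predict_ensemble(trees, [19.0]))
```

In this sketch the computational saving comes from evaluating candidate splits on the sample rather than on every instance at the node, while the full set of instances is still passed down to the child nodes.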