There are numerous machine learning algorithms that may be used to classify data variables, where such algorithms may be trained to produce desired classification decisions from a set of data points. Classification algorithms involve an attempt to predict values of a categorical dependent variable (e.g., class, type, group membership, etc.) from predictor variables. For instance, a classification algorithm may be used to predict who is likely to purchase a product based on characteristics or traits of consumers, whether a particular email or program is malicious, e.g., a virus, SPAM, etc., whether a particular transaction is fraudulent, etc.
A decision tree is a machine learning predictive model used to map variables to a classification output. One classification technique known as a “random forest” involves independently training decision trees, such as classification and regression trees (CART), on a set of data points. A random forest is a “forest” of decision trees, where each tree may be randomly perturbed during training on the data points from the other trees to produce independent trees.
In one implementation, as a tree of the forest is grown, not every predictor is considered each time a split is made. Instead, a randomly selected subset of the predictors is considered, and the best possible split which can be created using one of the randomly selected predictors is used. For instance, if there are M variables or features, then during training “m” of those M features may be selected as input variables to each node of the tree, where m may be significantly less than M, such as the square root of M. Each tree may be fully grown using the CART method, but not pruned.
Once the trees are trained they may be used to classify new data by inputting the data to all the trees. A final classification decision is then based on the separate classification decisions reached by all the decision trees in the “forest”, such as by a majority vote or an average of classification values. Random forests have been observed to provide a substantial performance improvement over a single tree classifier, such as a single CART tree.