Classification is a statistical process used to partition a collection of items (e.g. data samples) into homogeneous classes according to their measurable characteristics, or features. Generally speaking, a typical classifier (i.e. a computerized system for performing classification, and often referring to the classification methodology itself) is first trained to recognize and label key patterns in a set of available training samples, and is then used to predict the class membership of future data.
Classifiers, however, are known to take many different forms and make a variety of assumptions that impact their effectiveness when applied to specific problem domains. Some of the issues that arise from these assumptions include: (a) what is the impact of selecting specific distributional models of the data and error processes (e.g., does performance degrade gracefully as the assumptions become increasingly less valid, or is failure catastrophic)?; (b) is the methodology robust to data degradation, including the effects of noisy, correlated, sparse, or missing data?; (c) does the technique readily accommodate different types of data (e.g., interval-ratio, ordinal, categorical, scalar, non-scalar, etc.)?; (e) is there resistance to overtraining?; (f) does the methodology explicitly incorporate identified error costs?; (g) what is the ease of use (e.g., are there extensive parametric tuning requirements)?; (h) are posterior probabilities generated, the presence of which impacts both interpretability and confidence assessment?; (i) how computationally efficient is the technique (specifically, does it readily scale and effectively accommodate large data sets)?; and (j) is training an off-line process only, or do in-line variants exist?
While many previously developed machine learning approaches to classification address some of these issues, selection of the ideal classifier still relies heavily upon the problem domain, the nature of the underlying data, and the solution requirements imposed by the analyst or by the problem domain itself. Hence, no classifier can be said to outperform all others in all cases. That said, a classifier and classification methodology that successfully addresses most or all of the aforementioned issues in some fashion is highly desirable for general, practical use. Previously developed classifiers and classification methodologies include, for example: regularized discriminant analysis (RDA); flexible discriminant analysis (FDA); neural networks; and support vector machines (SVMs).
Random Forest Methodology
One of the most recent advances in classification is the random forest (RF) methodology, which is a non-parametric ensemble approach to machine learning that uses bagging to combine the decisions of multiple classification trees to classify data samples. The random decision forest concept was first proposed by Tin Kam Ho of Bell Labs in 1995 (see [Ho1995]), and later extended and formalized by Leo Breiman, who coined the more general term random forest to describe the classification approach (see [Breiman2001]). As used herein and in the claims, the terms “random forest,” “random forest methodology,” and “RF” refer to the classification concept generally disclosed in the [Breiman2001] reference, and not to the statistical analysis software sold under the trademark RANDOM FORESTS®.
Of the many classifiers that have been developed, few have addressed the aforementioned issues as effectively as the RF, which has been demonstrated to be highly accurate, robust, easy to use, and resistant to overtraining, and to produce posterior class probabilities that enable intuitive interpretation of results. RFs readily address numerous issues that frequently complicate and impact the effectiveness of other classification methodologies leveraged across diverse application domains. In particular, the RF requires no simplifying assumptions regarding distributional models of the feature data and error processes. Thus, there are fewer restrictions on the applications and conditions in which the RF can be effectively applied. Moreover, it easily accommodates different types of data since there are no model parameters that must be estimated from the data. Hence, the RF can be viewed as a nonparametric classification/detection methodology. In modern statistical analysis, this is a highly desirable trait, since parameter estimation is frequently complicated by issues related to data sparseness and imbalance, incorrectly specified models that cause bias or inflated variance, etc. Furthermore, RF is highly robust to overtraining with respect to forest size. As the number of trees in the RF increases, the generalization error, PE*, has been shown to converge and is bounded as follows,
$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}} \tag{1}$$

$$s = 1 - 2 \cdot PE^{*}_{tree} \tag{2}$$

where $\bar{\rho}$ denotes the mean correlation of tree predictions, $s$ represents the average strength of the trees, and $PE^{*}_{tree}$ is the expected generalization error for an individual tree classifier (it is implicitly assumed that $\bar{\rho} \in [0, 1]$ and $s \in (0, 1]$).
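As a rough numerical illustration of Eqs. (1) and (2), the following Python sketch evaluates the bound for a few assumed values. The function names and the sample inputs (a per-tree error of 0.2 and mean correlations of 0.1 and 0.3) are hypothetical and chosen only to show how the bound responds to correlation and strength.

```python
def strength_from_tree_error(pe_tree: float) -> float:
    """Eq. (2): s = 1 - 2 * PE*_tree."""
    return 1.0 - 2.0 * pe_tree


def generalization_error_bound(mean_correlation: float, strength: float) -> float:
    """Eq. (1): upper bound rho_bar * (1 - s^2) / s^2 on the forest error PE*."""
    if not 0.0 <= mean_correlation <= 1.0:
        raise ValueError("mean correlation is assumed to lie in [0, 1]")
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength is assumed to lie in (0, 1]")
    return mean_correlation * (1.0 - strength ** 2) / strength ** 2


# Illustrative (assumed) inputs: each tree errs 20% of the time.
s = strength_from_tree_error(0.2)

# Same strength, two different levels of correlation among trees.
bound_low_corr = generalization_error_bound(0.1, s)
bound_high_corr = generalization_error_bound(0.3, s)

# Same correlation, stronger trees (per-tree error of 10%).
bound_stronger = generalization_error_bound(0.1, strength_from_tree_error(0.1))

print(f"s = {s:.2f}")
print(f"bound (rho = 0.1): {bound_low_corr:.4f}")
print(f"bound (rho = 0.3): {bound_high_corr:.4f}")
print(f"bound (rho = 0.1, stronger trees): {bound_stronger:.4f}")
```

Note that the expression is only an upper bound; for weak or highly correlated trees it can exceed 1, in which case it carries no information.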
Bagging. From Eq. (1), it is apparent that the bound on generalization error decreases as the trees become stronger and less correlated. To reduce the mean correlation, $\bar{\rho}$, among trees in the forest, growth of the trees is generally randomized by using a technique called bagging, in which each tree is trained on a bootstrapped sample of the original training data set, typically referred to as its bagged training set. That is, for each tree, a tree training set of size N is randomly sampled (with replacement) from the original forest training set of size N. Although each bagged training set contains the same number of samples as the original training data, it therefore contains duplicates and is representative of only approximately two-thirds of the original data (about 63.2% of the distinct samples, in expectation, for large N). The remaining samples are generally referred to as the out-of-bag (OOB) data and are frequently used to evaluate classification performance.
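The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `bagged_sample`, the seed, and the training-set size of 10,000 are assumptions made for the example.

```python
import random


def bagged_sample(n_samples: int, rng: random.Random):
    """Draw n_samples indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Out-of-bag samples are those never drawn for this tree.
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000              # assumed forest training set size
in_bag, out_of_bag = bagged_sample(n, rng)

# Fraction of distinct original samples that appear in the bagged set.
unique_fraction = len(set(in_bag)) / n
print(f"in-bag size: {len(in_bag)}")
print(f"distinct in-bag fraction: {unique_fraction:.3f}")
print(f"out-of-bag fraction: {len(out_of_bag) / n:.3f}")
```

The distinct in-bag fraction should land near two-thirds, with the remainder forming the OOB set for that tree.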
Node Splitting. At each node in a typical RF classification tree, m features are randomly selected from the available feature set, D, and the single feature producing the “best” split (according to some predetermined criterion) is used to partition the training data into classes. As stated in the [Breiman2001] reference, small values of m, referred to as the split dimension, relative to the total number of features are normally sufficient for the forest to approach its optimal performance. Large values of m may increase the strength of the individual classification trees, but they also generally induce higher correlation among them, potentially reducing the overall effectiveness of the forest. It is notable that a typical RF node split is a univariate decision, based upon a single feature preferentially selected from a preferably small set of m randomly selected features. Such node splits are locally suboptimal due to the randomness injected by the feature selection scheme. However, this approach encourages diversity among the trees, ultimately improving the classification performance of the forest as a whole. Most efforts to enhance random forests have sought to inject additional randomness into the algorithm while preserving the strength of individual classifiers.
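The feature-selection step at a node might be sketched as follows. The sketch assumes a caller-supplied `split_score` function that rates the best split obtainable on a given feature (higher is better); that function, the helper name `choose_split_feature`, and the toy scores are hypothetical and stand in for whatever "best split" criterion is in use.

```python
import random
from typing import Callable, List, Tuple


def choose_split_feature(
    n_features: int,
    m: int,
    split_score: Callable[[int], float],
    rng: random.Random,
) -> Tuple[int, List[int]]:
    """Pick m of the D features at random and return the best-scoring one."""
    # Random subset of feature indices, drawn without replacement.
    candidates = rng.sample(range(n_features), m)
    # Univariate decision: keep the single candidate with the best split.
    best = max(candidates, key=split_score)
    return best, candidates


# Toy (assumed) scores: feature 7 gives the best split, others are worse.
def toy_score(feature_index: int) -> float:
    return -abs(feature_index - 7)


rng = random.Random(1)
best, candidates = choose_split_feature(n_features=10, m=3, split_score=toy_score, rng=rng)
print(f"candidates: {candidates}, chosen feature: {best}")
```

When m is small, the globally best feature is often absent from the candidate set, which is the source of the locally suboptimal but more diverse splits described above.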
Prediction. Each tree in the forest is grown to the greatest extent possible, i.e. it is grown without pruning until the data at its leaf nodes (i.e. terminal nodes) are homogeneous (i.e. all samples are of a single class), or until some other predefined stopping criterion is satisfied. When the forest has been fully constructed, class predictions are then performed by propagating a new test sample through each tree and assigning a class label, or vote, based upon the leaf node that receives the sample. Typically, the sample is assigned to the class receiving the majority vote, although various voting thresholds may be used to tune the resulting error rates. It is notable that the resulting votes can be viewed as approximately independently and identically distributed (i.i.d.) random variables, and thus, the Laws of Large Numbers imply that the corresponding relative frequencies will converge to the true class-specific probabilities as the number of trees in the forest increases. Moreover, the empirical distribution function from which they are drawn will converge to the true underlying distribution function. Hence, the resulting relative frequencies of votes effectively estimate the true class-specific probabilities and we can threshold upon this distribution to make a classification decision. In other words, the class assignment frequencies resulting from this process can be interpreted as posterior class probabilities.
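The voting step can be illustrated with a short Python sketch. It assumes the per-tree class labels for one test sample have already been collected into a list; the helper names and the five example votes are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def vote_frequencies(tree_votes: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each class among the trees' votes."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return {label: count / total for label, count in counts.items()}


def majority_class(tree_votes: Sequence[Hashable]) -> Hashable:
    """Class receiving the most votes (ties broken by first occurrence)."""
    return Counter(tree_votes).most_common(1)[0][0]


def meets_threshold(tree_votes: Sequence[Hashable], label: Hashable, threshold: float) -> bool:
    """Assign `label` only if its vote frequency reaches the chosen threshold."""
    return vote_frequencies(tree_votes).get(label, 0.0) >= threshold


# Assumed votes from a five-tree forest for a single test sample.
votes = ["a", "b", "a", "a", "b"]
frequencies = vote_frequencies(votes)
print(frequencies)
print(majority_class(votes))
print(meets_threshold(votes, "b", 0.3))
```

Lowering the threshold for a class trades a higher detection rate for that class against more false assignments, which is how voting thresholds can be used to tune the resulting error rates.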
Random Forest Hybrids and Variants
The desirable characteristics of the random forest paradigm have inspired the development of numerous variations and hybrids of this approach in an effort to enhance the ensemble classifier, with varying success. Because the performance of the random forest method has been shown to depend wholly upon the strength of its individual trees as classifiers and the correlation among them, as suggested by Eq. (1), enhancements to RF methodology have generally proceeded with an emphasis upon increasing the diversity of the tree classifiers while maintaining a high average strength. Variations of random forests include, for example: Gini Impurity-based Node Splitting; Rotation Forests (and other techniques that involve a transformation of the feature data prior to building the forest); and CART forests and Logistic regression forests (i.e. forests that use alternative base classifiers).
Gini Impurity-based Node Splitting. As described above for the classical RF method, m features are randomly selected at each node, and the single feature that produces the “best” split of the data is computed. While numerous measures have been used to determine the “best” split (e.g., misclassification error, entropy), one popular criterion for node splitting in RFs is based upon Gini impurity, which measures how far a tree node is from being homogeneous (i.e., pure); a node containing samples of a single class has an impurity of zero. In training a typical decision tree, the ultimate goal is to partition the data into homogeneous regions that can be assigned a predicted class label. Hence, at a given node t, what is sought is the single feature and threshold that maximize the decrease in Gini impurity, which is given by:
$$\Delta I_G(x_j, t) = I_G(t) - \hat{p}_{tL}\, I_G(t_L) - \hat{p}_{tR}\, I_G(t_R) \tag{3}$$

where

$$I_G(t) = \sum_{i=1}^{numClasses} \hat{p}_{ti}\,(1 - \hat{p}_{ti}) \tag{4}$$

Here, $\hat{p}_{ti}$ is the probability of class i estimated from the samples in node t; $\hat{p}_{tL}$ and $\hat{p}_{tR}$ are the proportion of data samples in node t that fall into its left and right child nodes, respectively, based on the split induced by the threshold $x_j$; and $I_G(t_L)$ and $I_G(t_R)$ are computed as in Eq. (4) for the left and right child nodes, respectively.
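Eqs. (3) and (4) can be evaluated directly, as in the following Python sketch. It assumes a single numeric feature, and it assumes that samples with values less than or equal to the threshold go to the left child; that convention, the function names, and the four-sample data set are illustrative choices only.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Eq. (4): sum over classes of p_i * (1 - p_i), estimated from the labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty child contributes nothing to Eq. (3)
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())


def gini_decrease(values: Sequence[float], labels: Sequence[Hashable], threshold: float) -> float:
    """Eq. (3): impurity of the node minus the weighted impurity of its children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]   # assumed convention
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    p_left, p_right = len(left) / n, len(right) / n
    return gini_impurity(labels) - p_left * gini_impurity(left) - p_right * gini_impurity(right)


# Assumed toy node: one feature, two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]

clean_split = gini_decrease(values, labels, threshold=2.5)  # separates the classes
poor_split = gini_decrease(values, labels, threshold=1.5)   # leaves a mixed right child
print(f"decrease at 2.5: {clean_split:.4f}")
print(f"decrease at 1.5: {poor_split:.4f}")
```

In a full node-splitting routine, this quantity would be computed for each candidate threshold of each of the m randomly selected features, and the pair giving the largest decrease would be kept.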
Rotation Forests (and other techniques that involve a transformation of the feature data prior to building the forest). The Rotation Forest, described in the [Rodriguez2006] reference, is a random forest variant that uses Principal Component Analysis (PCA) to transform the training data prior to training the forest. Specifically, to create the training data for a single tree classifier, the feature set is randomly split into K subsets, and PCA is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a single tree. Once the training data set for a base classifier has been determined, the classifier is trained as in the conventional RF algorithm. Class prediction is performed using the new transformed feature set.
CART forests, Logistic regression forests (i.e., forests that use alternative base classifiers). Because the ensemble paradigm leveraged by the RF is highly effective (i.e., the RF significantly outperforms a single tree classifier), many variations on this theme have been developed that utilize an alternative base classifier. For example, classification and regression trees (CARTs), support vector machines (SVMs), and logistic regression models have all been incorporated into an ensemble to improve performance (see [Ho1998]). Such efforts have generally met with limited success. Though each of these individual classifiers is more effective than the typical RF tree, this distinction does not guarantee a more effective ensemble classifier.