Classification problems arise naturally in a variety of fields ranging from engineering and medicine to physics, chemistry, and biology. In most of these problems the number of observed features may run to hundreds or even thousands. Hence, it is inevitable that these features contain information that is either redundant or irrelevant to the classification task.
Since the classifier is constructed from a finite training-set, which is often small, it is important to base the classifier only on those features whose relevance to the classification task has been clearly demonstrated by the training-set. If less relevant features are included, the danger is that the classifier will be too finely tuned to the training-set, and as a result its performance on future data will drastically deteriorate.
There is also a need, motivated by economic and practical considerations, to select a minimum number of relevant features; the smaller the number of features selected, the smaller the number of measurements to be made, stored, and processed, and hence the less costly and complicated the classification procedure.
The problem of choosing the best subset of features has been extensively investigated (see e.g. Kanal, L. (1974): "Patterns in Pattern Recognition," IEEE Trans. on Information Theory, Vol. 20, pp. 697-722). It is well known that the optimal selection rule involves an exhaustive search over all the possible feature subsets. A direct search, or even more sophisticated searches using techniques such as branch and bound (Narendra, P. M. and Fukunaga, K. (1977): "A Branch and Bound Algorithm for Feature Subset Selection," IEEE Trans. on Computers, Vol. 26, pp. 917-922) or dynamic programming (Chang, C. Y. (1973): "Dynamic Programming as Applied to Subset Selection in Pattern Recognition Systems," IEEE Trans. on Systems, Man and Cybernetics, Vol. 3, pp. 166-171), are beyond the capabilities of present-day computers even for a moderate number of features. For this reason, suboptimal search procedures have been proposed. In these schemes, the best feature subset is constructed sequentially by updating the current feature subset by only one feature at a time until a chosen criterion is minimized. If the starting point is the empty set, the method is referred to as "bottom up", while if the starting point is the complete set of features, the method is referred to as "top down". It should be emphasized, however, as pointed out by Elashoff, J. D., Elashoff, R. M. and Goldman, G. E. (1967): "On the Choice of Variables in Classification Problems with Dichotomous Variables," Biometrika, Vol. 54, pp. 668-670; Toussaint, G. T. (1971): "Note on Optimal Selection of Independent Binary-Valued Features for Pattern Recognition," IEEE Trans. on Information Theory, Vol. 17, p. 617; and Cover, T. M. (1974): "The Best Two Independent Measurements Are Not the Two Best," IEEE Trans. on Systems, Man and Cybernetics, Vol. 4, pp. 116-117, that sequential selection cannot guarantee optimality of the feature subset even if the features are statistically independent.
Moreover, as pointed out by Cover, T. M. and Van Campenhout, J. M. V. (1977): "On the Possible Orderings in the Measurement Selection Problem," IEEE Trans. on Systems, Man and Cybernetics, Vol. 7, pp. 657-661, it is theoretically possible that these suboptimal schemes will yield the worst possible set of features. However, such extreme cases are undoubtedly rare, and sequential approaches appear to work quite well.
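The sequential "bottom up" search described above, and the way it can miss the optimal subset, can be illustrated with a minimal sketch. The function names, the feature names `a`, `b`, `c`, and the error values in `TOY_ERROR` are hypothetical and chosen only for illustration; the stopping rule (stop when no single added feature lowers the criterion) is one assumed reading of "until a chosen criterion is minimized".

```python
from itertools import combinations

# Hypothetical criterion values (e.g. estimated error) for every feature subset.
TOY_ERROR = {
    frozenset(): 0.50,
    frozenset("a"): 0.30, frozenset("b"): 0.35, frozenset("c"): 0.40,
    frozenset("ab"): 0.28, frozenset("ac"): 0.25, frozenset("bc"): 0.10,
    frozenset("abc"): 0.26,
}


def toy_criterion(subset):
    """Look up the illustrative criterion value for a list of features."""
    return TOY_ERROR[frozenset(subset)]


def forward_select(features, criterion):
    """'Bottom up' search: start empty and add one feature at a time."""
    selected, remaining = [], list(features)
    best_value = criterion(selected)
    while remaining:
        # Try each single-feature extension and keep the best one.
        candidate = min(remaining, key=lambda f: criterion(selected + [f]))
        value = criterion(selected + [candidate])
        if value >= best_value:  # no extension improves the criterion
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_value = value
    return selected, best_value


def exhaustive_select(features, criterion):
    """Optimal search: evaluate all 2**n subsets of the n features."""
    subsets = (list(s) for r in range(len(features) + 1)
               for s in combinations(features, r))
    best = min(subsets, key=criterion)
    return best, criterion(best)


print(forward_select(["a", "b", "c"], toy_criterion))     # (['a', 'c'], 0.25)
print(exhaustive_select(["a", "b", "c"], toy_criterion))  # (['b', 'c'], 0.1)
```

With these illustrative numbers the sequential search commits to `a` first and never recovers the better pair `b`, `c`, while the exhaustive search finds it at the cost of evaluating every subset.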
One sequential "bottom up" approach that leads to a very simple classification scheme is based on a decision tree. In this tree-structured classification scheme, a sequence of tests, determined by a path in the tree that starts at the root, is performed. The path taken at each node depends on the result of the test performed at that node. When a terminal node is reached, the object is assigned to the class associated with that node.
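The classification pass just described can be sketched as follows. This is a minimal illustration assuming binary-valued features and a binary tree; the `Node` class, the `classify` function, and the feature and class names in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None   # feature tested at an internal node
    left: Optional["Node"] = None   # branch taken when the feature is 0
    right: Optional["Node"] = None  # branch taken when the feature is 1
    label: Optional[str] = None     # class-tag, set only on terminal nodes


def classify(node, sample):
    """Follow the path from the root until a terminal node is reached."""
    while node.label is None:
        node = node.right if sample[node.feature] else node.left
    return node.label


# Illustrative tree: test "fever" at the root, then "cough" on one branch.
tree = Node("fever",
            left=Node(label="healthy"),
            right=Node("cough",
                       left=Node(label="other"),
                       right=Node(label="flu")))

print(classify(tree, {"fever": 1, "cough": 1}))  # flu
print(classify(tree, {"fever": 0, "cough": 1}))  # healthy
```

Only the features on the path actually taken are examined, which is why the second sample is classified without its "cough" value ever being tested.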
Three major problems are encountered in constructing a tree-structured classifier. The first problem is how to grow the tree; that is, how to choose the features to be used for "splitting" the nodes. The second problem is how to prune the tree; that is, how to choose the terminal nodes of the tree. The third and easiest problem is how to choose the class-tag to be associated with each terminal node.
Various approaches to the construction of a tree-structured classifier exist. All of them solve the last problem identically: the training-set is run through the tree and the number of samples from each class that reach each terminal node is counted. The class-tag assigned to a terminal node is the one with the largest count at that node. The first two problems, namely, the growing and pruning of the tree, are far more subtle and do not have a single clear-cut solution. Growing is usually guided by measures such as entropy (Lewis, P. M. (1962): "The Characteristic Selection Problem in Recognition Systems," IRE Trans. on Information Theory, Vol. 8, pp. 171-178, and Casey, R. G. and Nagy, G. (1984): "Decision Tree Design Using Probabilistic Model," IEEE Trans. on Information Theory, Vol. 30, pp. 191-199) and the Gini index (Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. (1984): Classification and Regression Trees, Wadsworth International Group, Belmont, Calif.), while pruning is done by estimating the misclassification error (Mabbet, A., Stone, M. and Washbrook, K. (1980): "Cross-Validatory Selection of Binary Variables in Differential Diagnosis," Appl. Statist., Vol. 29, pp. 198-204, and Breiman et al. (supra, 1984)).
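The class-tag rule described above (largest class count at each terminal node) can be sketched in a few lines. The function name and its inputs are hypothetical: `leaf_ids[i]` is assumed to identify the terminal node reached by training sample `i`, and `labels[i]` is that sample's class.

```python
from collections import Counter, defaultdict


def assign_class_tags(leaf_ids, labels):
    """Tag each terminal node with the class that reaches it most often."""
    counts = defaultdict(Counter)
    for leaf, label in zip(leaf_ids, labels):
        counts[leaf][label] += 1
    # most_common(1) returns the (class, count) pair with the largest count.
    return {leaf: c.most_common(1)[0][0] for leaf, c in counts.items()}


print(assign_class_tags([0, 0, 0, 1, 1], ["A", "B", "A", "B", "B"]))
# {0: 'A', 1: 'B'}
```

Ties are broken here by whichever class was counted first, which is an arbitrary choice of this sketch rather than something the text specifies.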
No prior art method determines both which features to use for growing a tree and when to prune it. Consequently, the criterion required for pruning a tree is always different from the criterion used for growing it. The entropy-based method selects, for each node, the feature that minimizes the "entropy" of the information to be stored at the node. This entropy method never indicates when it is desirable to prune the tree.
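As a rough illustration of entropy-based growing, the sketch below picks the binary feature whose split leaves the lowest sample-weighted class entropy in the resulting child nodes. This is one common formulation and is assumed here; it is not necessarily the exact measure used in the cited references. All function and feature names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(samples, labels, feature):
    """Sample-weighted entropy of the two children of a binary split."""
    n = len(labels)
    total = 0.0
    for value in (0, 1):
        part = [y for x, y in zip(samples, labels) if x[feature] == value]
        if part:  # skip an empty child
            total += len(part) / n * entropy(part)
    return total


def best_feature(samples, labels, features):
    """Choose the feature whose split minimizes the weighted entropy."""
    return min(features, key=lambda f: split_entropy(samples, labels, f))


samples = [{"f1": 0, "f2": 0}, {"f1": 0, "f2": 1},
           {"f1": 1, "f2": 0}, {"f1": 1, "f2": 1}]
labels = ["A", "A", "B", "B"]
print(best_feature(samples, labels, ["f1", "f2"]))  # f1
```

Note that nothing in this computation says when to stop splitting: the weighted entropy never increases when a node is split, which is consistent with the observation that the entropy method gives no pruning signal.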
The Gini index tends to find the split that makes the probabilities at the nodes close to equal. It is a merely intuitive approach to growing a tree, with no provable properties. Further, the Gini index provides no pruning criterion.
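For reference, the Gini index in the form used by Breiman et al. is one minus the sum of squared class proportions at a node. The sketch below computes it for a node and for a two-way split; the function names are hypothetical, and the sample-weighted average over children is an assumed (though standard) way to score a split.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_split(left_labels, right_labels):
    """Sample-weighted Gini index of the two children of a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
        + (len(right_labels) / n) * gini(right_labels)


print(gini(["A", "A", "B", "B"]))               # 0.5
print(gini_of_split(["A", "A"], ["B", "B"]))    # 0.0
print(gini_of_split(["A", "B"], ["A", "B"]))    # 0.5
```

As with entropy, the score only ranks candidate splits at a node; it carries no indication of when a node should be left unsplit.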
The prior pruning techniques operate by building a complete decision tree using all of the features that are measured and then collapsing the tree by examining each node and minimizing the number of misclassification errors plus a fudge factor that is subjectively assigned to the problem, most often based on the size of the tree. To obtain good results in most cases, such subjectively assigned fudge factors must be selected by experts in the classification problem involved.
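One way to picture this kind of pruning is as a penalized cost: the number of misclassification errors plus a penalty proportional to the number of terminal nodes. The sketch below assumes that form (it resembles the cost-complexity measure of Breiman et al.); the parameter `alpha` stands in for the subjectively chosen fudge factor, and all names and numbers are hypothetical.

```python
def penalized_cost(errors, leaves, alpha):
    """Misclassification errors plus a size penalty of alpha per terminal node."""
    return errors + alpha * leaves


def should_collapse(subtree_errors, subtree_leaves, node_errors, alpha):
    """True when replacing a subtree by a single terminal node does not raise the cost."""
    collapsed = penalized_cost(node_errors, 1, alpha)
    kept = penalized_cost(subtree_errors, subtree_leaves, alpha)
    return collapsed <= kept


# A subtree with 3 terminal nodes makes 4 errors; collapsed to one node it makes 6.
print(should_collapse(4, 3, 6, alpha=0.5))  # False: 6.5 > 5.5, keep the subtree
print(should_collapse(4, 3, 6, alpha=2.0))  # True: 8.0 <= 10.0, collapse it
```

The same subtree is kept or collapsed depending solely on `alpha`, which illustrates why the outcome hinges on how the fudge factor is chosen.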
Another problem with this prior approach to pruning is that predictive estimates of misclassification error cannot be obtained from the same training sample that was used to construct the tree. Thus a fresh sample is needed to determine reasonable pruning characteristics. Often this was approached in the prior art by splitting a training sample into two parts, a first part used to construct the tree and a second part to test the tree for pruning. Thus, not only is the prior approach expensive, requiring expert attention to the selection of fudge factors, but it is wasteful of statistics.
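The sample-splitting practice described above can be sketched as follows. The function name, the default 50/50 split, and the fixed seed are assumptions made for illustration only.

```python
import random


def split_training_sample(samples, grow_fraction=0.5, seed=0):
    """Shuffle the sample and split it into a growing part and a pruning part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * grow_fraction)
    return shuffled[:cut], shuffled[cut:]


grow_part, prune_part = split_training_sample(range(10))
print(len(grow_part), len(prune_part))  # 5 5
```

With an even split, only half of the available samples contribute to constructing the tree, which is the sense in which the approach is wasteful of statistics.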