Use of a decision tree including a multiplicity of nodes to classify data is well-known in the art. Decision trees have been studied extensively in the past two decades and employed in many practical applications. For example, one such application is in the area of image and pattern interpretation involving optical character recognition (OCR). The popularity of decision-tree classifiers stems from the fact that the decision-tree idea is intuitively apparent, that training of such classifiers is often straightforward, and that their execution speed is extremely high.
Techniques for devising decision-tree classifiers are described in such papers as: J. Schuermann et al., "A Decision Theoretic Approach to Hierarchical Classifier Design," Pattern Recognition, vol. 17, no. 3, 1984, pp. 359-369; and I. Sethi et al., "Hierarchical Classifier Design Using Mutual Information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-4, no. 4, July 1982, pp. 441-445.
However, the decision-tree classifiers devised according to the traditional techniques often cannot be expanded in complexity without sacrificing their generalization accuracy. The more complex such classifiers are (i.e., the more tree nodes they have), the more susceptible they are to being over-adapted to, or specialized at, the training data which was initially used to train the classifiers. As such, the generalization accuracy of the more complex classifiers is relatively low because they are more likely to commit errors in classifying "unseen" data, which may not closely resemble the training data previously "seen" by the classifiers.
Attempts have been made to improve the generalization accuracy of the decision-tree classifiers. One such attempt calls for reducing the size of a fully-grown decision tree adopted by a classifier by pruning back the tree. That is, the input data to the classifier does not go through every level of the tree. Rather, after the input data reaches a preselected level of the tree, the classifier is forced to decide its class or probable classes to make the classifier more generalized. Another attempt involves use of probabilistic techniques whereby the input data descends through multiple branches of the tree with different confidence measures. Although the generalization accuracy on the unseen data improves in the above attempts, the improvement often comes at the expense of the accuracy in classifying the seen training data, which would otherwise be classified with 100% correctness.
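The pruning approach described above, and the training-accuracy cost it carries, can be illustrated with a minimal sketch. The `Node` structure, the `classify` function, and the `max_depth` parameter are hypothetical names chosen for illustration and are not taken from any cited work; the sketch assumes binary threshold tests and a majority-class decision at the level where descent is stopped.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Class counts of the training samples that reached this node.
    counts: Counter
    # Index of the feature tested at this node; None marks a leaf.
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def classify(node: Node, x: Sequence[float], max_depth: int) -> str:
    """Descend at most max_depth levels, then return the majority class."""
    depth = 0
    while node.feature is not None and depth < max_depth:
        node = node.left if x[node.feature] <= node.threshold else node.right
        depth += 1
    return node.counts.most_common(1)[0][0]


# A small fully-grown tree (assumed for illustration) that classifies
# all five of its training samples correctly.
tree = Node(
    counts=Counter({"A": 3, "B": 2}), feature=0, threshold=0.5,
    left=Node(counts=Counter({"A": 2})),
    right=Node(
        counts=Counter({"A": 1, "B": 2}), feature=1, threshold=0.5,
        left=Node(counts=Counter({"A": 1})),
        right=Node(counts=Counter({"B": 2})),
    ),
)

sample = (1.0, 0.0)  # a training sample of class "A"
full = classify(tree, sample, max_depth=2)    # reaches the pure leaf
pruned = classify(tree, sample, max_depth=1)  # forced to decide one level early
```

In this sketch the training sample is classified correctly by the full tree but is assigned the majority class of an impure node once descent is cut off, which is the loss of accuracy on seen data noted above.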
Recently, decision-tree classifiers including multiple trees were devised by combining trees which were generated heuristically. The designs of these classifiers rely on an input of an ensemble of features representing the object to be classified. One such multiple-decision-tree classifier is described in: S. Shlien, "Multiple Binary Decision Tree Classifiers," Pattern Recognition, vol. 23, no. 7, 1990, pp. 757-763. Each tree in this classifier is designed based on a different criterion directed to a measure of information gain from the features. The criteria used in the tree design include Kolmogorov-Smirnov distance, Shannon entropy measure, and Gini index of diversity. Because of a limited number of such criteria available, the number of trees includable in such a classifier is accordingly limited.
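For reference, the three criteria named above can be sketched using their textbook definitions. The function names below are hypothetical, and the exact formulations used in the cited paper may differ; in particular, the Kolmogorov-Smirnov distance is assumed here to be the maximum gap between the empirical cumulative distributions of one feature for two classes.

```python
import math
from bisect import bisect_right
from typing import Sequence


def shannon_entropy(counts: Sequence[int]) -> float:
    """Entropy, in bits, of a class distribution given as per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini_index(counts: Sequence[int]) -> float:
    """Gini index of diversity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of a feature in two classes."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Illustrative values only: an evenly mixed two-class node, and one
# feature whose values separate the two classes completely.
mixed_entropy = shannon_entropy([5, 5])
mixed_gini = gini_index([5, 5])
separation = ks_distance([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Each criterion scores a candidate split differently, so trees grown under different criteria can differ in structure; this is also why the number of distinct trees is bounded by the number of criteria available.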
Even though it is required that the complete ensemble of features be input to each tree in the prior-art multiple-decision-tree classifiers, the actual number of features used in the tree is oftentimes a fraction of the total number of features input. An effort to increase the number of trees in a classifier by utilizing as many available features as possible is described in: S. Shlien, "Nonparametric classification using matched binary decision trees," Pattern Recognition Letters, vol. 13, no. 2, Feb. 1992, pp. 83-87. This effort involves a feature selection process at each tree node where a usage measure for each feature is computed, and weighed against the information gain if that feature is examined at the node. Specifically, given a choice of features providing a large information gain, the feature with a low usage measure is selected.
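The selection rule just described can be sketched as follows. This is only an illustration under stated assumptions: the function `select_feature`, the `tolerance` parameter, and the rule that a gain within a fixed fraction of the best gain counts as "large" are all hypothetical, since the text does not specify how the usage measure is weighed against the information gain.

```python
from typing import Dict


def select_feature(
    gains: Dict[str, float],
    usage: Dict[str, int],
    tolerance: float = 0.1,
) -> str:
    """Pick the least-used feature among those with near-best information gain.

    gains maps each feature to its information gain at the current node.
    usage maps each feature to how often it has already been examined.
    tolerance is an assumed cutoff: features whose gain is within this
    fraction of the best gain are treated as providing a large gain.
    """
    best = max(gains.values())
    candidates = [f for f, g in gains.items() if g >= best * (1.0 - tolerance)]
    # Prefer low usage; break ties in favor of the higher gain.
    return min(candidates, key=lambda f: (usage.get(f, 0), -gains[f]))


# Hypothetical node: f0 has the best gain but has been used three times,
# while f1 is nearly as informative and has not been used yet.
chosen = select_feature(
    gains={"f0": 0.50, "f1": 0.48, "f2": 0.10},
    usage={"f0": 3, "f1": 0, "f2": 0},
)
```

Here `f2` is excluded because its gain is too small, and `f1` is preferred over `f0` because it has been examined less often.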
Another multiple-decision-tree classifier is described in: S. Kwok et al., "Multiple Decision trees," Uncertainty in Artificial Intelligence, 4, 1990, pp. 327-335. In this classifier, trees are designed based on a modified version of the well-known ID3 algorithm. Pursuant to this modified version, a list of tests, which may be used at a tree node for examining the features, is ranked according to the amount of information gain from using the tests. In order to generate different trees, one selects from the ranked list different subsets of tests to replace the tests adopted by top level nodes of a tree constructed in accordance with the traditional ID3 algorithm.
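The ranking step described above can be sketched as follows. The sketch assumes binary threshold tests of the form (feature index, threshold) and the standard entropy-based information gain used by ID3; the function names and the toy data are hypothetical and are not taken from the cited paper.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Test = Tuple[int, float]  # (feature index, threshold); an assumed test form


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(
    samples: Sequence[Sequence[float]], labels: Sequence[str], test: Test
) -> float:
    """Reduction in entropy obtained by splitting the samples with the test."""
    feature, threshold = test
    left = [y for x, y in zip(samples, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(samples, labels) if x[feature] > threshold]
    if not left or not right:
        return 0.0  # the test does not separate anything
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def rank_tests(
    samples: Sequence[Sequence[float]], labels: Sequence[str], tests: Sequence[Test]
) -> List[Test]:
    """Order candidate tests from highest to lowest information gain."""
    return sorted(
        tests, key=lambda t: information_gain(samples, labels, t), reverse=True
    )


# Toy data: feature 0 determines the class, feature 1 is uninformative.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = ["A", "A", "B", "B"]
ranked = rank_tests(samples, labels, [(1, 0.5), (0, 0.5)])
```

Traditional ID3 would place the first test of `ranked` at the node; under the modified scheme, different trees would be obtained by substituting other tests from the ranked list at the top level nodes.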
Although it appears that the above prior-art multiple-decision-tree classifiers deliver better classification accuracy than the single-decision-tree classifiers, the designers of the multiple-tree classifiers all struggled and failed to heuristically generate a large number of trees for the classifiers. In addition, the designs of such classifiers do not guarantee that the performance of the classifiers can always be improved by adding trees thereto.