The present invention relates generally to computer-implemented artificial intelligence systems. More particularly, the present invention relates to computer-implemented neural networks with classification capability.
Classification using decision trees is a widely used nonparametric method in pattern recognition for complex classification tasks. The decision tree methodology is also popular in machine learning as a means of automated knowledge acquisition for expert or knowledge-based systems. As shown in FIG. 1, a decision tree classifier 50 uses a series of tests or decision functions 54, 56, and 60 to determine the identity of an unknown pattern or object. The evaluation of decision functions 54, 56, and 60 is organized in such a way that the outcome of successive decision functions reduces uncertainty about the unknown pattern being considered for classification. Left branches (e.g., left branch 61) correspond to positive outcomes of the tests at the internal tree nodes. Right branches (e.g., right branch 63) correspond to negative outcomes of the tests at the internal tree nodes.
In addition to their capability to generate complex decision boundaries, it is the intuitive nature of decision tree classifiers as evident from FIG. 1 that is responsible for their popularity and numerous applications. Applications of the decision tree methodology include character recognition, power system monitoring, estimating software-development effort, and top-quark detection in high-energy physics among others.
While a decision tree classifier is occasionally determined heuristically, the common approach is to use a learning procedure to automatically configure a decision tree from a set of labeled pattern vectors, i.e. training examples or vectors. Several automatic decision tree induction algorithms exist for this purpose in the pattern recognition and machine learning literature (for example, see L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth Int'l Group, Belmont, Calif., 1984; and S. B. Gelfand, C. S. Ravishankar, and E. J. Delp, "An iterative growing and pruning algorithm for classification tree design," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 163-174, 1991).
Most of these decision tree induction algorithms follow a top-down, divide-and-conquer strategy in which the collection of labeled examples is recursively split to create example subsets of increasing homogeneity in terms of classification labels until predetermined terminating conditions are satisfied.
The top-down decision tree induction methodology consists of the following components: 1) a splitting criterion to determine the effectiveness of a given split on training examples, 2) a method to generate candidate splits, 3) a stopping rule, and 4) a method to set up a decision rule at each terminal node. The last component is handled by following the majority rule. Different decision tree induction methods differ essentially in the remaining three components. In fact, the differences are generally found only in the splitting criterion and the stopping rule.
Three decision tree induction methodologies in pattern recognition and machine learning literature are:
(1) AMIG (see I. K. Sethi and G. P. R. Sarvarayudu, "Hierarchical classifier design using mutual information," IEEE Trans. Patt. Anal. Machine Intell., vol. PAMI-4, pp. 441-445, 1982);
(2) CART (see L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth Int'l Group, Belmont, Calif., 1984); and
(3) ID3 (see J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986).
AMIG and ID3 both use an information-theoretic measure, the average mutual information gain, to select the desired partitioning or split of training examples. Given training examples from $c$ classes, and a partitioning $P$ that divides them into $r$ mutually exclusive partitions, the average mutual information gain measure of the partitioning, $I(P)$, is given as

$$I(P) = \sum_{i=1}^{r} \sum_{j=1}^{c} p(r_i, c_j) \log_2 \frac{p(c_j / r_i)}{p(c_j)}$$
where $p(r_i, c_j)$ and $p(c_j / r_i)$, respectively, are the joint and conditional probabilities and $p(c_j)$ is the class probability. Using the maximum likelihood estimates for probabilities, the above measure can be written as

$$I(P) = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}}{N} \log_2 \frac{n_{ij} N}{N_i n_j}$$
where $n_j$ is the number of training examples from class $c_j$, and $n_{ij}$ is the number of examples of class $c_j$ that lie in partition $r_i$. The quantity $N$ is the total number of training examples, of which $N_i$ lie in partition $r_i$. The split of training examples providing the highest value of $I(P)$ is selected. The CART procedure uses the Gini index of diversity to measure the impurity of a collection of examples. It is given as

$$G = 1 - \sum_{j=1}^{c} p^2(c_j)$$

where $p(c_j)$ here denotes the proportion of examples of class $c_j$ in the collection being measured.
The split providing maximum reduction in the impurity measure is then selected. The advantage of this criterion is its simpler arithmetic.
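The two splitting criteria can be illustrated with a minimal Python sketch. It assumes the maximum likelihood form of $I(P)$ stated above and the standard form of the Gini index (one minus the sum of squared class proportions). The function names and the `counts[i][j]` layout (number of class $c_j$ examples in partition $r_i$) are illustrative assumptions, not part of AMIG, ID3, or CART as published.

```python
from math import log2

def mutual_information_gain(counts):
    """Average mutual information gain I(P) from a table counts[i][j] = n_ij."""
    N = sum(sum(row) for row in counts)          # total number of examples
    n_j = [sum(col) for col in zip(*counts)]     # examples per class
    total = 0.0
    for row in counts:
        N_i = sum(row)                           # examples in partition r_i
        for n_ij, nj in zip(row, n_j):
            if n_ij > 0:                         # empty cells contribute nothing
                total += (n_ij / N) * log2(n_ij * N / (N_i * nj))
    return total

def gini(class_counts):
    """Gini index of diversity for one collection of examples."""
    n = sum(class_counts)
    return 1.0 - sum((k / n) ** 2 for k in class_counts)

def gini_reduction(counts):
    """Impurity of the whole collection minus the weighted impurity of its partitions."""
    N = sum(sum(row) for row in counts)
    n_j = [sum(col) for col in zip(*counts)]
    weighted = sum(sum(row) / N * gini(row) for row in counts if sum(row) > 0)
    return gini(n_j) - weighted

# A split that separates two balanced classes perfectly, and one that does nothing.
perfect = [[4, 0], [0, 4]]
useless = [[2, 2], [2, 2]]
print(mutual_information_gain(perfect), gini_reduction(perfect))  # 1.0 0.5
print(mutual_information_gain(useless), gini_reduction(useless))  # 0.0 0.0
```

Both measures rank the perfect split above the uninformative one; the Gini computation avoids logarithms, which is the simpler arithmetic referred to above.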
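Determining when to stop the top-down splitting of successive example subsets is the other important part of a decision tree induction procedure. For stopping, the AMIG procedure relies on the following inequality, which specifies the lower limit on the mutual information, $I(\mathrm{tree})$, to be provided by the induced tree:

$$I(\mathrm{tree}) \geq -\sum_{j=1}^{c} p(c_j) \log_2 p(c_j) + p_{error} \log_2 p_{error} + (1 - p_{error}) \log_2 (1 - p_{error}) - p_{error} \log_2 (c - 1)$$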
where $p_{error}$ is the acceptable error rate. The tree growing stops as soon as the accumulated mutual information due to successive splits exceeds $I(\mathrm{tree})$. CART and ID3 instead follow a more complex but better approach of growing and pruning to determine the final induced decision tree. In this approach, the recursive splitting of training examples continues until 100% classification accuracy on them is achieved. At that point, the tree is selectively pruned upwards to find the best subtree according to some specified cost measure.
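The AMIG stopping rule can be sketched numerically as follows. The sketch assumes that the lower limit takes the Fano-inequality form, that is, the class entropy minus the binary entropy of $p_{error}$ minus $p_{error} \log_2 (c - 1)$; that form, the function names, and the choice to stop when the accumulated information reaches (rather than strictly exceeds) the limit are assumptions made for illustration.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a two-outcome event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def required_tree_information(class_probs, p_error):
    """Assumed Fano-type lower limit on I(tree) for an acceptable error rate."""
    c = len(class_probs)
    class_entropy = -sum(p * log2(p) for p in class_probs if p > 0)
    # The last term vanishes for two classes because log2(1) = 0.
    spread = p_error * log2(c - 1) if c >= 2 else 0.0
    return class_entropy - binary_entropy(p_error) - spread

def should_stop(accumulated_information, class_probs, p_error):
    """Stop growing once the splits so far supply the required information."""
    return accumulated_information >= required_tree_information(class_probs, p_error)

# Two equally likely classes: zero tolerated error demands the full 1 bit,
# while tolerating 10% error lowers the requirement to about 0.531 bits.
print(required_tree_information([0.5, 0.5], 0.0))
print(round(required_tree_information([0.5, 0.5], 0.1), 3))
```

A larger acceptable error rate lowers the required information, so tree growing terminates earlier.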
The generation of candidate splits at any stage of the decision tree induction procedure is done by searching for splits due to a single feature. For example, in AMIG, CART, and ID3, each top-down data split takes either the form "is $x_i \geq t$?" when the attributes are ordered variables or the form "is $x_i$ true?" when the attributes are binary in nature. The reason for using single feature splits is to reduce the size of the space of legal splits. For example, with $n$ binary features, a single feature split procedure has to evaluate only $n$ different splits to determine the best split. On the other hand, a multifeature split procedure must search through a very large number of Boolean combinations, $2^{2^n}$ logical functions if searching over all possible Boolean functions, to find the best split.
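The difference in search space size can be made concrete with a short Python sketch; the function names are illustrative only.

```python
def single_feature_split_count(n):
    """Number of candidate splits when each split tests one of n binary features."""
    return n

def boolean_function_count(n):
    """Number of distinct Boolean functions of n binary features: 2^(2^n)."""
    return 2 ** (2 ** n)

for n in (2, 3, 5):
    print(n, single_feature_split_count(n), boolean_function_count(n))
# 2 2 16
# 3 3 256
# 5 5 4294967296
```

Even at five binary features, an exhaustive search over Boolean functions already involves more than four billion candidates, against five single feature splits.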
Due to single feature splits, the decision tree induction procedures in practice often create large unwieldy trees that translate into production rules or concept descriptions that are not concise and do not generalize well. Another deficiency of single feature splits is their relative susceptibility to noise in comparison with multifeature splits.
In addition to using only a single feature split scheme to determine successive splits of the training examples, the top-down induction procedures have no look-ahead component in their splitting strategy, i.e. the evaluation of the goodness of a split, single or multifeature, at any stage of partitioning does not take into account its effect on future partitions. This is a major drawback which in many instances leads to large decision trees which yield lower classification accuracy and do not clearly bring out the relationships present in the training examples. The set of labeled examples of Table 1 illustrates this point.
TABLE 1
$x_1$   $x_2$   $x_3$   $f$
0       0       0       0
0       0       1       1
0       1       0       0
0       1       1       0
1       0       0       0
1       0       1       1
1       1       0       1
1       1       1       1
Using the AMIG/ID3 tree induction procedure, feature $x_1$ or $x_3$ yields the best splitting measure value. Selecting $x_1$ and continuing with the procedure, decision tree 80 of FIG. 2a is obtained. On the other hand, if feature $x_2$ is selected (which yields the worst splitting measure value at the root node according to the AMIG/ID3 criterion), decision tree 90 of FIG. 2b is obtained. Not only is decision tree 90 of FIG. 2b smaller than decision tree 80 of FIG. 2a, but decision tree 90 also brings out clearly and concisely the relationship present in the data.
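The behavior described for Table 1 can be checked with a minimal Python sketch of greedy, information-gain-driven splitting. The helper names, the tie-breaking order, and the use of internal node count as the size measure are assumptions of the sketch; it is not a reproduction of FIG. 2a or FIG. 2b.

```python
from math import log2
from collections import Counter

# Rows of Table 1 as ((x1, x2, x3), f); feature indices 0, 1, 2 stand for x1, x2, x3.
TABLE_1 = [
    ((0, 0, 0), 0), ((0, 0, 1), 1), ((0, 1, 0), 0), ((0, 1, 1), 0),
    ((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
]

def entropy(rows):
    """Class entropy in bits of a non-empty list of labeled rows."""
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())

def gain(rows, feature):
    """Information gain of splitting rows on one binary feature."""
    n = len(rows)
    remainder = 0.0
    for value in (0, 1):
        part = [r for r in rows if r[0][feature] == value]
        if part:
            remainder += len(part) / n * entropy(part)
    return entropy(rows) - remainder

def count_internal_nodes(rows, features, forced_root=None):
    """Grow a tree greedily (optionally forcing the root test) and count its tests."""
    if len({label for _, label in rows}) <= 1 or not features:
        return 0  # pure or empty subset, or no features left: terminal node
    if forced_root is not None:
        best = forced_root
    else:
        best = max(features, key=lambda f: gain(rows, f))
    rest = [f for f in features if f != best]
    total = 1
    for value in (0, 1):
        part = [r for r in rows if r[0][best] == value]
        total += count_internal_nodes(part, rest)
    return total

print([round(gain(TABLE_1, f), 3) for f in (0, 1, 2)])        # [0.189, 0.0, 0.189]
print(count_internal_nodes(TABLE_1, [0, 1, 2]))                # 5
print(count_internal_nodes(TABLE_1, [0, 1, 2], forced_root=1)) # 3
```

At the root, $x_2$ has zero gain, yet forcing it as the first test yields a tree with three internal nodes instead of five, because the data satisfy $f = x_3$ when $x_2 = 0$ and $f = x_1$ when $x_2 = 1$. A criterion without look-ahead cannot see this.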
Although decision trees have been successfully used in many applications as mentioned above, problems exist that hamper their use and performance in many instances. These problems arise, among other reasons, due to the splitting step used in practice in the top-down tree induction process. As a result, several neural network based solutions have been proposed in recent years for the induction of decision tree classifiers (see, M. Golea and M. Marchand, "A growth algorithm for neural network decision trees," Europhysics Letters, vol. 12, pp. 205-210, 1990; and A. Sankar and R. J. Mammone, "Growing and pruning neural tree networks," IEEE Trans. Computer, Vol. 42, No. 3, pp. 291-299, 1993). These solutions are mainly concerned with providing a multifeature split capability to decision tree induction methods through neural learning algorithms.
Despite such efforts, an important weakness of the decision tree induction methodology still remains. This weakness in the methodology is due to the sequential nature of the induction procedure followed in current neural and non-neural decision tree methods, i.e. the successive splits are determined one after the other and none of the splitting criteria in practice has a look-ahead component.
Since the resurgence of artificial neural networks in the early eighties, there have been several neural approaches to the decision tree classification methodology. These previous approaches are mainly focused on providing multifeature split capability to the decision tree induction process. Many of these previous approaches were motivated by the topology problem of fully connected feedforward networks rather than as solutions to a decision tree induction problem.
A number of the neural approaches to determine multifeature splits in decision trees use Gallant's pocket algorithm for single perceptron training (see, for example, S. I. Gallant, "Optimal linear discriminants," Proc. 8th Int. Conf. Pattern Recognition, pp. 849-852, 1986). The pocket algorithm is a modification of the classical two-class perceptron learning rule that exploits its cyclic behaviour to determine an optimal separating hyperplane, i.e. a hyperplane providing the minimum number of misclassifications, regardless of the separability of the training examples. The pocket algorithm consists of applying the perceptron learning rule with a random ordering of training examples. In addition to the current perceptron weight vector, the pocket algorithm maintains another weight vector, the so-called pocket vector, that is the best linear discriminator found thus far. Whenever the performance of the current perceptron weight vector, measured in terms of the length of its correct classification streak, exceeds that of the pocket vector, it automatically replaces the pocket vector. This ensures that the pocket vector is always the best discriminator found up to any point in training. Examples of the early work using the pocket algorithm for induction of decision trees with multifeature capability include the perceptron tree of Utgoff and the neural tree of Golea and Marchand (see P. E. Utgoff, "Perceptron trees: A case study in hybrid concept representation," Proc. Nat'l Conf. Artificial Intelligence, pp. 601-606, St. Paul, Minn., 1988; and M. Golea and M. Marchand, "A growth algorithm for neural network decision trees," Europhysics Letters, vol. 12, pp. 205-210, 1990).
Both of these approaches are considered to yield poor induction results when the training data consists of uneven populations. A reason for this lies in the use of the correct classification streak as a performance measure for pocketing a weight vector. It has been shown that this performance measure has a tendency to favor a weight vector that consistently misclassifies all training examples of the minority class when the training data consists of uneven populations (see P. E. Utgoff and C. E. Brodley, "An incremental method for finding multivariate splits for decision trees," Proc. 7th Int. Conf. Machine Learning, pp. 58-65, Austin, Tex., June 1990). This drawback of the pocket algorithm for decision tree induction was addressed by Sethi and Yoo (see I. K. Sethi and J. H. Yoo, "Design of multicategory multifeature split decision trees using perceptron learning," Pattern Recognition, Vol. 27, No. 7, pp., 1994).
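For concreteness, the streak-based pocket algorithm discussed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Gallant's published procedure: labels are taken to be -1 or +1, a bias weight is appended as the last component, examples are drawn uniformly at random, and the names `pocket_train` and `predict` are hypothetical.

```python
import random

def predict(weights, x):
    """Classify x as +1 or -1 using a weight vector whose last entry is the bias."""
    s = sum(w * xi for w, xi in zip(weights, list(x) + [1.0]))
    return 1 if s > 0 else -1

def pocket_train(examples, n_iterations=1000, seed=0):
    """Perceptron learning with a pocket vector selected by correct-classification streak."""
    rng = random.Random(seed)
    dim = len(examples[0][0]) + 1
    w = [0.0] * dim              # current perceptron weight vector
    pocket = list(w)             # best weight vector found so far
    run, pocket_run = 0, 0       # current streak and the pocket vector's streak
    for _ in range(n_iterations):
        x, y = rng.choice(examples)          # random ordering of training examples
        xa = list(x) + [1.0]
        s = sum(wi * xi for wi, xi in zip(w, xa))
        if y * s > 0:
            run += 1
            if run > pocket_run:             # longer streak: replace the pocket vector
                pocket, pocket_run = list(w), run
        else:
            # Misclassification: apply the perceptron update and restart the streak.
            w = [wi + y * xi for wi, xi in zip(w, xa)]
            run = 0
    return pocket

# Logical AND of two binary inputs, a linearly separable two-class problem.
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), 1)]
w = pocket_train(data, n_iterations=2000, seed=0)
print([predict(w, x) for x, _ in data])
```

Because the pocket vector is chosen by streak length alone, a weight vector that always predicts the majority class can accumulate long streaks when one class dominates the sample, which is the weakness noted above for uneven populations.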
Although the use of neural learning has shown how to generate multifeature splits, the problem still remains with respect to the sequential nature of tree growing or the absence of any look-ahead component in determining tree splits.
The present invention overcomes these and other disadvantages found in previous approaches. In accordance with the teachings of the present invention, a computer-implemented apparatus and method are provided for constructing a decision tree for computer-implemented information processing. A tree structure of a predetermined size is constructed with empty internal nodes, including at least one terminal node and at least one internal node. Training vectors of predetermined classification are used to determine splits for each decision tree node. The training vectors include a back propagation component that determines the splits for each internal node. The training vectors include a competitive learning component that controls the number of terminal nodes, thereby determining the effective size of the decision tree.
A feature of the present invention is that it generates compact trees that have multifeature splits at each internal node, determined on a global rather than a local basis; consequently, it produces decision trees yielding better classification and interpretation of the underlying relationships in the data. Moreover, since the decision making in backpropagation networks is typically considered opaque, the present invention permits classification that makes its decision-making process apparent while classifying an unknown example, while maintaining a performance level similar to that of the backpropagation network.
For a more complete understanding of the present invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.