The invention relates to the binary classification of observed events for constructing decision trees for use in pattern recognition systems. The observed events may be, for example, spoken words or written characters. Other observed events which may be classified according to the invention include, but are not limited to medical symptoms, and radar patterns of objects.
More specifically, the invention relates to finding an optimal or near optimal binary classification of a set of observed events (i.e. a training set of observed events), for the creation of a binary decision tree. In a binary decision tree, each node of the tree has one input path and two output paths (e.g. Output Path A and Output Path B). At each node of the tree, a question is asked of the form "Is X an element of the set SX?" If the answer is "Yes", then Output Path A is followed from the node. If the answer is "No", then Output Path B is followed from the node.
In general, each observed event to be classified has a predictor feature X and a category feature Y. The predictor feature has one of M different possible values X.sub.m, and the category feature has one of N possible values Y.sub.n, where m and n are positive integers less than or equal to M and N, respectively.
In constructing binary decision trees, it is advantageous to find the subset SX.sub.opt of SX for which the information regarding the category feature Y is maximized, and the uncertainty in the category feature Y is minimized. The answer to the question "For an observed event to be classified, is the value of the predictor feature X an element of the subset SX.sub.opt ?" will then give, on average over a plurality of observed events, the maximum reduction in the uncertainty about the value of the category feature Y.
One known method of finding the best subset SX.sub.opt for minimizing the uncertainty in the category feature Y is by enumerating all subsets of SX, and by calculating the information or the uncertainty in the value of the category feature Y for each subset. For a set SX having M elements, there are 2.sup.(M-1) -1 different possible subsets. Therefore, 2.sup.(M-1) -1 information or uncertainty computations would be required to find the best subset SX. This method, therefore, is not practically possible for large values of M, such as would be encountered in automatic speech recognition.
Leo Breiman et al (Classification And Regression Trees, Wadsworth Inc., Monterey, Calif., 1984, pages 101-102) describe a method of finding the best subset SX.sub.opt for the special case where the category feature Y has only two possible different values Y.sub.1 and Y.sub.2 (that is, for the special case where N=2). In this method, the complete enumeration requiring 2.sup.(M-1) -1 information computations can be replaced by information computations for only (M-1) subsets.
The Breiman et al method is based upon the fact that the best subset SX.sub.opt of the set SX is among the increasing sequence of subsets defined by ordering the predictor feature values X.sub.m according to the increasing value of the conditional probability of one value Y.sub.1 of the category feature Y for each value X.sub.m of the predictor feature X.
That is, the predictor feature values X.sub.m are first ordered, in the Breiman et al method, according to the values of P(Y.sub.1 .vertline.X.sub.m). Next, a number of subsets SX.sub.i of set SX are defined, where each subset SX.sub.i contains only those values X.sub.m having the i lowest values of the conditional probability P(Y.sub.1 .vertline.X.sub.m). When there are M different values of X.sub.m, there will be (M-1) subsets SX.sub.i.
As described by Breiman et al, one of the subsets SX.sub.i maximizes the information, and therefore minimizes the uncertainty in the value of the class Y.
While Breiman et al provided a shortened method for finding the best binary classification of events according to a measurement or predictor variable X for the simple case where the events fall into only two classes or categories Y.sub.1 and Y.sub.2, Breiman et al do not describe how to find the best subset of the predictor variable X when the events fall into more than two classes or categories.