The source code for the preferred embodiment, the example of the classification of engine firing conditions, is included with this application as a Microfiche Appendix. The total number of microfiche is two (2) and the total number of frames is three hundred eighteen (318).
The invention is a machine learning method for pattern recognition or data classification. The preferred embodiment described is for detecting or identifying misfires in an automotive engine. The terms "pattern recognition" and "classification" are substantially synonymous because the underlying problem of pattern recognition is deciding how to classify elements of data as either belonging or not belonging to a pattern.
The method of the invention is a novel hybrid of two known, separate techniques: binary classifier trees and Bayesian classifiers. Like machine learning techniques generally, the invention constructs a mapping from known inputs to known outputs. A tree is a data structure having nodes connected by branches terminating in leaves. A binary tree is a tree in which the nodes branch to no more than two subnodes or leaves. A classifier tree is a tree used for classification in which the leaf nodes indicate the classification of the data being classified.
During classification operations, one simply traverses the tree according to some tree traversal method. The leaf reached by the traversal contains the classification of the input data, that is, the data to be classified. In fact, it is the usual practice of prior art to store in each leaf only the classification result. Tree structures having only classifications in the leaves are commonly very complex because all the data needed to search and find the leaf bearing the classification must be available for traversal in the branches. This requirement results in extremely complex trees, so complex and therefore so difficult to operate that it is a common practice of prior art to prune portions of the tree in order to simplify operation, despite the fact that such pruning results in at least some level of inaccuracy in classification.
Bayesian classifiers are statistical techniques that take advantage of Bayes' Law, a well-known statistical equation. Simple Bayesian classifiers model classes of data as Gaussian kernels, multi-dimensional Gaussian probability distributions defined by statistical moments such as means and covariances. Advanced Bayesian classifiers model classes of data as mixtures of kernels, combinations of kernels with statistical weighting factors. Bayesian classifiers use statistics such as means and covariances abstracted from training data to construct kernels and mixtures. Bayesian classifiers use the kernels and mixtures so constructed in conjunction with distance metrics to classify data.
Simple Bayesian classifiers only work well when the training data and classification data are both normally distributed because of their reliance on statistical moments that lose validity as data patterns vary from the traditional bell curve. The normal distribution includes the well-known bell curve with only one peak or "mode." Advanced Bayesian classifiers can work with data that is somewhat multi-modal because their inclusion of multiple weighted kernels in mixtures can have the effect of aligning kernels with pockets of normality in the data. Nevertheless, when the number of modes becomes high, the known techniques for separating data meaningfully into kernels fail because of well-identified singularities in all known algorithms for separating data into kernels. This is a problem because the modality is high for many of the interesting tasks of data classification or pattern recognition.
It is noted in passing that neural networks are also used in the area of data classification and pattern recognition. Neural networks are so different in structure, operation, and detail from tree classifiers and Bayesian classifiers, and therefore so different from the present invention, however, that neural networks are not considered relevant art with respect to the present invention.
The present invention combines the strengths of binary tree classifiers and Bayesian classifiers in a way that reduces or eliminates their weaknesses to provide a strong solution to the problem of classifying multi-modal data. A tree-based approach is very good at handling datasets that are intrinsically multi-modal. The present invention solves the problem of tree complexity by replacing the classification at the leaves with a Bayesian classifier, which means that the intervening branch nodes are no longer required to contain all the data needed to perform the classification. Some of the burden is transferred to the leaf nodes themselves. In addition, the dataset size of the training data used to create the classification mixtures in the leaf nodes has a minimum size which provides an inherent limit on the number of leaves, further limiting tree complexity.
The present invention solves the problem of poor multi-modal performance in Bayesian classifiers by using the tree structure to address modality by reducing the number of different modes presented to the Bayesian classifiers in the leaf nodes. The data is split at branch nodes during the training phase, and the splitting process tends to group the data in modes. By the time data is presented to a Bayesian classifier in a leaf node, therefore, much of the modality has been removed.
In the present invention, each technique is used for what it does best: the tree structure addresses modality, and the Bayesian classifiers perform accurate classification of data of limited modality. The invention combines the strengths of the techniques of the prior art in a novel, inventive fashion that eliminates or greatly reduces their weaknesses. This is particularly important in the example of classifying engine firing conditions because determining engine conditions at medium speeds or low speeds is a very different problem from determining conditions at high speeds: engine data are highly multi-modal.
The present invention relates to a method for machine learning, and the preferred embodiment as described involves detecting engine misfires. An engine misfire results in a loss of energy. The feature vector selected for this example therefore needs to include energy changes. The first energy difference should correlate very strongly with the presence or absence of an engine misfire. Detection of misfires is a problem at high engine speeds due to torsional vibrations of the crankshaft and lags in the sampling process. The present invention introduces the analysis of higher order differences to aid in the diagnosis under such conditions and to attain high accuracy using the method of the invention.
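As an illustration only, the following Python sketch shows one way a feature vector of first and higher order energy differences could be assembled. The function name `difference_features`, the `max_order` parameter, and the sample energy trace are assumptions made for this example; the text above does not specify how the energy values are obtained or how many difference orders are used.

```python
import numpy as np

def difference_features(energy, max_order=3):
    """Build one feature vector per firing from energy differences of order 1..max_order.

    `energy` is assumed to be a 1-D sequence of per-firing energy values.
    """
    energy = np.asarray(energy, dtype=float)
    rows = len(energy) - max_order
    if rows <= 0:
        raise ValueError("need more samples than max_order")
    # np.diff(energy, n=k) has len(energy) - k entries. Keeping the last `rows`
    # entries of each order aligns every column on the same most recent sample.
    columns = [np.diff(energy, n=k)[-rows:] for k in range(1, max_order + 1)]
    return np.column_stack(columns)

# Hypothetical energy trace with a dip at index 4 standing in for a misfire.
energy = [10.0, 10.0, 10.0, 10.0, 7.0, 10.0, 10.0]
features = difference_features(energy, max_order=2)
```

In this toy trace the row for the dip has a first difference of -3, which is the kind of strong first-difference signal the description expects for a misfire; the second-order column is what the higher order analysis would add.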
In a classifier tree structure, as in any binary tree, nodes are either branch nodes or leaf nodes. Each branch node has a left and a right subtree. A branch node hyperplane is associated with each branch node. The branch node hyperplane is represented by two n-vectors: a point on the plane and the normal to the plane, where n is the dimensionality of the feature vector. A leaf node classifier is associated with each leaf node. A Bayesian classifier is described in terms of statistical "mixtures" for each of the classes. There are two classes and therefore two mixtures in the leaf nodes for the engine problem: nominal engine firings and misfires. Each mixture contains one or more multi-variate Gaussian probability distributions, or kernels.
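The node layout just described can be sketched with Python dataclasses. The class and field names (`Kernel`, `LeafNode`, `BranchNode`, `point`, `normal`, and so on) are hypothetical and chosen for this illustration; they are not taken from the source code in the Microfiche Appendix.

```python
from dataclasses import dataclass
from typing import List, Union

import numpy as np

@dataclass
class Kernel:
    weight: float            # statistical weight of the kernel within its mixture
    mean: np.ndarray         # n-vector, center of the Gaussian
    covariance: np.ndarray   # n x n covariance matrix

@dataclass
class LeafNode:
    nominal: List[Kernel]    # mixture for nominal engine firings
    misfire: List[Kernel]    # mixture for misfires

@dataclass
class BranchNode:
    point: np.ndarray        # a point on the branch node hyperplane (n-vector)
    normal: np.ndarray       # normal to the branch node hyperplane (n-vector)
    left: "Node"
    right: "Node"

Node = Union[BranchNode, LeafNode]

# A two-dimensional toy tree: one branch node with two leaves.
unit = Kernel(weight=1.0, mean=np.zeros(2), covariance=np.eye(2))
tree = BranchNode(
    point=np.array([2.0, 0.0]),
    normal=np.array([1.0, 0.0]),
    left=LeafNode(nominal=[unit], misfire=[unit]),
    right=LeafNode(nominal=[unit], misfire=[unit]),
)
```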
At the beginning of the training component, the root of the tree, the first branch node, is constructed using the entire set of training data. Each branch node splits the training data for that branch node into two subsets. In the invention, two independent subtrees are then constructed, one for each subset.
For the node construction process, a hypothesis is made that the node to be constructed is a leaf node. Given this hypothesis, a leaf node classifier is constructed for all of the training data in the node.
Next, the classifier just constructed is applied to the applicable training data to form a "confusion matrix". The confusion matrix comprises four numbers: n00, the number of correct identifications of nominal firings; n11, the number of correct identifications of misfires; n01, the number of incorrect identifications of nominal firings as misfires; and n10, the number of incorrect identifications of misfires as nominal firings.
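The counting step can be sketched as follows. The label encoding (0 for a nominal firing, 1 for a misfire) and the function name `confusion_matrix` are assumptions for this example; the first index of each count is the true class and the second is the class assigned by the classifier, matching the definitions of n01 and n10 above.

```python
def confusion_matrix(actual, predicted):
    """Return (n00, n01, n10, n11) for labels 0 = nominal firing, 1 = misfire."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1  # key is (true class, assigned class)
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]

# Hypothetical labels for six firings.
actual = [0, 0, 0, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 0]
n00, n01, n10, n11 = confusion_matrix(actual, predicted)
```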
Then the four numbers in the confusion matrix are used to form a single number. In the preferred embodiment, the method of combining the four numbers into a single number is the calculation of Fisher's kappa statistic:
n0x = n00 + n01
n1x = n10 + n11
nx0 = n00 + n10
nx1 = n01 + n11
n = n0x + n1x
nd = n00 + n11
$$\kappa = \frac{n \cdot n_d - \left(n_{0x} \cdot n_{x0} + n_{1x} \cdot n_{x1}\right)}{n^2 - \left(n_{0x} \cdot n_{x0} + n_{1x} \cdot n_{x1}\right)}$$
Finally, the kappa is compared with a kappa threshold. If the result exceeds the threshold, then the hypothesis is true, the node is a leaf node, and the method terminates. If the result does not exceed the threshold, then the node must be a branch node.
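A minimal sketch of the leaf test follows, assuming the usual chance-agreement term n0x·nx0 + n1x·nx1 in both the numerator and the denominator of kappa. The function names, the threshold value of 0.7 in the usage line, and the handling of the degenerate zero-denominator case are assumptions for this example, not values taken from the preferred embodiment.

```python
def kappa(n00, n01, n10, n11):
    """Kappa statistic computed from the four confusion matrix counts."""
    n0x = n00 + n01   # true nominal firings
    n1x = n10 + n11   # true misfires
    nx0 = n00 + n10   # firings classified as nominal
    nx1 = n01 + n11   # firings classified as misfires
    n = n0x + n1x     # all firings
    nd = n00 + n11    # correct identifications (the diagonal)
    chance = n0x * nx0 + n1x * nx1
    denominator = n * n - chance
    if denominator == 0:
        # Assumption: every firing has the same true class and the same assigned
        # class. Treat this as perfect agreement if they match, none otherwise.
        return 1.0 if nd == n else 0.0
    return (n * nd - chance) / denominator

def is_leaf(counts, kappa_threshold):
    """Accept the leaf hypothesis when kappa exceeds the threshold."""
    return kappa(*counts) > kappa_threshold

# 90 of 100 hypothetical firings classified correctly, evenly split by class.
value = kappa(45, 5, 5, 45)
leaf = is_leaf((45, 5, 5, 45), kappa_threshold=0.7)
```

With these counts kappa evaluates to 0.8, so the node would be accepted as a leaf under the assumed threshold.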
A branch node is described by a branch node hyperplane. To build this branch node hyperplane, the core of the expectation-maximization (EM) method is used to create a two kernel mixture that describes all of the data (nominal and misfire) in the node. The boundary between the two kernels is a hyperellipse. Using this hyperellipse directly as a splitting mechanism would (1) be computationally inefficient, and (2) run into problems at points far from the center of the hyperellipse. Instead, in the present invention, the two kernels constructed by the EM method are used as an aid in building a computationally efficient splitting hyperplane. The point on the branch node hyperplane is deemed to be the point along the line connecting the centers of the two kernels that minimizes the sum of the Mahalanobis distances between the point and each kernel. The hyperplane normal is the normal to the constant Mahalanobis distance hyperellipse that passes through this point. The training data for the branch node is split into two subsets depending on which side of the branch node hyperplane each element of the training data resides.
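The construction of the splitting hyperplane from two kernels can be sketched as below, under two assumptions that the text leaves open. First, the quantity minimized along the line between the centers is taken to be the sum of squared Mahalanobis distances, which has a closed-form minimizer. Second, the normal is taken from the constant-distance hyperellipse of the first kernel. The function names and the sign convention for the split are likewise illustrative.

```python
import numpy as np

def splitting_hyperplane(mean_a, cov_a, mean_b, cov_b):
    """Return (point, unit normal) of a splitting hyperplane between two kernels."""
    d = mean_b - mean_a
    inv_a = np.linalg.inv(cov_a)
    inv_b = np.linalg.inv(cov_b)
    qa = d @ inv_a @ d   # squared Mahalanobis length of d under kernel a
    qb = d @ inv_b @ d   # squared Mahalanobis length of d under kernel b
    # On the line p(t) = mean_a + t*d the squared distances are t^2*qa and
    # (1-t)^2*qb; their sum is minimized at t = qb / (qa + qb).
    t = qb / (qa + qb)
    point = mean_a + t * d
    # Gradient direction of kernel a's squared Mahalanobis distance at `point`,
    # which is normal to kernel a's constant-distance hyperellipse there.
    normal = inv_a @ (point - mean_a)
    return point, normal / np.linalg.norm(normal)

def split(data, point, normal):
    """Split rows of `data` by the sign of normal . (point - x)."""
    side = (point - data) @ normal
    return data[side < 0], data[side >= 0]   # (left subset, right subset)

# Two hypothetical unit-covariance kernels centered at (0, 0) and (4, 0).
point, normal = splitting_hyperplane(
    np.array([0.0, 0.0]), np.eye(2), np.array([4.0, 0.0]), np.eye(2)
)
data = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [5.0, -1.0]])
left, right = split(data, point, normal)
```

For two kernels of equal shape the point falls midway between the centers, and the hyperplane is perpendicular to the line connecting them.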
The end result of building a leaf node classifier is a pair of "mixtures," one for the nominal data and another for the misfire data. Using the EM algorithm to form each mixture, four types of data can be used:
(1) Observable data: the feature vectors
(2) Model parameters: the Gaussian kernel descriptions
(3) Unobservable data: fuzzy membership of the feature vectors in the Gaussian kernels, and
(4) Combined likelihoods of all of the feature vectors with respect to the mixture.
The method of the invention maximizes the last of these, the combined likelihoods, by repeatedly rebuilding the model parameters based on estimates of the unobservable data.
To constrain the method, limits are placed on:
(a) the number of kernels in the mixture, which is restricted to some range (e.g., 2 to 4 kernels)
(b) the number of attempts at building a mixture (e.g., 20 attempts), and
(c) the number of iterations inside the EM method itself.
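The following is a compact sketch of a textbook EM loop for a Gaussian mixture with the three limits above exposed as parameters. It is not the code of the preferred embodiment: the random initialization, the small `ridge` term added to each covariance, the convergence tolerance, and the rule of keeping the attempt with the highest log likelihood are all assumptions made for this example.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of each row of x under one Gaussian kernel."""
    n = x.shape[1]
    diff = x - mean
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + n * np.log(2.0 * np.pi))

def em_once(x, k, rng, max_iter=50, tol=1e-6, ridge=1e-6):
    """One attempt at fitting a k-kernel mixture; max_iter is limit (c)."""
    m, n = x.shape
    means = x[rng.choice(m, size=k, replace=False)]
    covs = np.array([np.atleast_2d(np.cov(x.T)) + ridge * np.eye(n)] * k)
    weights = np.full(k, 1.0 / k)
    previous = -np.inf
    for _ in range(max_iter):
        # E-step: fuzzy membership of every feature vector in every kernel.
        log_p = np.stack(
            [np.log(weights[j]) + log_gaussian(x, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        log_mix = np.logaddexp.reduce(log_p, axis=1)
        total = log_mix.sum()   # combined log likelihood before this update
        resp = np.exp(log_p - log_mix[:, None])
        # M-step: rebuild the kernel descriptions from the memberships.
        mass = resp.sum(axis=0) + 1e-10   # floor guards against empty kernels
        weights = mass / mass.sum()
        means = (resp.T @ x) / mass[:, None]
        for j in range(k):
            diff = x - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / mass[j] + ridge * np.eye(n)
        if total - previous < tol:
            break
        previous = total
    return total, weights, means, covs

def build_mixture(x, kernel_range=(2, 4), attempts=20, seed=0):
    """Limits (a) and (b): try several kernel counts, keep the best attempt."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(attempts):
        k = int(rng.integers(kernel_range[0], kernel_range[1] + 1))
        result = em_once(x, k, rng)
        if best is None or result[0] > best[0]:
            best = result
    return best

# Hypothetical two-cluster data in two dimensions.
data_rng = np.random.default_rng(1)
x = np.vstack([data_rng.normal(0.0, 1.0, (100, 2)), data_rng.normal(6.0, 1.0, (100, 2))])
log_likelihood, weights, means, covs = build_mixture(x)
```

Selecting purely on log likelihood tends to favor mixtures with more kernels; a real implementation may well use a different selection rule.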
The tree traversal method in classification operations entails determining on which side of a branch node hyperplane the n-dimensional point to be classified, that is, the feature vector, resides. The point-and-normal representation of the branch node hyperplane makes this sidedness calculation very simple: calculate the inner product of the hyperplane normal with the difference between the hyperplane point and the feature vector to be classified. The algebraic sign of the result indicates on which side of the branch node hyperplane the point to be classified resides. If the result is negative, the tree traversal method takes the left subtree, and the right subtree otherwise. This method is applied repeatedly until a leaf node is reached.
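The traversal can be sketched in a few lines. The `Branch` and `Leaf` classes and the `label` field are hypothetical stand-ins for illustration; in particular, `Leaf.label` takes the place of the leaf node classifier.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Leaf:
    label: str   # stands in for the leaf node classifier

@dataclass
class Branch:
    point: np.ndarray    # a point on the branch node hyperplane
    normal: np.ndarray   # normal to the branch node hyperplane
    left: object
    right: object

def find_leaf(node, x):
    """Descend from `node` to the leaf whose region contains feature vector x."""
    while isinstance(node, Branch):
        # Inner product of the normal with (hyperplane point - feature vector).
        side = np.dot(node.normal, node.point - x)
        node = node.left if side < 0 else node.right
    return node

# Hypothetical one-branch tree splitting on the first coordinate at 2.0.
tree = Branch(
    point=np.array([2.0, 0.0]),
    normal=np.array([1.0, 0.0]),
    left=Leaf("beyond the plane"),
    right=Leaf("before the plane"),
)
reached = find_leaf(tree, np.array([5.0, 1.0]))
```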
The leaf node classifier at the leaf node is used to compute the likelihoods that the feature vector in question represents a nominal engine firing and a misfire. The classification assigned to the point in question is the class with the higher likelihood. Each class (nominal and misfire) has its own independent mixture in a leaf classifier data structure. Each mixture in turn comprises one or more weighted kernels. The log likelihood for each kernel is computed based on the Mahalanobis distance between the point to be classified and the kernel, the size of the kernel, and some normalizing factors. The log likelihood for each mixture is the log of the weighted sum of the kernel likelihoods.
Once the mixture log likelihoods for the two classes have been calculated, the final step of deciding to which class the feature vector to be classified belongs is simple: it is the class with the higher log likelihood value.
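The leaf computation can be sketched as follows, assuming the standard multivariate Gaussian log density: the squared Mahalanobis distance, the log determinant of the covariance as the "size" term, and the usual 2π normalizing factor. The representation of a mixture as a list of `(weight, mean, covariance)` tuples, the function names, and the rule that ties go to the nominal class are assumptions for this example.

```python
import numpy as np

def kernel_log_likelihood(x, mean, cov):
    """Log likelihood of point x under a single Gaussian kernel."""
    n = len(mean)
    diff = x - mean
    maha_sq = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)            # log of the kernel "size"
    return -0.5 * (maha_sq + logdet + n * np.log(2.0 * np.pi))

def mixture_log_likelihood(x, mixture):
    """Log of the weighted sum of kernel likelihoods, computed stably in logs."""
    terms = [np.log(w) + kernel_log_likelihood(x, m, c) for w, m, c in mixture]
    return np.logaddexp.reduce(terms)

def classify(x, nominal, misfire):
    """Return the class whose mixture gives the higher log likelihood."""
    if mixture_log_likelihood(x, misfire) > mixture_log_likelihood(x, nominal):
        return "misfire"
    return "nominal"   # assumption: ties are resolved as nominal

# Hypothetical leaf: one nominal kernel and a two-kernel misfire mixture.
nominal = [(1.0, np.array([0.0, 0.0]), np.eye(2))]
misfire = [
    (0.5, np.array([-3.0, 0.0]), np.eye(2)),
    (0.5, np.array([-3.0, 2.0]), np.eye(2)),
]
result = classify(np.array([-2.8, 0.5]), nominal, misfire)
```

Working in log likelihoods avoids underflow for points far from every kernel, which is one practical reason to compare log values rather than raw likelihoods.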