Data classification problems are commonly encountered in the technical arts. Examples include determining if tumors are malignant or benign, deciding if an article of manufacture is within tolerance or not, establishing the degree to which a combination of medical tests predicts a disease, classifying the content of an image, determining the relevance or irrelevance of information, and so on.
Given examples of classes or categories, each associated with multiple attributes or properties, the task is to determine regions of attribute space that define the classes. This makes it possible to subsequently categorize newly acquired data into classes based on the values of the attributes or properties of the data when the class membership is not known in advance.
Several aspects of the data affect the ease of classification: the number of classes contained in the data; the number of attributes for each datum, i.e., the dimensionality of the property space; and the nature of the distribution of classes within the property space. Many methods of classification are available. A number of the most useful are reviewed and compared in “A Comparison of Prediction Accuracy, Complexity and Training Time of Thirty-three Old and New Classification Algorithms”, T.-S. Lim, W.-Y. Loh and Y.-S. Shih, Machine Learning, v. 40, p. 203-229, 2000.
Four important characteristics of a classifier are the accuracy of classification, the training time required to achieve classification, how that training time scales with the number of classes and the dimensionality of the property space describing the data, and how consistent or robust the performance of the classifier is across different data sets.
One well-established method of classification is linear Fisher discriminant analysis, which is notable for an especially favorable combination of good classification accuracy coupled with consistency across different data sets and a low training time. The last is especially important where classification must occur in real-time or nearly so.
Fisher discriminant analysis defines directions in property space along which simultaneously the between-class variance is maximized and the within-class variance is minimized. In other words, directions in property space are sought which separate the class centers as widely as possible while simultaneously representing each class as compactly as possible. When there are two classes there is a single discriminant direction.
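By way of illustration only, the two-class case can be sketched in Python with NumPy. The sketch assumes the standard closed-form result that, for two classes, the discriminant direction is proportional to the inverse of the within-class scatter matrix applied to the difference of the class means. The function names `within_scatter`, `fisher_direction` and `fisher_criterion`, and the synthetic Gaussian data, are hypothetical and chosen for this example.

```python
import numpy as np

def within_scatter(X1, X2):
    """Summed scatter of each class about its own mean."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    return (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

def fisher_direction(X1, X2):
    """Unit vector along the two-class Fisher discriminant direction.

    X1, X2: arrays of shape (n_samples, n_properties), one per class.
    """
    Sw = within_scatter(X1, X2)
    w = np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """Ratio of between-class to within-class spread along direction w."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    return (w @ diff) ** 2 / (w @ within_scatter(X1, X2) @ w)

# Synthetic data (an assumption of this sketch): two elongated Gaussian classes.
rng = np.random.default_rng(0)
cov = [[3.0, 2.5], [2.5, 3.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=500)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=500)

w = fisher_direction(X1, X2)
print("discriminant direction:", w)
print("criterion along w:", fisher_criterion(w, X1, X2))
```

In this sketch the criterion evaluated along `w` is never smaller than the criterion evaluated along the line joining the class means, which is the sense in which the direction separates the class centers while keeping each class compact.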
Depending on the dimensionality of the property space, a line, plane or hyperplane constructed normal to this direction may be used to separate the data into classes. The choice of the location of the plane (or its equivalent) along the discriminant coordinate depends on the classification task. For example, the location may be chosen to provide an equal error for classification of both classes. As another example, the location may be chosen to maximize the probability that all instances of a given class are correctly detected without regard to false positive identification of the remaining class. When there are more than two classes Fisher discriminant analysis provides a family of discriminant direction vectors, one fewer in number than the number of classes. Planes can be positioned along these vectors to pairwise separate classes.
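The two placement strategies mentioned above can be sketched as follows, assuming the data have already been projected onto a single discriminant coordinate and that larger values indicate the class to be detected. The helper names `equal_error_threshold` and `detect_all_threshold`, and the synthetic projected values, are hypothetical.

```python
import numpy as np

def equal_error_threshold(s_neg, s_pos):
    """Threshold at which the two error rates are closest to equal.

    Values greater than the threshold are assigned to the positive class.
    """
    best_t, best_gap = None, np.inf
    for t in np.sort(np.concatenate([s_neg, s_pos])):
        miss = np.mean(s_pos <= t)        # positives assigned to the negative class
        false_alarm = np.mean(s_neg > t)  # negatives assigned to the positive class
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

def detect_all_threshold(s_pos):
    """Threshold just below the smallest positive training value, so that
    every positive training instance is detected regardless of false positives."""
    return np.nextafter(s_pos.min(), -np.inf)

# Synthetic projected values (an assumption of this sketch).
rng = np.random.default_rng(0)
s_neg = rng.normal(0.0, 1.0, size=400)
s_pos = rng.normal(3.0, 1.0, size=400)

t_eq = equal_error_threshold(s_neg, s_pos)
t_all = detect_all_threshold(s_pos)
print("equal-error threshold:", t_eq)
print("detect-all threshold:", t_all)
print("false positive rate at detect-all threshold:", np.mean(s_neg > t_all))
```

The second strategy guarantees detection only for the training instances used to set the threshold; newly acquired data may still fall on the wrong side.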
A concept related to Fisher discriminant analysis is principal component analysis, otherwise known as the Karhunen-Loeve transform. Its purpose is to transform the coordinates of a multi-dimensional property space so as to maximize the variance of the data along one of the new coordinates, which is the principal component. Unlike Fisher discriminant analysis, the objective is to determine a direction that maximizes the overall variance of the data without regard to the variance within classes. The transform is a rotation: the initial orthogonal property vectors become orthogonal principal component vectors. Discriminant vectors, in contrast, are not in general orthogonal, since their directions are determined by the distribution of class properties. Thus, the vectors defining, on the one hand, the discriminant directions and, on the other, the principal component directions are in general distinct and non-coincident.
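The distinction can be illustrated with a short NumPy sketch under assumed synthetic data: two elongated Gaussian classes whose means are offset across the long axis. The principal components are taken as eigenvectors of the covariance of all data with class labels ignored, while the two-class discriminant direction is computed from the within-class scatter. All variable names and data values are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption of this sketch).
rng = np.random.default_rng(0)
cov = [[3.0, 2.5], [2.5, 3.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=500)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=500)
X = np.vstack([X1, X2])

# Principal components: eigenvectors of the covariance of all data.
# eigh returns eigenvalues in ascending order, so reverse the columns.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pcs = eigvecs[:, ::-1]
pc1 = pcs[:, 0]  # direction of maximum overall variance

# Two-class Fisher discriminant direction from the within-class scatter.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(Sw, m1 - m2)
w /= np.linalg.norm(w)

cos_angle = abs(pc1 @ w)
print("first principal component:", pc1)
print("discriminant direction:   ", w)
print("|cosine| of angle between them:", cos_angle)
```

For data of this kind the first principal component follows the long axis of the pooled data, whereas the discriminant direction lies closer to the short axis along which the classes actually separate, so the two directions do not coincide.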
Underlying the linear Fisher discriminant analysis is the idea that classes within the data have properties that are normally distributed, i.e., each property has a Gaussian distribution about a mean value. To the extent that the actual property distributions of the data violate this assumption, the performance of this classifier degrades. That is especially the case when the distributions are multi-modal, i.e., when a given class is represented by multiple groups or clusters of properties that are well-separated within property space and interspersed with similar clusters representing other classes.
An alternative view of this problem is that a plane (or its equivalent) positioned normal to a discriminant direction is an insufficiently flexible entity to describe the boundary between modes or clusters within property space. The difficulty is readily appreciated with a simple example. Given two classes in a two dimensional property plane, if the class distributions lie on a line such that a single property distribution for class 1 is flanked on either side by distributions for class 2, no single straight line will completely separate the two sets of property distributions.
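The example can be checked numerically with a minimal sketch. Because all clusters lie on one line, any straight line drawn in the plane crosses that line at no more than one point (unless it runs along it), so it is enough to test every single threshold on the coordinate along the line, with either class assigned to either side. The cluster positions, spreads and sample counts below are assumed values chosen for illustration, and `best_single_threshold_accuracy` is a hypothetical helper.

```python
import numpy as np

# Coordinate along the line on which all three clusters lie (assumed values).
rng = np.random.default_rng(1)
class1 = rng.normal(0.0, 0.5, size=300)
class2 = np.concatenate([rng.normal(-4.0, 0.5, size=150),
                         rng.normal(4.0, 0.5, size=150)])

x = np.concatenate([class1, class2])
y = np.concatenate([np.full(300, 1), np.full(300, 2)])

def best_single_threshold_accuracy(x, y):
    """Best accuracy over all thresholds and both side-to-class assignments."""
    best = 0.0
    for t in np.sort(x):
        for left, right in ((1, 2), (2, 1)):
            pred = np.where(x <= t, left, right)
            best = max(best, np.mean(pred == y))
    return best

best_acc = best_single_threshold_accuracy(x, y)
print("best accuracy with one linear boundary:", best_acc)
```

In this construction a single boundary can isolate at most one of the two flanking class 2 clusters, so about a quarter of the points remain misclassified however the boundary is placed.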
It is to cope with problems such as this that a wealth of various classifiers has been devised. For example, one technique imposes a classification tree on the data using discriminant analysis to determine the branching at each level of the tree (see “Split Selection Methods for Classification Trees”, W.-Y. Loh and Y.-S. Shih, Statistica Sinica, v. 7, p. 815-840, 1997). However, the optimal estimation of the tree requires considerable extra computation and the method is more than an order of magnitude slower than simple linear Fisher discriminant analysis.
Because very few classifiers combine the speed and accuracy of linear discriminant analysis, there is a need to improve the classification accuracy of this classifier for data with complex multi-modal attribute distributions while maintaining a minimal impact on the classification time. Various exemplary methods, devices, systems, etc., disclosed herein aim to address this need and/or other needs pertaining to classification of data such as image data.