1. Field of the Invention
The present invention relates to techniques for automatically classifying data. More specifically, the present invention relates to a method and apparatus that reduces the size of a training set for a classification application without substantially affecting the resultant decision boundary.
2. Related Art
Automated systems for classification-type pattern recognition applications, such as system fault identification and computer network intrusion (or denial of service attack) detection, operate by dividing input data into more readily processable subsets. More specifically, such data processing techniques typically employ classification-type pattern recognition mechanisms to divide the available data into two (and sometimes more than two) subsets.
Unfortunately, the computational time required to classify input data using pattern recognition techniques, such as k-Nearest Neighbor (kNN) classifiers, Radial Basis Function (RBF) networks, Least-Squares Support Vector Machines (LSSVM), Multivariate State Estimation Techniques (MSET), and other techniques, increases linearly (or quadratically for some techniques) with the number of training patterns. This computational cost limits the applicability of classification-type pattern recognition for online diagnostics and for offline mining of large databases (10,000 patterns or more).
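For illustration only, the linear dependence on the training set size can be seen in a brute-force kNN classifier. The following is a minimal sketch, not taken from any particular system; the function name knn_classify, the (point, label) pair layout, and the default k=3 are assumptions made for this example.

```python
import heapq
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Brute-force kNN over a list of (point, label) pairs (hypothetical layout).

    Every call evaluates one distance per training pattern, so the cost of
    classifying a single input grows linearly with len(training).
    """
    nearest = heapq.nsmallest(
        k, training, key=lambda item: math.dist(query, item[0])
    )
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Halving the number of training patterns roughly halves the number of distance evaluations per classified input, which is the motivation for reducing the training set.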
One technique for reducing the computational time required for input data classification is to reduce the size of the training set. However, training patterns cannot be arbitrarily eliminated. In particular, if the training patterns are not pruned judiciously, bias and inconsistency are introduced in subsequent classification analyses.
The “condensed nearest neighbor” rule is used by some systems to reduce the size of a training set. For example, a system can start with a one-pattern reduced set and sequentially examine the other patterns in the training set, discarding patterns that are correctly classified by the current reduced set and adding patterns that are classified incorrectly to the reduced set. The system then iterates through the discarded patterns until all of the remaining discarded patterns are classified correctly by the reduced training set. Unfortunately, this technique is not “decision-boundary consistent” and does not always find a minimal training set.
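The condensed nearest neighbor procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: patterns are (point, label) pairs, classification uses a single nearest neighbor with Euclidean distance, the first pattern seeds the reduced set, and the names nearest_label and condense are hypothetical.

```python
import math

def nearest_label(point, reduced):
    """Label of the pattern in `reduced` closest to `point` (1-NN rule)."""
    return min(reduced, key=lambda item: math.dist(point, item[0]))[1]

def condense(training):
    """Condensed nearest neighbor rule over a list of (point, label) pairs."""
    reduced = [training[0]]        # start with a one-pattern reduced set
    discarded = list(training[1:])
    changed = True
    while changed:                 # stop when a full pass adds no pattern
        changed = False
        still_discarded = []
        for point, label in discarded:
            if nearest_label(point, reduced) == label:
                still_discarded.append((point, label))  # classified correctly
            else:
                reduced.append((point, label))          # misclassified: keep it
                changed = True
        discarded = still_discarded
    return reduced
```

The result depends on the order in which patterns are examined, which is one reason the reduced set is not guaranteed to be minimal.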
One decision-boundary-consistent technique, called the “Voronoi-editing technique,” uses Voronoi diagrams. A Voronoi diagram partitions the input space into regions that are the loci of points in space closer to each data point than to any other data point. The Voronoi-editing technique maintains exactly the original decision boundary of the nearest neighbor decision rule; however, the reduced set produced by the technique is not minimal. Furthermore, this technique requires O(n^(d/2)) operations, which makes it impractical for dimensions higher than four.
An improvement over the Voronoi-editing technique is the “Gabriel-graph-condensing technique,” which constructs the Gabriel graph (a set of edges joining pairs of points that form the diameter of an empty sphere) of the training set. This technique is significantly faster and only requires O(dn^3) operations. However, the Gabriel-graph-condensing technique does not preserve the decision boundary.
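A brute-force construction of the Gabriel graph illustrates where the cubic cost in the number of patterns comes from: each of the n(n-1)/2 candidate pairs is tested against the remaining points, and each test costs d operations. The sketch below is illustrative only. The retention rule (keep a pattern if it has at least one Gabriel neighbor with a different label) is the rule commonly described for this technique and is an assumption here, as are the function names.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gabriel_neighbors(points, i, j):
    """True if no other point lies strictly inside the sphere whose
    diameter is the segment joining points[i] and points[j]."""
    d_ij = sq_dist(points[i], points[j])
    return all(
        sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) >= d_ij
        for k in range(len(points))
        if k != i and k != j
    )

def gabriel_condense(points, labels):
    """Indices of patterns with a Gabriel neighbor of a different label
    (assumed retention rule; not stated in the text above)."""
    keep = set()
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j] and gabriel_neighbors(points, i, j):
                keep.update((i, j))
    return sorted(keep)
```

The emptiness test relies on the fact that a point r lies strictly inside the sphere with diameter pq exactly when the squared distances from r to p and from r to q sum to less than the squared distance from p to q.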
Another iterative training set reduction technique applies a deletion rule that identifies patterns to be removed, removes the identified patterns, and applies the rule again to the reduced set until no more patterns can be removed. More specifically, the deletion rule can be stated as follows: for each point x, if the number of other points that classify x correctly is greater than the number of points classified by x, then discard point x. Unfortunately, this technique does not preserve the decision boundary and may require excessively long execution times due to its iterative nature.
The above techniques and other ad-hoc techniques suffer from at least one of the following deficiencies: (1) prohibitively long running time (third-order or higher-order polynomial in the number and in the dimension of training patterns); (2) inconsistency of the resultant decision boundary obtained on the reduced set (i.e., the decision boundary differs from the one that would have been obtained with the complete set of training patterns); and (3) suboptimal size for the reduced training set (i.e., there exists a smaller subset that results in the same decision boundary as obtained with the complete set of training patterns).
Hence, what is needed is a method and an apparatus for reducing the size of a training set without the above-described problems.