1. Field of the Invention
The present invention generally relates to the field of automatic pattern classification and, more particularly, to the construction of a classifier of computer files that recognizes computer objects containing, or likely to contain, a computer virus; to the construction of a classifier of documents on a network capable of recognizing documents pertaining to, or likely to pertain to, a selected subject; and to the construction of a classifier capable of recognizing images containing, or likely to contain, a human face at approximately a given range and orientation.
2. Background Description
The classification of data is required in many applications. Among the most successful classifiers are artificial neural networks. See, for example, D. Rumelhart, J. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press (1986). The simplest case of such a neural network is a perceptron, or single-layer neural network, as described by M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press (1969). As referred to herein, a perceptron is defined as having a single, real-valued output, which is a linear combination of its several real-valued inputs. If the perceptron's input generates an output value exceeding some threshold, the input is classified as positive; otherwise, it is classified as negative. The classifier is defined by a choice of the threshold and linear combination. This choice is generally made so as to give the correct classification on all (or as many as possible) of a set of training examples whose classes are known.
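By way of illustration only, the perceptron defined above may be sketched in a few lines of Python. The function names, weights, and threshold below are hypothetical values chosen for the example and do not come from the cited references.

```python
from typing import Sequence


def perceptron_output(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """Return the perceptron's single real-valued output: a linear combination of its inputs."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs))


def classify(weights: Sequence[float], inputs: Sequence[float], threshold: float) -> bool:
    """Return True (positive) if the output exceeds the threshold, else False (negative)."""
    return perceptron_output(weights, inputs) > threshold


# Illustrative (assumed) choice of linear combination and threshold.
weights = [0.5, -1.0, 2.0]
threshold = 1.0

first = classify(weights, [1.0, 0.0, 1.0], threshold)   # output is 0.5 + 2.0 = 2.5
second = classify(weights, [1.0, 1.0, 0.0], threshold)  # output is 0.5 - 1.0 = -0.5
print(first, second)
```

The whole classifier is captured by the pair (weights, threshold); training amounts to choosing that pair.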
Once defined, the classifier may be used to classify new examples. An important issue is to design the classifier in such a way as to minimize the chance that a new example is classified incorrectly. The probability of error may be characterized mathematically, notably by the PAC (Probably Approximately Correct) learning model. See, for example, L. G. Valiant, "A Theory of the Learnable", Communications of the ACM, 27(11):1134-1142 (1984), and M. J. Kearns and U. V. Vazirani, An Introduction to Computational Learning Theory, MIT Press (1994).
One standard approach to training of a perceptron classifier is by an iterative improvement procedure called the perceptron learning algorithm, as described by Minsky and Papert, supra. (The perceptron learning algorithm was generalized to back propagation learning by Rumelhart, McClelland et al., supra, for more complex, multi-layer neural networks.) The algorithm is only guaranteed to converge, in principle, when the training data can be classified perfectly by a perceptron. This is seldom the case. For example, convergence is not possible if there are any errors in the data, if there is not enough information to do perfect classification, or if the classification problem is too complex for the perceptron model.
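The iterative improvement procedure may be sketched as follows. This is a minimal textbook form of the perceptron update rule, not code taken from the cited works; the function name, the `max_epochs` cap, and the toy data sets are assumptions made for the example. The cap exists precisely because convergence is not guaranteed when the data cannot be classified perfectly.

```python
def train_perceptron(examples, labels, max_epochs=100):
    """Perceptron learning: nudge the weights and threshold on every misclassified example.

    examples: list of equal-length numeric tuples; labels: +1 (positive) or -1 (negative).
    Returns (weights, threshold, converged).
    """
    weights = [0.0] * len(examples[0])
    threshold = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(examples, labels):
            output = sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if output > threshold else -1
            if predicted != y:
                # Move the hyperplane toward classifying this example correctly.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                threshold -= y
                mistakes += 1
        if mistakes == 0:
            return weights, threshold, True   # a full pass with no errors
    return weights, threshold, False          # gave up: no perfect classification found


points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Logical AND is perfectly classifiable by a perceptron, so training converges.
and_w, and_t, and_ok = train_perceptron(points, [-1, -1, -1, 1])

# Logical XOR is not, so no pass is ever error-free and the cap is reached.
xor_w, xor_t, xor_ok = train_perceptron(points, [-1, 1, 1, -1])
print(and_ok, xor_ok)
```

The XOR case illustrates the failure mode described above: the classification problem is too complex for the perceptron model, so the procedure never settles.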
Even when convergence is possible, learning can be very slow if there is very little "slack" between positive and negative examples. Slack is small if there are both positive and negative examples that are very close to any hyperplane dividing the two classes; or equivalently, if, for any choice of the linear combination and threshold, the outputs for some positive and negative examples will fall close to the threshold. The issue of convergence rate was addressed by A. Blum, A. Frieze, R. Kannan, and S. Vempala in "A Polynomial-Time Algorithm for Learning Noisy Linear Threshold Functions", Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pp. 330-338, October 1996.
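For one particular choice of linear combination and threshold, the slack may be measured as the smallest distance of any correctly classified example from the dividing hyperplane. The sketch below assumes that reading (the usual geometric margin); the function name and the sample data are hypothetical.

```python
import math


def slack(weights, threshold, examples, labels):
    """Smallest signed distance from any example to the hyperplane weights . x = threshold.

    labels are +1 or -1. A negative result means some example is misclassified.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("weights must not all be zero")
    return min(
        y * (sum(w * xi for w, xi in zip(weights, x)) - threshold) / norm
        for x, y in zip(examples, labels)
    )


# Assumed toy data: the third example lies very close to the hyperplane x[0] = 0.
examples = [(2.0, 0.0), (-1.0, 5.0), (0.1, 3.0)]
labels = [1, -1, 1]
value = slack([1.0, 0.0], 0.0, examples, labels)
print(value)
```

The slack referred to in the text is the largest such value over all choices of weights and threshold; when even that best value is small, learning tends to be slow.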
A second difficulty, also addressed by Blum, Frieze, Kannan, and Vempala, Ibid., is how to construct a classifier when the training examples themselves are not perfectly separable by a hyperplane.
A third difficulty is that of over-training. If the number of input variables, or features, is comparable to, or larger than, the number of training examples, then the generalization ability of a classifier is likely to be poor: the error rate on new examples will be very high. A rule of thumb is that the error rate increases with the ratio of the number of features to the number of training examples.
It is therefore an object of the invention to provide a method of efficiently constructing classifiers in cases where the number of features is much larger than the number of training examples. In particular, it is an object of the invention, in such cases, to construct a classifier which makes use of a small subset of the full set of features.
It is another object of the invention to provide a method of efficiently constructing classifiers in cases where it is critical to avoid over-training.
It is yet another object of the invention to provide a method of efficiently constructing classifiers in cases where data is not perfectly classifiable by any perceptron.
It is still another object of the invention to provide a method of efficiently constructing classifiers in cases where any two of, or all three of, these conditions pertain.
A further object of the invention is to provide a method of constructing a classifier of computer files that recognizes computer objects containing, or likely to contain, a computer virus.
Another object of the invention is to provide a method of constructing a classifier of computer files that recognizes computer files pertaining to, or likely to pertain to, a selected subject.
The present invention is a method of constructing a classifier of data while tolerating imperfectly classifiable data, learning in time polynomial in the size of the input data, and avoiding over-training. The present invention is suitable for large problems with thousands of examples and tens of thousands of features. Superior results are achieved by the present invention by coupling feature selection with feature weighting, two tasks that are normally done independently of each other. Training is performed by constructing and solving a linear program (LP), integer program (IP), or other optimization problem.
Thus, the present invention is a method of constructing a linear classifier of data. In one preferred embodiment, first a set of real n-vector examples vi (referred to as exemplars) is provided, along with a classification of each as either positive or negative. Each dimension j of the n-dimensional space is a feature of the vector; for exemplar vi, the jth feature's value is vji.
Next, a linear program (LP) or integer program (IP) is constructed that includes a feature variable wj corresponding to each feature j. The feature variables are used to discover a thick hyperplane, that is, a pair of parallel hyperplanes separating positive examples from negative examples.
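By way of illustration only, one linear program that yields a thick hyperplane is given below. It is an assumed formulation, not necessarily the one used in the preferred embodiment. Here $v^{i}_{j}$ denotes the value of feature j in exemplar i, $w_j$ is the feature variable for feature j, and the threshold $\theta$, the half-thickness $\delta$, and the bounds on $w_j$ are symbols introduced for this sketch.

```latex
\begin{aligned}
\text{maximize}\quad & \delta \\
\text{subject to}\quad
  & \sum_{j=1}^{n} w_j\, v^{i}_{j} - \theta \;\ge\; \delta
      && \text{for each positive exemplar } i, \\
  & \sum_{j=1}^{n} w_j\, v^{i}_{j} - \theta \;\le\; -\delta
      && \text{for each negative exemplar } i, \\
  & -1 \;\le\; w_j \;\le\; 1
      && \text{for each feature } j .
\end{aligned}
```

Under these assumptions the two parallel hyperplanes are $\sum_j w_j x_j = \theta + \delta$ and $\sum_j w_j x_j = \theta - \delta$. The bounds on $w_j$ keep the program bounded, and an optimum with $\delta > 0$ indicates that the exemplars are perfectly separable.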
In a second preferred embodiment, based on a classification of each feature as either excitatory or inhibitory, the first LP or IP is modified to produce a second LP or IP. The purpose of the second LP or IP is to eliminate extraneous features. In a third preferred embodiment, exception variables are incorporated into the second LP or IP to produce a third LP, IP or mixed integer-linear program. The third LP or IP permits construction of a good classifier for noisy training data where a perfect classifier may not exist.
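The roles of the feature classification and of the exception variables may be illustrated by the following assumed formulation, which is a sketch and not necessarily the program of the preferred embodiments. Here $v^{i}_{j}$ is the value of feature j in exemplar i, $w_j$ is the feature variable, $s_j$ is $+1$ for an excitatory feature and $-1$ for an inhibitory one, $\epsilon_i$ is the exception variable for exemplar i, and the threshold $\theta$ and the trade-off constant $C > 0$ are symbols introduced for this sketch.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{n} s_j\, w_j \;+\; C \sum_{i=1}^{m} \epsilon_i \\
\text{subject to}\quad
  & \sum_{j=1}^{n} w_j\, v^{i}_{j} - \theta \;\ge\; 1 - \epsilon_i
      && \text{for each positive exemplar } i, \\
  & \sum_{j=1}^{n} w_j\, v^{i}_{j} - \theta \;\le\; -1 + \epsilon_i
      && \text{for each negative exemplar } i, \\
  & s_j\, w_j \;\ge\; 0
      && \text{for each feature } j, \\
  & \epsilon_i \;\ge\; 0
      && \text{for each exemplar } i .
\end{aligned}
```

Because the sign of each $w_j$ is fixed in advance, $s_j w_j$ equals $|w_j|$ and the objective stays linear. Penalizing the total weight tends to drive extraneous feature variables to zero, while a positive $\epsilon_i$ lets a noisy exemplar fall on the wrong side of its hyperplane at a cost.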
The second LP or IP may be constructed directly from the input examples and their classifications, without explicitly constructing the first. Likewise, the third LP or IP may be constructed directly, without explicitly constructing the first or second.
An LP or IP technique or other optimization algorithm may then be used to solve the constructed program, either exactly or approximately. A solution's exception variables are disregarded. Optionally, feature variables with small values in the solution are set to zero. Finally, the remaining non-zero feature variables define the linear classifier.
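The post-processing of a solution may be sketched as follows. The layout of the `solution` dictionary, the `epsilon` cut-off, and the numeric values are hypothetical; an actual solver would return its results in its own format.

```python
def build_classifier(solution, epsilon=1e-6):
    """Turn a solved program into a linear classifier.

    Exception variables are disregarded, feature variables whose magnitude is at
    most epsilon are set to zero (dropped), and the rest define the classifier.
    """
    kept = {j: w for j, w in solution["features"].items() if abs(w) > epsilon}
    threshold = solution["threshold"]

    def classify(example):
        # Only the surviving features contribute to the output.
        return sum(w * example[j] for j, w in kept.items()) > threshold

    return kept, classify


# Assumed solver output: four feature variables, two exception variables.
solution = {
    "features": {0: 0.8, 1: 1e-9, 2: -0.5, 3: 0.0},
    "exceptions": {0: 0.0, 1: 0.3},   # present in the solution but never read
    "threshold": 0.2,
}

kept, classify = build_classifier(solution)
print(sorted(kept))
print(classify([1, 1, 0, 1]), classify([0, 1, 1, 1]))
```

In this example only features 0 and 2 survive, so the resulting classifier inspects two of the four inputs.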
The classifier constructed according to the invention may be separately incorporated into an application program. For example, a detector of computer viruses may incorporate a classifier according to the method of the invention.