Classifying entities based on their features is important in many fields. For example, in the field of computer security, classifying files and websites as malicious or benign, or classifying computing devices as infected or clean, may be vital to protecting personal and sensitive data. Using various machine-learning techniques, training datasets containing examples with known classifications may be used to train classification models to predict the classifications of new observations. To train a classification model, the features of the examples with known classifications may be analyzed to derive a function that predicts a new observation's classification based on its features.
The examples within training datasets often have large numbers of features, many of which may be irrelevant to the classifications of the examples or of new observations. Unfortunately, in many instances, the computational cost of training a classification model may increase polynomially with the number of features used to train it. Additionally, using irrelevant features to train a classification model may result in a model that is overfitted and insufficiently generalized to predict the classifications of new observations. For these and other reasons, selecting the subset of all available features that should be used to train a classification model is generally considered an important step in training most classification models.
Many methods for selecting and ranking the relevance of features exist. Generally, methods for selecting and ranking features proceed in a stepwise fashion and require a linear to quadratic number of evaluations of a classification model. For example, using forward feature selection, the most relevant feature for predicting a classification may be selected from a set of possible features by (1) training a classification model using each possible feature and (2) determining which feature's classification model is most accurate. Then, the second most relevant feature may be selected from the remaining features by (1) training classification models on two-feature combinations of the most relevant feature and each remaining feature and (2) determining which remaining feature's classification model is most accurate. The remaining features may be ranked in a similar manner. Because of the quadratic cost of many feature-selection methods, their use in selecting and ranking more than a small number of features may be prohibitively expensive. The instant disclosure, therefore, identifies and addresses a need for improved systems and methods for selecting features for classification.
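The stepwise forward-selection procedure described above can be sketched as follows. This is a minimal illustrative example, not part of the disclosure: it uses leave-one-out accuracy of a 1-nearest-neighbour classifier as a cheap stand-in for "training a classification model" on a candidate feature subset, and the function names and toy dataset are hypothetical.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature indices. A cheap stand-in for
    training and scoring a classification model on that subset."""
    correct = 0
    for i in range(len(X)):
        nearest_label, nearest_dist = None, float("inf")
        for j in range(len(X)):
            if i == j:
                continue
            # Squared Euclidean distance over the selected features only.
            dist = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if dist < nearest_dist:
                nearest_label, nearest_dist = y[j], dist
        correct += nearest_label == y[i]
    return correct / len(X)

def forward_select(X, y, n_features):
    """Rank all features by forward selection: at each step, (1) train
    one model per remaining candidate feature added to the current
    subset, then (2) keep the candidate whose model is most accurate."""
    selected = []
    remaining = set(range(n_features))
    while remaining:
        best = max(remaining, key=lambda f: loo_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy dataset: feature 0 perfectly separates the classes; feature 1 is noise.
X = [[0, 5], [0, 1], [0, 9], [1, 2], [1, 7], [1, 4]]
y = [0, 0, 0, 1, 1, 1]
print(forward_select(X, y, n_features=2))  # feature 0 is ranked first: [0, 1]
```

Note that ranking all n features this way trains one model per remaining candidate at each step, roughly n²/2 model trainings in total, which illustrates the quadratic cost described above.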