Predictive modeling attempts to predict the most likely outcome for a given starting condition based on a model. Various models can be used in this context, such as the naive Bayes model, the k-nearest neighbor algorithm, and logistic regression. Predictive modeling software technologies can use a ground truth set (i.e., a data set comprising members of a known classification) to train a classifier to automatically classify unknown members of an input data set. For example, where the members of the ground truth set are files known to be either infected with malicious code (malicious files) or uninfected (benign files), predictive modeling can be used to train a classification engine to classify files of unknown status (the input data set) as malicious or benign.
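By way of illustration only, the use of a ground truth set to classify an unknown member can be sketched with a minimal k-nearest neighbor classifier. The feature choices (byte entropy and a count of suspicious imports), the sample values, and the names `knn_classify` and `GROUND_TRUTH` are assumptions made for this sketch and do not describe any particular classification engine.

```python
import math
from collections import Counter


def knn_classify(ground_truth, sample, k=3):
    """Label `sample` by majority vote among its k nearest ground truth members.

    ground_truth: list of (feature_vector, label) pairs of known classification.
    sample: feature vector of a member whose classification is unknown.
    """
    # Sort the known members by Euclidean distance to the unknown sample.
    nearest = sorted(ground_truth, key=lambda member: math.dist(member[0], sample))[:k]
    # The most common label among the k nearest members is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical ground truth set: (byte entropy, suspicious import count) -> label.
GROUND_TRUTH = [
    ((7.8, 12), "malicious"),
    ((7.5, 9), "malicious"),
    ((7.9, 15), "malicious"),
    ((4.2, 1), "benign"),
    ((5.0, 0), "benign"),
    ((4.8, 2), "benign"),
]

# Classify a file of unknown status from its (hypothetical) feature vector.
label = knn_classify(GROUND_TRUTH, (7.6, 11))
```

In this sketch the three nearest ground truth members to the unknown file are all malicious files, so the file is classified as malicious.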
Classification of files as malicious or benign is just one example. Classification based on predictive modeling can also be used in many other contexts, such as classifying stocks as buy, sell or hold based on the predicted likelihood of changes in their value, or classifying customers based on the likelihood of their future purchase of a given product.
As useful as these techniques can be, frequently many members of the unknown set (e.g., files) cannot be automatically classified with enough certainty to be definitively labeled (e.g., as malicious or benign). Thus, in real-world operation, either the classifier must be tuned to err on the side of false negatives or of false positives, neither of which is desirable, or else the classifier cannot automatically make a decision in many instances.
Where the classifier can predict that a given classification is more likely than the other(s) but without a sufficient level of certainty to automatically make the classification, one option is to prompt the user for a second-level confirmation. For example, in the case of a malicious file classifier, users are often prompted to allow or deny an operation involving a file that the classifier has determined might be a security risk. Some users are very good at making correct decisions in such cases, whereas others are less so. The accuracy with which different users make such decisions ranges from wrong more often than not to usually right, with a full progression of gradations in between.
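The trade-off described above can be sketched, under stated assumptions, as a decision rule with two certainty thresholds. The function name `decide`, the threshold values 0.95 and 0.05, and the example probabilities are hypothetical and chosen only for illustration; an actual classifier could derive and apply its certainty differently.

```python
def decide(p_malicious, block_at=0.95, allow_at=0.05):
    """Map a predicted probability of maliciousness to one of three outcomes.

    The thresholds are assumed values for this sketch. Narrowing the gap
    between them reduces user prompts at the cost of more automatic errors
    (false positives or false negatives).
    """
    if p_malicious >= block_at:
        return "malicious"      # certain enough to label automatically
    if p_malicious <= allow_at:
        return "benign"         # certain enough to label automatically
    return "prompt_user"        # not certain enough; ask for confirmation


# Hypothetical classifier outputs for five files of unknown status.
predictions = [0.99, 0.60, 0.02, 0.35, 0.97]
outcomes = [decide(p) for p in predictions]
prompted = outcomes.count("prompt_user")
```

With these assumed values, two of the five files fall between the thresholds and would be referred to the user, whose own accuracy then determines whether the final classification is correct.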
It would be desirable to address these issues, and increase the accuracy of predictive modeling based automatic classification.