Decision trees and other heuristics are commonly used as predictive models to map observations about an item to conclusions about the item's target value. For example, a security-software vendor may use decision trees as predictive models for identifying malicious computer files (“malware”) based on attributes, characteristics, and/or behaviors of the files.
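By way of illustration only, a decision tree of this kind may be sketched as a series of nested tests on file attributes. The attribute names (`is_signed`, `entropy`, `writes_to_startup`), the 7.0 threshold, and the branch order below are hypothetical and are not drawn from any particular vendor's heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSample:
    """Hypothetical observations about a file (illustrative attributes only)."""
    is_signed: bool          # file carries a valid digital signature
    entropy: float           # byte entropy, 0.0 to 8.0 bits per byte
    writes_to_startup: bool  # file adds itself to a startup location


def classify(sample: FileSample) -> str:
    """Walk a small hand-built decision tree and return the target value."""
    if sample.is_signed:
        return "legitimate"
    # Assumed threshold: very high entropy often indicates packed content.
    if sample.entropy > 7.0:
        return "malicious"
    if sample.writes_to_startup:
        return "malicious"
    return "legitimate"
```

Each internal node tests one observation and each leaf supplies the conclusion, so a file is classified by following a single path from the root to a leaf.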
Decision trees and other heuristics may be trained and refined using a corpus of known samples. For example, a security-software vendor may train a malware-detection heuristic by applying the heuristic to a corpus of samples containing known-malicious files and known-legitimate files.
The accuracy of a heuristic is often limited by the size of the corpus of samples used to train the heuristic. As such, heuristics commonly generate false negatives and/or false positives upon being deployed and used in the real world. In order to improve the accuracy of a heuristic, heuristic providers typically: 1) add the misclassified samples to the corpus of samples used to train the heuristic, 2) re-train the heuristic using the modified corpus of samples, and then 3) redeploy the re-trained heuristic.
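The three-step refinement cycle described above may be sketched as follows. This is a minimal sketch under stated assumptions: the heuristic is reduced to a single-threshold decision stump over one numeric score, the corpus is non-empty, and the names `train_stump`, `misclassified`, and `retrain_cycle` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical sample shape: (numeric feature score, is_malicious label).
Sample = Tuple[float, bool]


def train_stump(corpus: List[Sample]) -> float:
    """Return the threshold that misclassifies the fewest corpus samples.

    A sample is predicted malicious when its score is >= the threshold.
    Assumes a non-empty corpus.
    """
    scores = sorted({score for score, _ in corpus})
    midpoints = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    candidates = [scores[0] - 1.0] + midpoints + [scores[-1] + 1.0]

    def errors(threshold: float) -> int:
        return sum((score >= threshold) != label for score, label in corpus)

    return min(candidates, key=errors)


def predict(threshold: float, score: float) -> bool:
    """Apply the trained stump to one score."""
    return score >= threshold


def misclassified(threshold: float, samples: List[Sample]) -> List[Sample]:
    """Collect field samples whose prediction disagrees with the known label."""
    return [(s, label) for s, label in samples if predict(threshold, s) != label]


def retrain_cycle(corpus: List[Sample],
                  field_samples: List[Sample]) -> Tuple[List[Sample], float]:
    """Run one pass of the cycle and return the updated corpus and threshold."""
    threshold = train_stump(corpus)
    # Step 1: add the misclassified samples to the training corpus.
    updated = corpus + misclassified(threshold, field_samples)
    # Step 2: re-train the heuristic using the modified corpus.
    new_threshold = train_stump(updated)
    # Step 3 (redeployment) would ship new_threshold to the field.
    return updated, new_threshold
```

For example, with a corpus of `[(1.0, False), (2.0, False), (8.0, True), (9.0, True)]` and a field sample `(6.0, False)` that the initial stump flags as malicious, one pass of the cycle adds that false positive to the corpus and moves the threshold above 6.0.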
However, even if a heuristic is re-trained using a corpus of samples that includes misclassified samples gathered from the field, the re-trained heuristic commonly produces new false positives upon being redeployed in the field. Because of this, heuristic providers may have to repeatedly redeploy and retest a heuristic until satisfactory performance is obtained. Unfortunately, identifying misclassified samples, incorporating these misclassified samples into the corpus of samples used to train a heuristic, and then re-training the heuristic may represent a prohibitively costly and/or lengthy undertaking. As such, the instant disclosure identifies a need for systems and methods for quickly and effectively reducing the number of false positives generated by heuristics.