Decision trees and other heuristics are commonly used as predictive models that map observations about an item to conclusions about the item's target value. For example, a security-software vendor may use decision trees as predictive models for identifying malicious computer files (“malware”) based on attributes, characteristics, and/or behaviors of the files.
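By way of illustration only, the mapping from observations to a conclusion may be sketched as a small hand-written decision tree. The attribute names, the branch order, and the verdict labels below are hypothetical and are not drawn from any actual malware-detection product; a deployed tree would typically be learned from data rather than written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileObservation:
    """Observed attributes of a file (all field names are hypothetical)."""
    is_digitally_signed: bool
    writes_to_system_directory: bool
    opens_network_connection: bool


def classify(obs: FileObservation) -> str:
    """Walk a small hand-written decision tree and return a verdict."""
    if obs.writes_to_system_directory:
        # Branch 1: the file modifies a system location.
        if obs.is_digitally_signed:
            return "legitimate"
        return "malicious"
    # Branch 2: the file leaves system locations alone.
    if obs.opens_network_connection and not obs.is_digitally_signed:
        return "malicious"
    return "legitimate"


unsigned_dropper = FileObservation(
    is_digitally_signed=False,
    writes_to_system_directory=True,
    opens_network_connection=False,
)
signed_installer = FileObservation(
    is_digitally_signed=True,
    writes_to_system_directory=True,
    opens_network_connection=True,
)
print(classify(unsigned_dropper))   # malicious
print(classify(signed_installer))   # legitimate
```

Each internal node tests one observation and each leaf yields the conclusion (here, the target value "malicious" or "legitimate").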
Decision trees and other heuristics may be trained and refined using a corpus of known samples. For example, a security-software vendor may train a malware-detection heuristic by applying the heuristic to a corpus of samples containing known-malicious files and known-legitimate files.
The accuracy of a heuristic is often limited by the size of the corpus of samples used to train the heuristic. As such, heuristics commonly generate false negatives and/or false positives upon being deployed and used in the real world. In order to improve the accuracy of a heuristic, heuristic providers typically: 1) add the misclassified samples to the corpus of samples used to train the heuristic, 2) re-train the heuristic using the modified corpus of samples, and then 3) redeploy the re-trained heuristic.
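The three-step cycle described above may be sketched as follows. This is a minimal sketch under stated assumptions: the `train` function is a toy single-behavior learner standing in for whatever machine-learning technique a provider actually uses, and the behavior names, the sample corpus, and the `retrain_cycle` helper are all hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

# A sample pairs the observed behaviors with a label (True = malicious).
Sample = Tuple[FrozenSet[str], bool]
Heuristic = Callable[[FrozenSet[str]], bool]


def train(corpus: List[Sample]) -> Heuristic:
    """Toy trainer (an assumption): flag files that exhibit the single
    behavior whose presence agrees with the labels most often."""
    behaviors = sorted({b for observed, _ in corpus for b in observed})

    def agreement(behavior: str) -> int:
        return sum((behavior in observed) == label for observed, label in corpus)

    best = max(behaviors, key=agreement)
    return lambda observed: best in observed


def retrain_cycle(
    corpus: List[Sample], field_samples: List[Sample], deployed: Heuristic
) -> Tuple[List[Sample], Heuristic]:
    # 1) Add the samples the deployed heuristic misclassified to the corpus.
    misclassified = [(obs, label) for obs, label in field_samples
                     if deployed(obs) != label]
    updated_corpus = corpus + misclassified
    # 2) Re-train the heuristic using the modified corpus.
    retrained = train(updated_corpus)
    # 3) Return the re-trained heuristic for redeployment.
    return updated_corpus, retrained


corpus: List[Sample] = [
    (frozenset({"writes_system_dir", "opens_socket"}), True),
    (frozenset({"writes_system_dir"}), True),
    (frozenset({"opens_socket"}), False),
    (frozenset({"shows_ui"}), False),
]
deployed = train(corpus)

# A legitimate file reported from the field (hypothetical).
field_samples: List[Sample] = [
    (frozenset({"writes_system_dir", "shows_ui"}), False),
]
updated_corpus, retrained = retrain_cycle(corpus, field_samples, deployed)
print(len(corpus), len(updated_corpus))
```

Only the samples that the deployed heuristic got wrong are appended, so the corpus grows by the number of field misclassifications on each cycle.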
Unfortunately, many of the machine-learning techniques used to create and train heuristics tolerate a certain degree of error. For example, a malware-detection heuristic with a false-positive rate of less than 0.01% may be deemed acceptable. Thus, even if a heuristic is re-trained using a corpus of samples that includes misclassified samples gathered from the field, there is no guarantee that the re-trained heuristic will exclude the precise combination of behaviors that resulted in the misclassifications that the heuristic provider hoped to avoid by re-training the heuristic. To address this problem, heuristic providers may attempt to modify the underlying algorithms or formulas used to create or train the heuristic, which may be a prohibitively costly and/or lengthy undertaking. As such, the instant disclosure identifies a need for systems and methods for quickly and effectively reducing the number of false positives generated by heuristics.
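The effect of this error tolerance may be illustrated with a short arithmetic sketch. The figures below are assumptions chosen for illustration: a corpus of 20,000 known-legitimate samples, a hypothetical re-trained heuristic that flags any file exhibiting two particular behaviors, and a single field-reported false positive that has been added to the corpus. Only the 0.01% acceptance threshold comes from the example above.

```python
from typing import Callable, FrozenSet, List

ACCEPTABLE_FALSE_POSITIVE_RATE = 0.0001  # 0.01%, expressed as a fraction


def false_positive_rate(
    heuristic: Callable[[FrozenSet[str]], bool],
    legitimate_samples: List[FrozenSet[str]],
) -> float:
    """Fraction of known-legitimate samples that the heuristic flags."""
    flagged = sum(1 for sample in legitimate_samples if heuristic(sample))
    return flagged / len(legitimate_samples)


def retrained_heuristic(behaviors: FrozenSet[str]) -> bool:
    """Hypothetical re-trained heuristic: flag a file that exhibits both
    behaviors below (behavior names are assumptions)."""
    return {"injects_code", "modifies_registry"} <= behaviors


# The legitimate file that was misclassified in the field (hypothetical).
field_false_positive = frozenset({"injects_code", "modifies_registry", "shows_ui"})

# 19,999 unremarkable legitimate samples plus the field-reported one.
legitimate_corpus = [frozenset({"shows_ui"})] * 19_999 + [field_false_positive]

rate = false_positive_rate(retrained_heuristic, legitimate_corpus)
deemed_acceptable = rate < ACCEPTABLE_FALSE_POSITIVE_RATE
still_flagged = retrained_heuristic(field_false_positive)

print(f"false-positive rate: {rate:.4%}")       # 0.0050%
print(f"deemed acceptable:   {deemed_acceptable}")  # True
print(f"still flagged:       {still_flagged}")      # True
```

Under these assumptions the re-trained heuristic passes the acceptance test, because one false positive in 20,000 samples is 0.005%, yet the very sample that prompted the re-training is still flagged.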