Decision trees and other heuristics are commonly used as predictive models to map observations about an item to conclusions about the item's target value. For example, a security-software vendor may use decision trees as predictive models for identifying malicious computer files (“malware”) based on attributes, characteristics, and/or behaviors of files.
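By way of illustration only, a decision tree of the kind described above might be sketched as follows. The attribute names (`is_signed`, `modifies_system_files`), the tree layout, and the `predict` function are hypothetical and chosen purely for this example; they are not drawn from any particular vendor's heuristic.

```python
# Hypothetical tree: each internal node tests one observed file attribute,
# and each leaf holds a conclusion (the predicted target value).
TREE = {
    "attribute": "is_signed",
    "yes": {"verdict": "legitimate"},
    "no": {
        "attribute": "modifies_system_files",
        "yes": {"verdict": "malicious"},
        "no": {"verdict": "legitimate"},
    },
}


def predict(node, observations):
    """Walk from the root to a leaf, following the branch each observation selects."""
    while "verdict" not in node:
        branch = "yes" if observations[node["attribute"]] else "no"
        node = node[branch]
    return node["verdict"]


# An unsigned file that modifies system files reaches the "malicious" leaf.
sample = {"is_signed": False, "modifies_system_files": True}
result = predict(TREE, sample)
```

In this sketch the tree is written by hand; in practice such a tree would typically be learned from a corpus of labeled samples.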
Decision trees and other heuristics may be trained and refined using a corpus of known samples. For example, a security-software vendor may train a malware-detection heuristic by applying and refining the heuristic using a corpus of samples containing known-malicious files and known-legitimate files. In order to maximize the accuracy and efficacy of such heuristics across large user bases, security-software vendors typically form these training corpuses using files and other software components that are prevalent within their user bases.
Unfortunately, because the accuracy of a heuristic is generally limited by the size of the corpus of samples used to train the heuristic, such heuristics may generate false positives upon being deployed and used on end users' machines in the real world. For example, a malware-detection heuristic may falsely classify legitimate administrative utilities (such as network-traffic monitoring tools or the like) on the machine of an IT security analyst as malicious if these tools exhibit behaviors and/or attributes that are closely related to behaviors and/or attributes exhibited by known-malicious software components (such as malicious sniffing tools) within the corpus of samples used to train the heuristic.
Heuristic vendors may attempt to improve the accuracy of a heuristic by: 1) adding misclassified samples to the corpus of samples used to train the heuristic, 2) re-training the heuristic using the modified corpus of samples, and then 3) redeploying the re-trained heuristic. However, if a security-software vendor attempts to tune or otherwise refine a heuristic based on false positives generated on the machines of certain specific classes of end users (such as IT security analysts), the overall efficacy and/or accuracy of the heuristic may suffer with respect to the larger user base as a whole. For example, a security-software vendor may hamper a malware-detection heuristic's ability to detect malicious network-sniffing components by adding legitimate network-traffic monitoring tools to a corpus of samples used to re-train the heuristic.
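The trade-off described above can be sketched with a deliberately simplified toy heuristic. This is an assumption-laden illustration, not any vendor's actual method: the behavior names, the sample corpus, and the scoring rule (each behavior is weighted by the number of malicious samples exhibiting it minus the number of legitimate samples exhibiting it) are all hypothetical.

```python
from collections import Counter


def train(corpus):
    """Weight each behavior by (# malicious samples) - (# legitimate samples)."""
    weights = Counter()
    for behaviors, is_malicious in corpus:
        for behavior in behaviors:
            weights[behavior] += 1 if is_malicious else -1
    return weights


def classify(weights, behaviors):
    """Flag a sample as malicious when its summed behavior weights are positive."""
    return sum(weights[b] for b in behaviors) > 0


# Hypothetical initial corpus: (observed behaviors, known-malicious?)
corpus = [
    ({"captures_packets", "promiscuous_mode", "hides_process"}, True),
    ({"captures_packets", "promiscuous_mode"}, True),
    ({"opens_window", "reads_documents"}, False),
    ({"opens_window", "writes_documents"}, False),
]

monitoring_tool = {"captures_packets", "promiscuous_mode", "opens_window"}  # legitimate
new_sniffer = {"captures_packets", "promiscuous_mode", "sends_data"}        # malicious

heuristic = train(corpus)
false_positive_before = classify(heuristic, monitoring_tool)  # True: legitimate tool flagged
detected_before = classify(heuristic, new_sniffer)            # True: sniffer detected

# Steps 1 and 2: add the misclassified legitimate tools to the corpus and re-train.
retrained_corpus = corpus + [
    (monitoring_tool, False),
    ({"captures_packets", "promiscuous_mode", "writes_documents"}, False),
]
retrained = train(retrained_corpus)
false_positive_after = classify(retrained, monitoring_tool)   # False: false positive fixed
detected_after = classify(retrained, new_sniffer)             # False: sniffer now missed
```

In this toy setting, re-training removes the false positive on the analyst's monitoring tool, but the packet-capture behaviors no longer carry any weight, so the re-trained heuristic also stops flagging the malicious sniffer.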