Decision trees and other heuristics are commonly used as predictive models that map observations about an item to conclusions about the item's target value. For example, a security-software vendor may use decision trees as predictive models for identifying malware based on attributes, characteristics, and behaviors of files.
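By way of illustration only, the following minimal sketch shows how a small decision tree might map observations about a file to a conclusion about that file. The attribute names (`is_signed`, `writes_to_system_dir`, `outbound_connections`), the threshold, and the labels are hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass


@dataclass
class FileObservation:
    # Hypothetical attributes, chosen only for illustration.
    is_signed: bool
    writes_to_system_dir: bool
    outbound_connections: int


def classify(obs: FileObservation) -> str:
    """Walk a tiny hand-built decision tree and return a label."""
    if obs.is_signed:
        return "legitimate"
    if obs.writes_to_system_dir:
        return "malicious"
    # The threshold of 10 is an arbitrary illustrative value.
    if obs.outbound_connections > 10:
        return "malicious"
    return "legitimate"
```

Each internal node tests one observation, and each leaf supplies the predicted target value. In practice such a tree would typically be learned from samples rather than written by hand.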
Decision trees may be trained and refined using a corpus of known samples. For example, a security-software vendor may train a decision tree used to identify malware by applying the decision tree to a corpus of samples containing known-malicious files and known-legitimate files.
Unfortunately, the accuracy of a decision tree is often limited by the size of the corpus of samples used to train the tree. As such, decision trees commonly generate false negatives and/or false positives upon being deployed and used in the real world. In order to improve the accuracy of a decision tree, decision-tree providers typically: 1) add the misclassified samples to the corpus of samples used to train the decision tree, 2) retrain the decision tree using the corpus of samples, and then 3) redeploy the retrained decision tree.
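The conventional cycle described above may be sketched as follows. This is a simplified illustration under stated assumptions, not any vendor's actual training procedure: the tree learner is a toy that splits on boolean features by counting misclassifications, the feature names (`packed`, `injects_code`) and all sample data are hypothetical, and redeployment is represented simply by returning the retrained tree.

```python
from collections import Counter


def majority(labels):
    """Return the most common label (ties resolve to the first one seen)."""
    return Counter(labels).most_common(1)[0][0]


def train_tree(samples, features):
    """Train a toy tree. samples: list of (dict of feature -> bool, label)."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1 or not features:
        return majority(labels)

    def errors(feature):
        # Misclassifications if each branch predicted its majority label.
        total = 0
        for value in (True, False):
            branch = [l for a, l in samples if a[feature] == value]
            if branch:
                total += len(branch) - Counter(branch).most_common(1)[0][1]
        return total

    best = min(features, key=errors)
    rest = [f for f in features if f != best]
    branches = {}
    for value in (True, False):
        subset = [(a, l) for a, l in samples if a[best] == value]
        branches[value] = train_tree(subset, rest) if subset else majority(labels)
    return (best, branches)


def predict(tree, attrs):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[attrs[feature]]
    return tree


def retrain_with_misclassified(tree, corpus, field_samples, features):
    # 1) add the misclassified samples to the corpus
    misclassified = [(a, l) for a, l in field_samples if predict(tree, a) != l]
    updated_corpus = corpus + misclassified
    # 2) retrain on the updated corpus; 3) the caller redeploys the result
    return train_tree(updated_corpus, features), misclassified


FEATURES = ["packed", "injects_code"]  # hypothetical feature names
corpus = [
    ({"packed": True, "injects_code": True}, "malicious"),
    ({"packed": True, "injects_code": False}, "malicious"),
    ({"packed": False, "injects_code": False}, "legitimate"),
]
# A known-malicious file later encountered in the field (hypothetical).
field = [({"packed": False, "injects_code": True}, "malicious")]

deployed = train_tree(corpus, FEATURES)
retrained, missed = retrain_with_misclassified(deployed, corpus, field, FEATURES)
```

In this toy data set, the initially deployed tree splits only on `packed` and therefore produces a false negative for the field sample, while the retrained tree adds a split on `injects_code` and classifies that sample correctly.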
However, the time required to identify misclassified samples, incorporate them into the corpus of samples used to train a decision tree, and then retrain and redeploy the decision tree may introduce a significant delay, during which large numbers of misclassifications may occur in the field. Moreover, even if a decision tree is retrained using a corpus of samples that includes misclassified samples gathered from the field, there is no guarantee that the retrained decision tree will exclude the precise combination of behaviors that resulted in the misclassifications that the decision-tree provider hoped to avoid by retraining the tree.