1. Field of Disclosure
The disclosure relates generally to the field of computer science and, in particular, to decision tree induction for purposes including identifying rules for detecting malicious software.
2. Description of the Related Art
A wide variety of malicious software (malware) can attack modern computers. Malware threats include computer viruses, worms, Trojan horse programs, spyware, adware, crimeware, and phishing websites. Malicious entities sometimes attack servers that store sensitive or confidential data that can be used to the malicious entities' own advantage. Similarly, other computers, including home computers, must be constantly protected from malicious software that can be transmitted when a user communicates with others via electronic mail, when a user downloads new programs or program updates, and in many other situations. Malicious entities have numerous options and methods available for attacking a computer.
Conventional techniques for detecting malware, such as signature string scanning, are becoming less effective. Modern malware is often targeted and delivered to only a relative handful of computers. For example, a Trojan horse program can be designed to target computers in a particular department of a particular enterprise. Such malware might never be encountered by security analysts, and thus the security software might never be configured with signatures for detecting such malware. Mass-distributed malware, in turn, can contain polymorphisms that make every instance of the malware unique. As a result, it is difficult to develop signature strings that reliably detect all instances of the malware.
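By way of illustration only, the weakness of signature string scanning against polymorphic malware can be sketched as follows. The signature bytes, the function names `matches_signature` and `xor_encode`, and the single-byte XOR transformation are all hypothetical and chosen for simplicity; actual polymorphic malware uses far more elaborate transformations, and the sketch is not drawn from any particular security product.

```python
# Hypothetical signature string; not taken from any real malware sample.
SIGNATURE = b"\xde\xad\xbe\xef"


def matches_signature(file_bytes: bytes, signature: bytes = SIGNATURE) -> bool:
    """Return True if the signature string occurs anywhere in the file."""
    return signature in file_bytes


def xor_encode(payload: bytes, key: int) -> bytes:
    """Simplified stand-in for a polymorphic transformation (assumption):
    every byte is XORed with a single-byte key."""
    return bytes(b ^ key for b in payload)


# The original instance contains the signature and is detected.
original = b"header" + SIGNATURE + b"payload"

# A variant with the same underlying content but a different byte pattern.
variant = xor_encode(original, 0x5A)

detected_original = matches_signature(original)
detected_variant = matches_signature(variant)
```

In this sketch the original instance matches the signature while the re-encoded variant does not, even though decoding the variant with the same key recovers the original bytes.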
Newer techniques for detecting malware apply rules that infer whether a target computer file is malicious by examining attributes of the target file. The rules are typically derived from decision trees that are developed using decision tree induction algorithms, which build decision trees based on attributes of a training set. The cost associated with determining an attribute value, often referred to as the computational complexity of the attribute, varies among the attributes used in the decision trees. For example, it is less expensive to determine whether a file is digitally signed than it is to monitor some aspect of its run-time behavior. Existing decision tree induction algorithms do not take such complexity into account in constructing decision trees and instead assume that all attributes are equally accessible. Therefore, the decision trees generated by existing induction algorithms may rely heavily on a few attributes that are effective but very expensive to determine, which degrades system performance. Accordingly, there is a need for techniques that construct decision trees while taking into account the computational complexity of the attributes used in the decision trees.
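For purposes of illustration, the following minimal sketch shows how a conventional induction step might select a splitting attribute using information gain alone. Information gain is one common selection criterion and is assumed here; the attribute names `is_signed` and `suspicious_runtime_behavior`, the cost figures, and the toy training set are all hypothetical. The cost table is defined but never consulted by the selection function, which is the limitation described above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(samples, labels, attribute):
    """Reduction in label entropy obtained by splitting on one attribute."""
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def pick_attribute_ignoring_cost(samples, labels, attributes):
    """Conventional choice: highest information gain, cost not considered."""
    return max(attributes, key=lambda a: information_gain(samples, labels, a))


# Hypothetical relative costs of determining each attribute value.
ATTRIBUTE_COST = {"is_signed": 1, "suspicious_runtime_behavior": 100}

# Toy training set (fabricated for illustration): 4 malicious, 4 benign files.
samples = [
    {"is_signed": False, "suspicious_runtime_behavior": True},
    {"is_signed": False, "suspicious_runtime_behavior": True},
    {"is_signed": False, "suspicious_runtime_behavior": True},
    {"is_signed": True,  "suspicious_runtime_behavior": True},
    {"is_signed": True,  "suspicious_runtime_behavior": False},
    {"is_signed": True,  "suspicious_runtime_behavior": False},
    {"is_signed": True,  "suspicious_runtime_behavior": False},
    {"is_signed": False, "suspicious_runtime_behavior": False},
]
labels = ["malicious"] * 4 + ["benign"] * 4

chosen = pick_attribute_ignoring_cost(samples, labels, list(ATTRIBUTE_COST))
chosen_cost = ATTRIBUTE_COST[chosen]
```

On this toy data the run-time attribute separates the classes perfectly and is therefore selected, although under the assumed cost table it is one hundred times more expensive to determine than the signature check, which still provides a positive but smaller gain.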