1. Field of Exemplary Embodiments
Exemplary embodiments relate generally to software code and, more specifically, to systems and methods for detecting malicious executable software code.
2. Description of Background and/or Related and/or Prior Art
Malicious code is “any code added, changed, or removed from a software system to intentionally cause harm or subvert the system's intended function”. Such software has been used to compromise computer systems, to destroy their information, and to render them useless. It has also been used to gather information, such as passwords and credit card numbers, and to distribute information, such as pornography, all without the knowledge of the system's users. As more novice users obtain sophisticated computers with high-speed connections to the Internet, the potential for further abuse is great.
Malicious executables generally fall into three categories based on their transport mechanism: viruses, worms, and Trojan horses. Viruses inject malicious code into existing programs, which become “infected” and, in turn, propagate the virus to other programs when executed. Viruses come in two forms, either as an infected executable or as a virus loader, a small program that only inserts viral code. Worms, in contrast, are self-contained programs that spread over a network, usually by exploiting vulnerabilities in the software running on the networked computers. Finally, Trojan horses masquerade as benign programs, but perform malicious functions. Malicious executables do not always fit neatly into these categories and can exhibit combinations of behaviors.
Excellent technology exists for detecting known malicious executables. Software for virus detection has been quite successful, and programs such as McAfee Virus Scan and Norton AntiVirus are ubiquitous. Indeed, Dell recommends Norton AntiVirus for all of its new systems. Although these products use the word virus in their names, they also detect worms and Trojan horses.
These programs search executable code for known patterns, and this method is problematic. One shortcoming is that we must obtain a copy of a malicious program before we can extract the pattern necessary for its detection. Obtaining a copy of a new or unknown malicious program usually means that the program has already infected or attacked a computer system.
To complicate matters, writing malicious programs has become easier: There are virus kits freely available on the Internet. Individuals who write viruses have become more sophisticated, often using mechanisms to change or obfuscate their code to produce so-called polymorphic viruses. Indeed, researchers have recently discovered that simple obfuscation techniques foil commercial programs for virus detection. These challenges have prompted some researchers to investigate learning methods for detecting new or unknown viruses, and more generally, malicious code.
There have been few attempts to use machine learning and data mining for the purpose of identifying new or unknown malicious code. These have concentrated mostly on PC viruses, thereby limiting the utility of such approaches to a particular type of malicious code and to computer systems running Microsoft's Windows operating system. Such efforts are of little direct use for computers running the UNIX operating system, for which viruses pose little threat. However, the methods proposed are general, meaning that they could be applied to malicious code for any platform, and presently, malicious code for the Windows operating system poses the greatest threat.
In an early attempt, Lo et al. conducted an analysis of several programs (evidently by hand) and identified tell-tale signs, which they subsequently used to filter new programs. While we appreciate their attempt to extract patterns or signatures for identifying any class of malicious code, they presented no experimental results suggesting how general or extensible their approach might be. Researchers at IBM's T.J. Watson Research Center have investigated neural networks for virus detection and have incorporated a similar approach for detecting boot-sector viruses into IBM's Anti-Virus software.
More recently, instead of focusing on boot-sector viruses, Schultz et al. used data mining methods, such as naive Bayes, to detect malicious code. The Schultz et al. article is “Data Mining Methods for Detection of New Malicious Executables,” in Proceedings of the IEEE Symposium on Security and Privacy, pages 38-49, Los Alamitos, Calif., 2001, IEEE Press, the contents of which are incorporated herein by reference. The authors collected 4,301 programs for the Windows operating system and used McAfee Virus Scan to label each as either malicious or benign. There were 3,301 programs in the former category and 1,000 in the latter. Of the malicious programs, 95% were viruses and 5% were Trojan horses. Furthermore, 38 of the malicious programs and 206 of the benign programs were in the Windows Portable Executable (PE) format.
For feature extraction, the authors used three methods: binary profiling, string sequences, and so-called hex dumps. The authors applied the first method to the smaller collection of 244 executables in the Windows PE format and applied the second and third methods to the full collection.
The first method extracted three types of resource information from the Windows executables: (1) a list of Dynamically Linked Libraries (DLLs), (2) function calls from the DLLs, and (3) the number of different system calls from within each DLL. For each resource type, the authors constructed binary feature vectors based on the presence or absence of each in the executable. For example, if the collection of executables used ten DLLs, then they would characterize each executable as a binary vector of size ten. If a given executable used a DLL, then they would set the entry in the executable's vector corresponding to that DLL to one. This processing resulted in 2,229 binary features, and in a similar manner, they encoded function calls and their number, resulting in 30 integer features.
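The presence/absence encoding described above can be illustrated with a minimal sketch. It assumes the DLL names for each executable have already been extracted; the function names and the sample data are hypothetical and are not taken from the Schultz et al. study.

```python
from typing import Dict, List, Set


def build_vocabulary(dll_sets: Dict[str, Set[str]]) -> List[str]:
    """Collect every DLL name used anywhere in the collection, sorted."""
    vocab: Set[str] = set()
    for dlls in dll_sets.values():
        vocab.update(dlls)
    return sorted(vocab)


def to_binary_vector(dlls: Set[str], vocabulary: List[str]) -> List[int]:
    """Set an entry to 1 if the executable uses that DLL, else 0."""
    return [1 if name in dlls else 0 for name in vocabulary]


# Hypothetical collection: executable name -> set of DLLs it uses.
collection = {
    "a.exe": {"kernel32.dll", "user32.dll"},
    "b.exe": {"kernel32.dll", "wsock32.dll"},
}

vocabulary = build_vocabulary(collection)
vectors = {
    name: to_binary_vector(dlls, vocabulary)
    for name, dlls in collection.items()
}
```

Each vector has one entry per DLL in the collection, so a collection that used ten DLLs would yield vectors of size ten.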
The second method of feature extraction used the UNIX strings command, which shows the printable strings in an object or binary file. The authors formed training examples by treating the strings as binary attributes that were either present in or absent from a given executable.
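As a rough illustration of this kind of feature, the sketch below approximates the behavior of the strings command by collecting runs of at least four printable ASCII bytes. The minimum length of four, the character range, and the function name are assumptions made for this sketch; the actual command handles additional cases.

```python
import re
from typing import Set

# Runs of four or more printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_strings(data: bytes) -> Set[str]:
    """Return the set of printable strings found in a binary blob."""
    return {m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)}


# Hypothetical bytes standing in for the contents of an executable.
sample = b"\x00\x01MZ\x90\x00KERNEL32.dll\x00\xffGetProcAddress\x00ab\x00"
found = printable_strings(sample)

# Each string becomes a binary attribute: present in or absent from the file.
has_getprocaddress = "GetProcAddress" in found
```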
The third method used the hexdump utility, which is similar to the UNIX octal dump (od -x) command. This printed the contents of the executable file as a sequence of hexadecimal numbers. As with the printable strings, the authors used two-byte words as binary attributes that were either present or absent.
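A minimal sketch of the two-byte-word attributes follows. It reads consecutive, non-overlapping byte pairs directly from the raw bytes rather than parsing hexdump output, keeps the bytes in file order (a dump tool may display each word in a platform-dependent byte order), and drops a trailing odd byte. These choices and the function name are assumptions made for illustration.

```python
from typing import Set


def two_byte_words(data: bytes) -> Set[str]:
    """Return the set of two-byte words, as hex text, present in the data."""
    return {data[i:i + 2].hex() for i in range(0, len(data) - 1, 2)}


# Hypothetical six-byte file prefix.
words = two_byte_words(bytes.fromhex("4d5a90000300"))

# Each possible word is a binary attribute: present or absent.
is_present = "4d5a" in words
is_absent = "ffff" not in words
```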
After processing the executables using these three methods, the authors paired each extraction method with a single learning algorithm. Using five-fold cross-validation, they used RIPPER to learn rules from the training set produced by binary profiling. They used naive Bayes to estimate probabilities from the training set produced by the strings command. Finally, they used an ensemble of six naive-Bayesian classifiers on the hexdump data by training each on one-sixth of the lines in the output file. The first learned from lines 1, 7, 13, . . . ; the second, from lines 2, 8, 14, . . . ; and so on. As a baseline method, the authors implemented a signature-based scanner by using byte sequences unique to the malicious executables.
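The division of the hexdump output among the six classifiers can be sketched as follows, assuming a round-robin split in which classifier k receives every sixth line starting from line k. Combining the six predictions by simple majority vote is also an assumption of this sketch, as are the function names; the classifiers themselves are omitted.

```python
from collections import Counter
from typing import List, Sequence


def round_robin_split(lines: Sequence[str], n_parts: int = 6) -> List[List[str]]:
    """Give part k every n_parts-th line, starting from the k-th line."""
    return [list(lines[k::n_parts]) for k in range(n_parts)]


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most common label (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical dump of 14 lines, labeled by their 1-based line number.
lines = [f"line{i}" for i in range(1, 15)]
parts = round_robin_split(lines)

# Hypothetical outputs of six classifiers for one executable.
label = majority_vote(
    ["malicious", "benign", "malicious", "malicious", "benign", "malicious"]
)
```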
The authors concluded, based on true-positive (TP) rates, that the voting naive Bayesian classifier outperformed all other methods, which appear with false-positive (FP) rates and accuracies in Table 1. The authors also presented receiver operating characteristic (ROC) curves, but did not report the areas under these curves. Nonetheless, the curve for the single naive Bayesian classifier appears to dominate that of the voting naive Bayesian classifier in most of the ROC space, suggesting that the best performing method was actually naive Bayes trained with strings.
TABLE 1
Results from the study conducted by Schultz et al.

Method                          TP Rate   FP Rate   Accuracy (%)
Signature + hexdump             0.34      0.00      49.31
RIPPER + DLLs used              0.58      0.09      83.61
RIPPER + DLL function used      0.71      0.08      89.36
RIPPER + DLL function counts    0.53      0.05      89.07
Naïve Bayes + strings           0.97      0.04      97.11
Voting Naïve Bayes + hexdump    0.98      0.06      96.88
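For reference, the three measures reported in Table 1 follow from the standard confusion-matrix definitions, treating malicious as the positive class. The sketch below uses hypothetical counts, not figures from the study, and the function name is illustrative.

```python
from typing import Tuple


def detection_rates(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float, float]:
    """Return (TP rate, FP rate, accuracy) from confusion-matrix counts.

    tp: malicious programs flagged as malicious
    fn: malicious programs passed as benign
    fp: benign programs flagged as malicious
    tn: benign programs passed as benign
    """
    tp_rate = tp / (tp + fn)    # fraction of malicious programs detected
    fp_rate = fp / (fp + tn)    # fraction of benign programs falsely flagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tp_rate, fp_rate, accuracy


# Hypothetical counts: 100 malicious and 100 benign programs.
tp_rate, fp_rate, accuracy = detection_rates(tp=90, fn=10, fp=5, tn=95)
```

Because accuracy weights each class by its size, a detector evaluated on a collection dominated by malicious programs can show low accuracy despite a false-positive rate of zero, as the signature-based row of Table 1 illustrates.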
However, as the authors discuss, one must question the stability of DLL names, function names, and string features. For instance, one may be able to compile a source program using another compiler to produce an executable different enough to avoid detection. Programmers often use methods to obfuscate their code, so a list of DLLs or function names may not be available.
The authors paired each feature extraction method with a learning method, and as a result, RIPPER was trained on a much smaller collection of executables than were naive Bayes and the ensemble of naive-Bayesian classifiers.
There are other methods of guarding against malicious code. Object reconciliation involves comparing current files and directories to past copies, or comparing their cryptographic hashes. One can also audit running programs and statically analyze executables using pre-defined malicious patterns. These approaches are not based on data mining, although one could imagine the role such techniques might play.
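Hash-based object reconciliation can be sketched as a comparison of stored digests against digests of the current file contents. The example below is a minimal illustration that uses SHA-256 and in-memory byte strings in place of files on disk; the choice of hash function, the function names, and the sample data are all assumptions.

```python
import hashlib
from typing import Dict, List


def digest(content: bytes) -> str:
    """Return the SHA-256 digest of the content as hex text."""
    return hashlib.sha256(content).hexdigest()


def changed_files(baseline: Dict[str, str], current: Dict[str, bytes]) -> List[str]:
    """Return names that are new or whose digest differs from the baseline."""
    return sorted(
        name
        for name, content in current.items()
        if baseline.get(name) != digest(content)
    )


# Hypothetical baseline recorded while the system was known to be clean.
baseline = {
    "login.exe": digest(b"original login program"),
    "notepad.exe": digest(b"original editor"),
}

# Hypothetical current state: one file altered, one file added.
current = {
    "login.exe": b"original login program + injected code",
    "notepad.exe": b"original editor",
    "helper.exe": b"unexpected new program",
}

suspicious = changed_files(baseline, current)
```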
Researchers have also investigated classification methods for the determination of software authorship. Most notorious in the field of authorship are the efforts to determine whether Sir Francis Bacon wrote works attributed to Shakespeare, or who wrote the twelve disputed Federalist Papers, Hamilton or Madison. Recently, similar techniques have been used in the relatively new field of software forensics to determine program authorship. Gray et al. wrote a position paper on the subject of authorship, whereas Krsul conducted an empirical study by gathering code from programmers of varying skill, extracting software metrics, and determining authorship using discriminant analysis. There are also relevant results published in the literature pertaining to the plagiarism of programs, which we will not survey here.
Krsul collected 88 programs written in the C programming language from 29 programmers at the undergraduate, graduate, and faculty levels. He then extracted 18 layout metrics (e.g., indentation of closing curly brackets), 15 style metrics (e.g., mean line length), and 19 structure metrics (e.g., percentage of int function definitions). On average, Krsul determined correct authorship 73% of the time. Interestingly, of the 17 most experienced programmers, he was able to determine authorship 100% of the time. The least experienced programmers were the most difficult to classify, presumably because they had not settled into a consistent style. Indeed, they “were surprised to find that one [programmer] had varied his programming style considerably from program to program in a period of only two months”.
While interesting, it is unclear how much confidence we should have in these results. Krsul used 52 features and only one or two examples for each of the 20 classes (i.e., the authors). This seems underconstrained, especially when rules of thumb suggest that one needs ten times more examples than features. On the other hand, it may also suggest that one simply needs to be clever about what constitutes an example. For instance, one could presumably use functions as examples rather than programs, but for the task of determining authorship of malicious programs, it is unclear whether such data would be possible to collect or if it even exists. Fortunately, as we discuss below, a lack of data was not a problem for our project.