The present invention generally relates to malware detection and more specifically relates to using a determination of data entropy, ratio of string data to non-string data, and computer instruction disassembly to detect malware inside of data files that should not contain executable code.
A common problem facing information security personnel is the need to identify suspicious or outright malicious software or data on a computer system. This problem typically arises when an attacker uses a malicious piece of software to compromise a computer system. Initial steps taken in response to this kind of situation include attempts to locate and identify malicious software (also known as “malware”, composed of machine instructions) or data, followed by attempts to classify that malicious software so that its capabilities may be better understood. Investigators and response personnel use a variety of techniques to locate and identify suspicious software, such as temporal analysis, filtering of known entities, and Live Response.
Temporal analysis involves a review of all activity on a system according to date and time so that events occurring in or around the time window of a suspected compromise may be more closely examined. Such items might include event log entries; files created, deleted, accessed, or modified; processes that were started or terminated; network ports opened or closed; and similar items.
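As a rough illustration of the temporal analysis described above, the following Python sketch selects events that fall within a margin around a suspected time of compromise. The function name events_near, the (timestamp, description) tuple layout, and the sample events are all hypothetical and are used only for illustration; a real investigation would draw these records from event logs, file system metadata, and similar sources.

```python
from datetime import datetime, timedelta

def events_near(events, suspected_time, margin):
    """Return (timestamp, description) events within +/- margin, oldest first."""
    earliest = suspected_time - margin
    latest = suspected_time + margin
    selected = [event for event in events if earliest <= event[0] <= latest]
    return sorted(selected, key=lambda event: event[0])

# Hypothetical, hand-written events for illustration only.
sample_events = [
    (datetime(2009, 3, 1, 2, 15), "file modified: report.doc"),
    (datetime(2009, 3, 1, 2, 5), "process started: svch0st.exe"),
    (datetime(2009, 2, 20, 9, 0), "event log: user logon"),
]

# Keep only events within 30 minutes of the suspected compromise time.
suspicious = events_near(
    sample_events, datetime(2009, 3, 1, 2, 0), timedelta(minutes=30)
)
```

In this sketch the logon event from February 20 falls outside the window and is dropped, leaving the two events from the early morning of March 1 in chronological order.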
Additionally, a comparison of files on the system being examined against known file patterns may be performed. In this situation, all files on the system may be reviewed and compared against a database of known, previously encountered files. Such comparisons are usually accomplished through use of a cryptographic hash algorithm, a well-known mathematical function that takes the data from a file and turns it into a compact numerical representation known as a hash value. A fundamental property of hash functions is that if two hash values generated using the same algorithm are different, then the data used to generate those hashes must also be different. The converse, that hashes found to match were generated from identical data, is not strictly guaranteed. However, hash collisions (identical hashes generated from different input data) are rare enough for cryptographic hash algorithms that a hash comparison may be used to determine file equivalence.
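The known-file filtering described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions: SHA-256 is chosen here only as one example of a cryptographic hash algorithm, the set known_hashes stands in for the database of previously encountered files, and the function names sha256_of_file and filter_unknown_files are hypothetical.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def filter_unknown_files(paths, known_hashes):
    """Return the paths whose hash is NOT in the set of known hash values.

    known_hashes is an assumed stand-in for a database of known files,
    represented here as a set of lowercase hexadecimal digests.
    """
    return [path for path in paths if sha256_of_file(path) not in known_hashes]
```

Files whose hashes match a known entry can be set aside, leaving a smaller set of unrecognized files for closer review.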
An alternative to reviewing static historical data such as files and event logs is Live Response. This technique examines running programs, system memory contents, network port activity, and other system metadata while the computer system is still on and in a compromised state in order to identify how it may have been modified by an attacker.
There are many other techniques that may be employed to identify suspicious activity on a potentially compromised computer system. These techniques often generate a rather large amount of data, all of which must be reviewed and interpreted in order to reach any conclusions. Further complicating this review is the fact that attackers typically have a good understanding of the techniques used to identify compromised systems. They employ various methods to hide their presence, making the job of an investigator that much more difficult. Some of these methods include deleting indicators of their entry to a system once it is compromised, such as log file entries, file modification/access dates, and system processes. Attackers may also obfuscate running malware by changing its name or execution profile such that it appears to be something benign. In order to better hide malware or other data stored on disk, attackers may make use of a “packed” storage format. Packing is a technique by which data is obfuscated or encrypted, encapsulated along with a program that performs the decryption or de-obfuscation, and then stored somewhere on a system. For example, a “Packed Executable” is a piece of software that contains an “unpacking” program and a payload of encrypted data. That payload is often malicious software, such as a virus or Trojan Horse. Attackers may also embed malware inside of files that otherwise would not contain executable machine instructions. This packaging serves two purposes. First, it attempts to hide the attacker's malware in a location that may be easily overlooked by an investigator. Second, it may be used to dupe a computer user into inadvertently executing the malware, thus compromising the user's computer system.
One of the fundamental properties of a data set consisting of machine instructions, when compared to a human-readable data set, is that the randomness, or “entropy”, of the data tends to be higher. Techniques for determining data entropy to identify malware are described in U.S. patent application Ser. No. 11/657,541, published as US Pat. Pub. 2008-0184367, the disclosure of which is hereby incorporated by reference in its entirety into the present application. While an examination of entropy may provide a useful filter, a measure of entropy alone is not a guaranteed method for identifying executable machine instructions. Moreover, there are drawbacks to using entropy across a block of data. For example, entropy is a global measurement across a data set, returning a single value for that set. This means that a data block may return a low entropy measurement even though small sections of that same data contain very high entropy, because the low-entropy majority of the block dominates the single global value.
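The drawback of a single global entropy value can be illustrated with a short Python sketch. It is not the method of the referenced application; it assumes byte-level Shannon entropy (in bits per byte, from 0 to 8) as the entropy measure, and the window and step sizes, the function names, and the synthetic sample data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def windowed_entropy(data: bytes, window: int = 256, step: int = 128):
    """Return (offset, entropy) pairs for fixed-size windows over the data."""
    last_start = max(len(data) - window, 0)
    return [
        (offset, shannon_entropy(data[offset:offset + window]))
        for offset in range(0, last_start + 1, step)
    ]

# Synthetic sample: a small region of varied bytes hidden between two
# long runs of a single repeated byte (assumed data, for illustration).
sample = b"A" * 4096 + bytes(range(256)) + b"A" * 4096

global_value = shannon_entropy(sample)
peak_offset, peak_value = max(windowed_entropy(sample), key=lambda pair: pair[1])
```

For this sample, the global value stays below one bit per byte, while the window starting at offset 4096 reaches the maximum of eight bits per byte, showing how a localized high-entropy region can be masked by the surrounding data.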
Thus, there is a need in the art for a technique to derive a robust measurement of entropy in order to detect the presence of malware in a computer system that has been hidden by an attacker inside of data streams that do not normally contain executable machine instructions.