Computer programs provide many functions. For example, computer programs help people track business operations, control machinery and equipment, organize personal lives, manage networks, and communicate with other people, to name a few. Over time, computer programs are developed to provide additional types of functions or to improve existing ones.
Computer programs generally are formed from what is called computer code. A number of different constructs for computer code are known in the art. Certain types of computer code, termed source code, are configured to be easily read and manipulated by people. Other types of computer code, termed machine code, are made up of a series of representations configured to be read directly by a computer processor or other component of a computer system.
A person who creates a computer program typically writes the program in source code. After the source code is received by the computer system, a compiler component translates the source code into assembly level code. Another component of a computer system, an assembler, converts the assembly level code into machine code.
In other methods, a computer system generates the computer code directly as machine code and no translation is necessary.
Each construct (e.g., source code or machine code) may be written in one or more languages. Examples of source code languages are C, C++, Java, Fortran, and JavaScript, to name a few. Examples of machine code languages include binary languages representing the instruction sets of the Intel x86 architecture, the ARM architecture, the Java Virtual Machine, and the Python virtual machine, to name a few.
Binary languages generally use only two types of representations, typically the numbers zero and one, repeated in some pattern to convey information. Each pattern may include one or multiple sets of representations, wherein each set of representations has a discrete meaning (e.g., a number, letter, instruction). For example, the set of representations “01000010” means capital letter “B” according to one binary language.
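As a minimal illustration, and assuming the binary language in the example is 8-bit ASCII (the text does not name it), the following Python sketch converts between a character and its bit string. The helper names are hypothetical.

```python
def to_bit_string(ch: str, width: int = 8) -> str:
    """Return the fixed-width bit string for a single character's code point."""
    return format(ord(ch), f"0{width}b")


def from_bit_string(bits: str) -> str:
    """Decode one bit string back into the character it represents."""
    return chr(int(bits, 2))


print(to_bit_string("B"))           # 01000010
print(from_bit_string("01000010"))  # B
```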
In certain methods using a binary language, each set of representations, known as a bit string, is formed by eight representations, each of which may be a zero or a one; such a language is known as an eight-bit binary code. However, a bit string may be any length, for example, 5, 6, 7, 8, 10, 16, or 32 representations. Every bit string may be the same length, known as fixed-length binary code, or the bit strings may vary in length, known as variable-length binary code. A computer program consists of one or more bit strings of instructions in machine code along with other bits representing data. Such a computer program may be stored on a storage device in a file known as an executable file.
The CPU inside a computer treats a specific pattern of binary code as an instruction to perform a specific operation on its registers or the memory. A register is a temporary work area easily accessible within the CPU, and each register is given a symbolic name. For instance, registers in the Intel x86 architecture are named eax, ebx, ecx, ax, bx, zf, cf, etc. An Intel x86 CPU treats a certain sequence of binary code as a command to add the contents of registers eax and ebx and store the result in register eax. In one example, data from an executable file can be temporarily stored in a register until the specific operation is commanded for execution.
From time to time, a user may wish to have certain computer code analyzed. More specifically, a user may wish to identify malware, that is, a computer program configured to disrupt computer operation, gather sensitive information, or gain access to private computer systems. Examples of malware include viruses, worms, Trojan horses, ransomware, rootkits, keyloggers, dialers, spyware, adware, malicious browser helper objects (BHOs), and rogue security programs. Clearly, it would be beneficial to be able to efficiently analyze computer code to determine whether it contains malware.
A user may also wish to have computer code analyzed to assess what version of a program is being used by a computer system. The code may also be analyzed for purposes of preparing and applying a patch, a section of computer code configured to fix or update an existing computer program. Sometimes a patch may be distributed as an executable file, possibly configured to modify the computer code at the binary level or by completely replacing an existing executable file or computer program.
In addition, a user may wish to analyze computer code for purposes of enforcing ownership rights in the code. More specifically, many computer programs are protected by copyright or patent rights. The owner of the computer program may wish to detect and identify any other computers that are copying, distributing, or using the copyright-protected code and/or patent-protected code without the owner's permission.
Some approaches for analyzing computer code have already been developed. However, known approaches for analyzing computer code are typically associated with certain disadvantages or limitations.
Many known approaches for analyzing computer code include starting with a code section of interest, possibly formed from one or more executable files, and attempting to find a match for the code section of interest within the designated code searched, which also may be formed from one or more executable files. However, because the designated code searched may undergo minor changes, a search for identical matches of code has limited value because the results may omit many relevant code sections that are similar, though not identical, to the code section of interest.
Code sections may be considered similar if the instructions are identical except for the choice of registers, such as eax, ebx, etc. in the Intel x86 architecture. Code sections may also be considered similar if they have instructions in a different order but cause the CPU to produce the same end result. Code sections may also be considered similar if they use different instructions but the instructions collectively cause the CPU to produce the same end result. Additionally, code sections may be considered similar when the instructions effectively produce the same end result but rely on different memory locations. Similar code sections, such as those just described, are commonly created by compilers as a result of code reordering, register renaming, choice of instructions, and differences in compiler optimizations.
It is known in the art that it is not mathematically possible to develop a method that correctly and accurately determines two code segments to be similar if and only if they are truly similar. Hence, all known methods for comparing sections of code are inherently imprecise. A method may incorrectly determine two code sections to be similar when they are in fact not similar. Such errors are termed false positives. A method may also incorrectly determine two code sections to be different when they are in fact similar. Such errors are termed false negatives. It is desirable to develop methods that have few false positive errors and that are also computationally efficient.
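The first kind of similarity, instructions that differ only in the choice of registers, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a method described in the text: the register list is a small subset of the 32-bit x86 registers, instructions are plain assembly strings, and the helper name normalize_registers is hypothetical.

```python
import re

# Assumption: a small subset of 32-bit x86 general-purpose register names.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi"}
TOKEN = re.compile(r"[a-z_][a-z0-9_]*")


def normalize_registers(instructions):
    """Replace each register with a placeholder numbered by first appearance."""
    mapping = {}

    def rename(match):
        token = match.group(0)
        if token not in REGISTERS:
            return token  # mnemonics, labels, and other names are left alone
        if token not in mapping:
            mapping[token] = f"r{len(mapping)}"
        return mapping[token]

    return [TOKEN.sub(rename, ins.lower()) for ins in instructions]


a = ["mov eax, 5", "add eax, ebx"]
b = ["mov ecx, 5", "add ecx, edx"]   # same code with different registers
c = ["mov eax, 5", "add ebx, eax"]   # registers used in different roles

print(normalize_registers(a))                            # ['mov r0, 5', 'add r0, r1']
print(normalize_registers(a) == normalize_registers(b))  # True
print(normalize_registers(a) == normalize_registers(c))  # False
```

The sketch addresses register renaming only. Reordered instructions, different instruction choices, and different memory locations would still compare as unequal.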
A known approach to permit analysis of sections of code is based on abstracting the code section of interest. In this approach a code section of interest is first disassembled. Disassembly typically consists of converting the computer code from binary format into an assembly format. The disassembled code—that is, code in assembly format—is decomposed into procedures. A procedure is a sequence of one or more instructions that a CPU may be directed to execute by a “CALL” instruction. The code of a procedure is then analyzed to construct a control flow graph (CFG). A CFG may be a flow chart mapping the order of actions identified in the code, in which each node in the graph represents a code fragment or basic block of code, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges may be used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph; and the exit block, through which all control flow leaves.
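The decomposition of a procedure into basic blocks and directed edges can be sketched in Python as shown below. The instruction encoding is an assumption made for illustration: each instruction is a (mnemonic, target) pair, "jmp" stands for an unconditional jump, "jcc" stands for any conditional jump, and target is the index of the instruction jumped to. The function name build_cfg is hypothetical.

```python
def build_cfg(instructions):
    """Split a procedure into basic blocks and connect them with edges.

    Returns (blocks, edges). Each block is a (start, end) index range with
    end exclusive; each edge is a (source_block, destination_block) pair.
    """
    if not instructions:
        return [], set()
    n = len(instructions)

    # A leader starts a block: the first instruction, every jump target,
    # and every instruction that follows a jump or a return.
    leaders = {0}
    for i, (op, target) in enumerate(instructions):
        if op in ("jmp", "jcc"):
            leaders.add(target)
        if op in ("jmp", "jcc", "ret") and i + 1 < n:
            leaders.add(i + 1)

    starts = sorted(leaders)
    blocks = list(zip(starts, starts[1:] + [n]))
    block_of = {start: b for b, (start, _) in enumerate(blocks)}

    edges = set()
    for b, (start, end) in enumerate(blocks):
        op, target = instructions[end - 1]   # last instruction of the block
        if op in ("jmp", "jcc"):
            edges.add((b, block_of[target]))  # jump edge
        if op not in ("jmp", "ret") and end < n:
            edges.add((b, block_of[end]))     # fall-through edge
    return blocks, edges


procedure = [
    ("cmp", None),  # 0
    ("jcc", 4),     # 1: conditional jump to instruction 4
    ("mov", None),  # 2
    ("jmp", 5),     # 3: unconditional jump to instruction 5
    ("add", None),  # 4
    ("ret", None),  # 5
]
blocks, edges = build_cfg(procedure)
print(blocks)         # [(0, 2), (2, 4), (4, 5), (5, 6)]
print(sorted(edges))  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```

In this example block 0 is the entry block and block 3, which ends in the return, is the exit block.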
There are many approaches known in the art that compare sections of code after decomposing them into procedures, control flow graphs, and blocks. In one method, a cryptographic hash of one or more instructions of the procedure is computed. Sections of code are compared by comparing the hashes. Clearly, such a method permits matching sections of code that have identical procedures containing identical instructions. However, the method cannot match code sections that are similar but differ in the choice of registers, the order of instructions, or similar variations.
In another approach, similar procedures are located using graph isomorphism. In this approach, corresponding blocks of code of two procedures of interest are found by computing statistics related to the types of instructions in a block as well as the in-degree and out-degree of each block in the CFG. By using types of instructions and their statistics, instead of the instructions themselves, such an approach is able to overcome differences due to register renaming and code reordering. Although this approach leads to very efficient comparison, it also creates a significantly high number of false negative errors because this approach uses only the types of instructions, and not the instructions themselves, in performing the comparison. There are approaches known to reduce false negative errors using statistical properties of the graph, but these methods increase false positive errors.
Accordingly, additional steps to account for small differences in code may be added. In an amended approach, the instructions inside each block of code are first lifted to their operational semantics. The semantics of a segment of code is the effect of that code on the state variables, that is, the registers and memory of the CPU and peripherals such as the display, printer, hard drives, etc. Given the content of the state variables before the code segment is executed, the semantics describes their content after the code segment is executed. The semantics of a code segment may be computed by composing the semantics of its individual instructions. Such semantics is termed the operational semantics since it captures the intermediate values of the state variables. Two code segments may be compared using their operational semantics. Such comparison is very strict since it matches only those code fragments that effect changes in the state variables in exactly the same order. Most often it is desirable to compare code fragments using denotational semantics, that is, the net effect of a code segment on the state variables after the code segment has been executed. In this approach, a theorem prover is used to determine whether the operational semantics of two code fragments have the same net effect, i.e., have the same denotational semantics. It is also desirable to consider two code fragments to have matching denotational semantics if they differ only by a consistent renaming of the registers, such as using the register eax instead of ebx. In this amended approach, such a match is determined by using a theorem prover to try all possible permutations of register names to find a match.
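The hash-based comparison, and its sensitivity to small variations, can be illustrated with a short Python sketch. SHA-256 is an assumption here, since the text does not name a hash function, and the helper name procedure_hash is hypothetical.

```python
import hashlib


def procedure_hash(instructions):
    """SHA-256 digest over a procedure's instruction text, one per line."""
    data = "\n".join(instructions).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


original = ["mov eax, 5", "add eax, ebx", "ret"]
identical = ["mov eax, 5", "add eax, ebx", "ret"]
renamed = ["mov ecx, 5", "add ecx, edx", "ret"]  # only the registers differ

print(procedure_hash(original) == procedure_hash(identical))  # True
print(procedure_hash(original) == procedure_hash(renamed))    # False
```

Changing a single register name produces an unrelated digest, so the two procedures no longer match.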
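A per-block summary of this kind might look like the following Python sketch. The mapping from mnemonics to instruction types is a coarse, hypothetical one chosen for illustration, and the function name block_signature is likewise an assumption. The graph matching step itself is not shown.

```python
from collections import Counter

# Assumption: a coarse, hypothetical mapping from mnemonics to instruction types.
INSTRUCTION_TYPES = {
    "mov": "data", "push": "data", "pop": "data",
    "add": "arith", "sub": "arith", "cmp": "arith",
    "jmp": "control", "jz": "control", "call": "control", "ret": "control",
}


def block_signature(block, in_degree, out_degree):
    """Summarize a block by its instruction-type counts and its CFG degrees."""
    counts = Counter(
        INSTRUCTION_TYPES.get(ins.split()[0], "other") for ins in block
    )
    return (tuple(sorted(counts.items())), in_degree, out_degree)


block_a = ["mov eax, 1", "mov ebx, 2", "cmp eax, ebx"]
block_b = ["mov edx, 2", "mov ecx, 1", "cmp ecx, edx"]  # renamed and reordered

print(block_signature(block_a, 1, 2))
# ((('arith', 1), ('data', 2)), 1, 2)
print(block_signature(block_a, 1, 2) == block_signature(block_b, 1, 2))  # True
```

Because the signature records only counts per type, operands and instruction order do not affect it.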
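The idea of comparing net effects while trying every permutation of register names can be sketched in Python. This is a toy under strong assumptions, not the theorem-prover method itself: the instruction set is limited to mov and add over three registers, each register's final value is tracked exactly as a linear expression over the initial register values, and flags, memory, and overflow are ignored. All function names are hypothetical.

```python
from itertools import permutations

REGS = ("eax", "ebx", "ecx")


def run(code):
    """Return the net effect of the code: each register's final value as a
    linear form {symbol: coefficient}; the symbol "1" is the constant term."""
    state = {r: {r + "0": 1} for r in REGS}  # eax starts as eax0, and so on
    for op, dst, src in code:
        val = {"1": src} if isinstance(src, int) else dict(state[src])
        if op == "mov":
            new = val
        elif op == "add":
            new = dict(state[dst])
            for sym, coef in val.items():
                new[sym] = new.get(sym, 0) + coef
        else:
            raise ValueError(f"unsupported instruction: {op}")
        state[dst] = {s: c for s, c in new.items() if c != 0}
    return state


def rename(code, mapping):
    """Consistently rename every register in the code."""
    return [
        (op, mapping[dst], src if isinstance(src, int) else mapping[src])
        for op, dst, src in code
    ]


def same_net_effect(a, b):
    """True if both fragments leave every register with the same final value."""
    return run(a) == run(b)


def similar_up_to_renaming(a, b):
    """Try every permutation of register names, as the text describes."""
    return any(
        same_net_effect(rename(a, dict(zip(REGS, perm))), b)
        for perm in permutations(REGS)
    )


reordered_1 = [("add", "eax", "ebx"), ("add", "eax", 1)]
reordered_2 = [("add", "eax", 1), ("add", "eax", "ebx")]
copy_1 = [("mov", "eax", "ebx")]
copy_2 = [("mov", "ebx", "eax")]
doubling = [("add", "eax", "eax")]

print(same_net_effect(reordered_1, reordered_2))     # True
print(same_net_effect(copy_1, copy_2))               # False
print(similar_up_to_renaming(copy_1, copy_2))        # True
print(similar_up_to_renaming(doubling, reordered_1)) # False
```

The two reordered fragments pass through different intermediate states but reach the same final state, which is the distinction the text draws between operational and denotational semantics. The number of permutations grows factorially with the number of registers, which hints at why the full approach is costly.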
This method of determining similar blocks using a theorem prover to determine if the semantics of two code fragments match does not produce any false positive errors. However, the method is computationally expensive, sometimes requiring over 30 minutes of computer time to determine similarity between two code sections. As a result, these methods are not practical for finding similar code sections between very large collections of programs.
Other known systems and methods call for breaking a larger portion of code into blocks and representing each block using an n-perm or n-gram. These methods are fast as they do not construct CFGs. They also are insensitive to register renaming and code reordering. These methods, however, produce extremely high rates of false positive and false negative errors, rendering them ineffective for large collections of programs.
Clearly, there is a demand for an improved system and methods for comparing two or more sets of code that are not sensitive to variations in code typically introduced by compilers, that are efficient regarding the time and resources necessary to conduct the comparison, and that minimize inaccuracy in matching or in the reported output results. The present invention satisfies these demands.
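The difference between n-grams and n-perms can be shown with a brief Python sketch over instruction mnemonics. Working on mnemonics alone, and scoring with Jaccard similarity, are assumptions made for illustration. The text does not specify either, and the function names are hypothetical.

```python
def ngrams(mnemonics, n):
    """Set of all length-n windows, with order inside each window preserved."""
    return {tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)}


def nperms(mnemonics, n):
    """Like n-grams, but each window is sorted so its internal order is ignored."""
    return {
        tuple(sorted(mnemonics[i:i + n]))
        for i in range(len(mnemonics) - n + 1)
    }


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


x = ["mov", "add", "cmp", "jz"]
y = ["add", "mov", "cmp", "jz"]  # first two instructions swapped

print(jaccard(ngrams(x, 2), ngrams(y, 2)))  # 0.2
print(jaccard(nperms(x, 2), nperms(y, 2)))  # 0.5
```

Dropping the operands makes both features indifferent to register names, and sorting each window makes n-perms more tolerant of local reordering than n-grams.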