1. Field of the Invention
The present invention relates to recognition of similar (or slightly modified) data objects, such as files containing variants of malicious objects, file portions and byte sequences (with or without using some preprocessing procedure), and more particularly, to a system and method for more rapid analysis of non-identical files/objects for the presence of malware variants.
2. Description of the Related Art
One of the most important goals of modern antivirus and anti-malware programs is detection and identification of malicious data objects. These can take the form of viruses, rootkits, worms, trojans, and the like. The most common mechanism for detecting such objects is through signature comparison, where bit patterns in the file being analyzed are compared with the bit patterns of known malicious objects. Such detection is only possible if the bitwise comparison is exact.
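The signature comparison described above can be sketched as a search for an exact byte pattern inside a buffer. This is a minimal illustration only, not the method of any particular antivirus product; the function name contains_signature and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the exact byte pattern `sig`
 * occurs anywhere in `data`, and 0 otherwise. The match is strictly
 * bitwise, so a single differing byte causes a miss. */
int contains_signature(const unsigned char *data, size_t data_len,
                       const unsigned char *sig, size_t sig_len)
{
    assert(data != NULL && sig != NULL);

    /* An empty or oversized signature can never match. */
    if (sig_len == 0 || sig_len > data_len)
        return 0;

    /* Slide the signature over every candidate offset. */
    for (size_t i = 0; i + sig_len <= data_len; i++) {
        if (memcmp(data + i, sig, sig_len) == 0)
            return 1;
    }
    return 0;
}
```

Because the comparison is exact, a signature taken from one variant is found only in files that contain those same bytes unchanged.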
A common problem faced by vendors of antivirus software is initial identification of a file that potentially, but only potentially, contains a malicious object. This is due to the fact that the same malicious object functionality can be achieved using a variety of methods, given the instruction sets of the most common processors, such as the Intel processor. For instance, consider the following source code fragment, written in C:

        int i=0, j=0;
        int main( )
        {
            if (i==1000)
                return i;
            else
                return j;
        }

and the following object code fragment, which is the C code fragment above compiled into Intel assembly language:

        _main  proc near
               mov   ecx, dword_414184
               mov   eax, 3E8h
               cmp   ecx, eax
               jz    short locret_401124
               mov   eax, dword_414180
        locret_401124:
               retn
        _main  endp

It will be readily seen that minor changes in the C source code fragment, which do not affect the functionality at all, will result in a different compiled object code fragment. For example, when the if-condition changes from “if (i==1000)” to “if (i==999)”, the C code fragment will compile into a different object code fragment. Also, the compiled object code fragment can be manipulated in numerous ways to disguise the malicious object. For example, swapping i and j in the source code will result in a different object code fragment without changing the functionality. The addition of NOP (no operation) instructions will also result in a different compiled object code fragment. Such NOP instructions can be liberally and randomly sprinkled throughout the object code, producing a vast variety of executable file signatures, all with the same functionality, since the NOP instructions do not actually do anything.
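The effect of NOP insertion on an exact comparison can be sketched as follows. The helpers insert_nop and exact_match are hypothetical names introduced for illustration, and the sketch assumes the single-byte Intel x86 NOP encoding (0x90) and an insertion offset that falls on an instruction boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define X86_NOP 0x90 /* single-byte NOP on Intel x86 */

/* Hypothetical helper: copies `src` into `dst` with one NOP byte
 * inserted before offset `pos`. `dst` must hold src_len + 1 bytes.
 * Returns the length of the new byte sequence. */
size_t insert_nop(const unsigned char *src, size_t src_len,
                  size_t pos, unsigned char *dst)
{
    assert(pos <= src_len);
    memcpy(dst, src, pos);                           /* bytes before the NOP */
    dst[pos] = X86_NOP;                              /* the inserted NOP     */
    memcpy(dst + pos + 1, src + pos, src_len - pos); /* remaining bytes      */
    return src_len + 1;
}

/* Hypothetical helper: exact, bitwise comparison of two byte sequences. */
int exact_match(const unsigned char *a, size_t a_len,
                const unsigned char *b, size_t b_len)
{
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

For instance, taking the bytes B8 E8 03 00 00 (the encoding of mov eax, 3E8h) followed by further instructions, and inserting a NOP at offset 5, yields a sequence that behaves identically when executed but for which exact_match against the original returns 0.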
There are various other mechanisms for disguising malicious objects, such as rearranging the order of a handful of instructions in the compiled object code where the order of execution of these instructions does not matter. Other mechanisms include using different registers in the object code (for example, in the Intel architecture, using a register other than the AX register, which may be used by the standard compiler).
Furthermore, after manipulation of the compiled object code, and in some cases of the binary executable code, there might remain virtually no portions of code of sufficient length in the binary executable to make meaningful comparisons for purposes of virus signature detection.
In sum, the signature method of detection of malicious object presence is relatively unstable, since trivial and nonfunctional variations in the malicious code result in different signatures. The practical consequence of this is that laborious manual intervention is required by an analyst to detect the appearance of a new variant of a virus or malicious object. Such manual labor can be relatively substantial, even when the ultimate conclusion is that the malicious object at issue is already available in the vendor's database, often in numerous variants.
Examples of conventional file comparison techniques in the field of anti-virus applications may be found, e.g., in U.S. Pat. No. 6,990,600, U.S. Pat. No. 6,738,932, U.S. Pat. No. 5,995,982, U.S. Pat. No. 6,021,491, U.S. Pat. No. 5,878,050, and U.S. Patent Publication No. 2002/010459. However, all of these approaches rely on exact comparisons of the objects at issue, so even a slightly modified variant cannot be detected by these conventional approaches.
Accordingly, there is a need in the art for a system and method for detection of similar data objects and files, and particularly malicious objects, that is sufficiently stable to recognize new variants of existing malicious objects, notwithstanding nonfunctional or cosmetic changes to the files being compared.