1. Technical Field
The present invention generally relates to an apparatus and method for searching for similar malicious code based on malicious code feature information and, more particularly, to an apparatus and method that automatically analyze samples suspected to be malicious, check for similarities between the suspected samples and existing malicious samples, and search for the most similar malicious samples.
2. Description of the Related Art
Over the past 10 years, the amount of malicious code (malware) discovered every day has increased rapidly, from fewer than 10 discoveries per day on average to 3000 or more discoveries per day on average at present.
However, it is known that most malicious code that is discovered is not a new type of malicious code, but is variant malicious code created by adding some functions to existing malicious code or by artificially forging existing malicious code so as to avoid antivirus scanning.
In particular, a large amount of variant malicious code, which has functions similar or identical to those of existing malicious code but differs from it in format, has appeared for reasons such as the use of automatic malicious code production tools, the reuse of existing malicious code, or the application of deformation techniques for avoiding scanning.
If all inflowing malicious code is processed in the usual way, the functions of each piece of malicious code must be analyzed anew, and new antivirus detection patterns must be developed and applied to antivirus software. This deteriorates antivirus performance and requires excessive analysis time.
Therefore, to effectively cope with the increasing amount of malicious code, inflowing malicious code must be classified into new types of malicious code and variant malicious code. When malicious code is determined to be a new type, it must be newly analyzed and processed in detail. When malicious code is determined to be a variant, the differences from the existing malicious code must be analyzed so that the previously processed portions and the remaining portions can be identified and only the remaining portions need to be additionally processed. Further, the results of analysis and processing must be stored and used in preparation for malicious code that will appear in the future.
A technique for calculating similarities between a new malicious code sample and existing analyzed malicious code may be performed in the sequence of normalization, comparison factor extraction, and comparison and analysis of the extracted comparison factors. Here, comparison factor extraction may be classified into a dynamic extraction scheme and a static extraction scheme. The dynamic extraction scheme uses pieces of behavioral information, which appear when malicious code is executed in an emulator, as the comparison factors required for similarity calculation. In contrast, the static extraction scheme either extracts the list of Application Programming Interface (API) functions present in the Import Address Table (IAT) and uses the API list as a comparison factor, or extracts character strings and uses them as comparison factors. There is also research into technology that extracts a Control Flow Graph (CFG) relationship between the functions of malicious code and uses it as a comparison factor.
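As a rough illustration only, the sequence of normalization, comparison factor extraction, and comparison described above might be sketched as follows for the static case in which an API list is the comparison factor. The use of Jaccard similarity, the simple lowercase normalization, and the names normalize, jaccard, and most_similar are assumptions made for this sketch and are not taken from any of the techniques described here; the API lists are assumed to have already been extracted from the IAT by a separate tool.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple


def normalize(api_names: Iterable[str]) -> FrozenSet[str]:
    """Hypothetical normalization: trim whitespace, lowercase, drop duplicates."""
    return frozenset(n.strip().lower() for n in api_names if n.strip())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (an assumed metric); 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(
    sample_apis: Iterable[str],
    known_samples: Dict[str, Iterable[str]],
) -> Optional[Tuple[str, float]]:
    """Return (name, score) of the known sample closest to the suspected sample."""
    target = normalize(sample_apis)
    best: Optional[Tuple[str, float]] = None
    for name, apis in known_samples.items():
        score = jaccard(target, normalize(apis))
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Illustrative data only; these are not real analysis results.
known = {
    "family_a": ["CreateFileW", "WriteFile", "RegSetValueExW"],
    "family_b": ["socket", "connect", "send"],
}
suspect = ["createfilew", "WriteFile ", "RegSetValueExW", "Sleep"]
result = most_similar(suspect, known)
```

In this sketch, a score close to 1 would suggest a variant of the matched sample and a score close to 0 would suggest a new type, although the threshold separating the two cases would have to be chosen empirically.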
In this way, as the amount of malicious code has rapidly increased, research into automatic malicious code analysis, in which a large number of malicious samples are analyzed automatically, has been actively conducted. In particular, since many pieces of recently detected malicious code are determined to be variants of existing malicious samples, demand has also increased for a system that, when automatically analyzing malicious code, determines whether the malicious code is a variant of existing malicious code and whether its producers are the same as those of the existing malicious code.
As related preceding technology, Korean Patent Application Publication No. 2011-0088042 discloses technology that can automatically classify and distinguish new malicious code even without analyzing all malicious code samples, the number of which is exponentially increasing.
As another related preceding technology, technology for statically and automatically analyzing malicious code and determining whether samples are malicious was published on Jun. 2-3, 2012, in the paper entitled “NOA: An Information Retrieval Based Malware Detection System” (by Igor Santos and three others, in Computing and Informatics, Vol. 32, No. 1).