Malware refers to computer programs capable of causing harm by means of a computer system. Examples include network worms, keyboard spies (keyloggers), and computer viruses.
Various antivirus technologies are used to protect users and their personal computers from malware. Antivirus software can combine several computer threat detection techniques, of which signature-based detection and heuristic detection are well-known examples. However, there are situations in which such techniques are not effective. Present-day malicious software is often designed with features that counter software emulation (for example, the use of undocumented functions, checks on function behavior such as whether certain processor flags are set after a function completes, or analysis of returned error codes), and is often processed with polymorphic packagers. Here, packaging an executable file means compressing the file and adding unpacking instructions to the body of the file. When a polymorphic packager is used, several variants of the same malicious file are generated that are identical in functionality but differ in structure. In this case, signature-based detection is inefficient, because a unique signature is needed for each file in the set of polymorphic copies.
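By way of illustration only, the effect of polymorphic packaging on signature-based detection can be sketched as follows. The sketch assumes a toy packager that XOR-encodes a payload with a one-byte key (standing in for real compression and an unpacking stub) and a signature that is simply a SHA-256 hash of the whole file. The names `pack`, `unpack`, and `signature` are hypothetical and are not taken from any product or publication.

```python
import hashlib


def pack(payload: bytes, key: int) -> bytes:
    """Toy packager (assumption): store a one-byte key, then the XOR-encoded payload."""
    if not 0 <= key <= 255:
        raise ValueError("key must fit in one byte")
    return bytes([key]) + bytes(b ^ key for b in payload)


def unpack(packed: bytes) -> bytes:
    """Reverse of pack(): read the key from the first byte and decode the rest."""
    key = packed[0]
    return bytes(b ^ key for b in packed[1:])


def signature(data: bytes) -> str:
    """Simplified signature (assumption): a hash over the entire file contents."""
    return hashlib.sha256(data).hexdigest()


# One payload, three structurally different copies produced with different keys.
payload = b"identical functionality in every copy"
copies = [pack(payload, key) for key in (0x11, 0x5A, 0xC3)]
signatures = {signature(copy) for copy in copies}

# Every copy unpacks to the same payload, yet each copy has its own signature.
same_payload = all(unpack(copy) == payload for copy in copies)
```

Each copy decodes to the same payload, but the set `signatures` contains one distinct value per copy, which is why a separate signature would be required for every polymorphic variant.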
An alternative method for detecting malicious software considers various metadata of a file during its analysis. The metadata can vary depending on the type of the file. For an executable file, for instance, such metadata can include the positions of the file's sections, the sizes of the sections, other information from the header of the executable file, or string data extracted from the file. Derived data can also be treated as metadata, such as information entropy (a measure of the uncertainty of the file's contents) or the frequency characteristics of machine instructions or of certain bytes. Given a sufficiently large set of files and file attributes, a classification system can be formed which, once trained on a specified set of files, is able to assign the set of attributes extracted from a file to a category of files, for example, the category of malicious software files.
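As a minimal sketch of two of the derived attributes mentioned above, the following computes the relative frequency of each byte value and the Shannon entropy, in bits per byte, of a byte string. It assumes the file contents are already available in memory as `bytes`; the function names `byte_frequencies` and `shannon_entropy` are hypothetical.

```python
import math
from collections import Counter


def byte_frequencies(data: bytes) -> dict[int, float]:
    """Relative frequency of each byte value that occurs in data."""
    if not data:
        return {}
    total = len(data)
    return {byte: count / total for byte, count in Counter(data).items()}


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, at most 8.0."""
    return sum((-p * math.log2(p) for p in byte_frequencies(data).values()), 0.0)


# Constant data carries no uncertainty; uniformly distributed bytes carry the maximum.
low = shannon_entropy(b"\x00" * 1024)
high = shannon_entropy(bytes(range(256)) * 4)
```

Compressed or encrypted content tends toward the upper end of this range, which is one reason entropy is commonly considered as an attribute for packaged files.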
There are known solutions that detect malicious software based on metadata extracted from software files. For example, the invention described in U.S. Pre-Grant Pub. No. 2009/0300761 proposes a method for detecting malicious software files by analyzing file metadata, in particular, information on the packager or unique strings. In this publication, sequences of pseudocode instructions are generated on a per-file basis, and the frequency characteristics of the instructions can be used as metadata. From the metadata extracted from the file, an “intelligent hash” is generated, that is, a hash representing a string that stores the set of file attributes needed for detecting malicious software files. The invention described in U.S. Pre-Grant Pub. No. 2010/0192222 proposes additionally using the results of emulation and behavior analysis as metadata.
An important requirement for detecting malicious files based on extracted metadata is minimizing false positives. In known approaches, a file's degree of similarity to a pre-grouped set of files is calculated using various measures of similarity, or “distances,” between the file's metadata and the metadata of the file group. This approach places a significant load on the computer network infrastructure that provides the data connection between a user's computer and a file metadata database containing pre-generated sets of files, particularly in cases where large sets of file metadata are to be transmitted over the computer network for analysis.
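For illustration, one possible "distance" between a file's metadata and the metadata of a file group can be sketched as below. The sketch assumes that metadata is represented as a set of string attributes and uses the Jaccard distance; the known approaches are not stated to use this particular measure, and the attribute strings and group names are hypothetical.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance: 0.0 for identical sets, 1.0 for disjoint sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def nearest_group(file_attrs: set[str], groups: dict[str, set[str]]) -> tuple[str, float]:
    """Return the name of the closest group and the distance to it."""
    distances = ((name, jaccard_distance(file_attrs, attrs)) for name, attrs in groups.items())
    return min(distances, key=lambda pair: pair[1])


# Hypothetical pre-grouped metadata, as it might be held in a metadata database.
groups = {
    "malicious": {"packager:example", "section:.xyz", "string:cmd.exe /c", "entropy:high"},
    "clean": {"section:.text", "section:.rsrc", "string:Copyright", "entropy:medium"},
}

# Hypothetical attributes extracted from the file under analysis.
file_attrs = {"packager:example", "section:.text", "string:cmd.exe /c", "entropy:high"}

group_name, distance = nearest_group(file_attrs, groups)
```

In this arrangement the complete attribute set of each analyzed file has to reach the side that holds the group metadata, which reflects the network load described above when the sets are large.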
Although the above-described approaches are aimed at solving certain tasks in the field of protection against computer threats, they have the drawback that they do not achieve the desired quality of malicious file detection. More generally, there is a need for an effective and efficient solution for automatically detecting malware using file classification methods.