In recent years, various methods have been proposed for determining whether an executable file used on an OS such as MS-Windows®, Apple OS-X, Linux®, or Unix® is malware. For example, an antivirus system that determines whether an executable file used in MS-Windows, Unix, and the like is malware uses two methods: dynamic determination, which performs determination by executing the executable file, and static determination, which performs determination without executing the executable file. When high speed is particularly required, static determination is used. Representative methods of static determination are hash value matching determination and pattern matching determination (signature scanning). In hash value matching determination, hash values of existing malware, such as MD5 (Message Digest Algorithm 5), SHA-1 (Secure Hash Algorithm 1), and SHA-256 values, are registered in a database beforehand, and if the hash value of an executable file to be determined matches a hash value registered in the database, the executable file is determined to be malware. In pattern matching determination, specific character strings and byte codes included in existing malware are registered in a database beforehand, and if an executable file to be determined includes any of the registered character strings or byte codes, the executable file is determined to be malware. These methods have the advantage that the erroneous detection rate (the rate of erroneously determining an executable file that is not malware to be malware) is low. However, it is difficult for them to detect a subtype of malware obtained by altering existing malware, or a new type of malware.
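As an illustration only, the two static determination methods described above can be sketched as follows. The database contents, the byte patterns, and all function names are hypothetical and are not taken from any actual antivirus system; SHA-256 is chosen here merely as one of the hash functions mentioned.

```python
import hashlib

# Hypothetical databases for illustration; a real system would hold
# digests and signatures collected from existing malware.
KNOWN_MALWARE_SHA256 = {
    hashlib.sha256(b"example malicious payload").hexdigest(),
}
KNOWN_BYTE_PATTERNS = [b"EVIL_MARKER", b"\xde\xad\xbe\xef"]


def matches_known_hash(data: bytes) -> bool:
    """Hash value matching: the whole file must match a registered digest."""
    return hashlib.sha256(data).hexdigest() in KNOWN_MALWARE_SHA256


def matches_known_pattern(data: bytes) -> bool:
    """Pattern matching: any registered byte sequence occurs in the file."""
    return any(pattern in data for pattern in KNOWN_BYTE_PATTERNS)


def is_malware_static(data: bytes) -> bool:
    """Static determination: the file is inspected but never executed."""
    return matches_known_hash(data) or matches_known_pattern(data)
```

In this sketch, appending a single byte to a registered file changes its digest, so `matches_known_hash` no longer reports it. This corresponds to the difficulty of detecting altered subtypes noted above.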
Therefore, heuristic determination has been proposed as a method of determining whether an executable file to be determined is a subtype or a new type of malware. This method defines the likelihood that a file is malware based on past experience, and performs determination according to that definition.
As heuristic determination, various methods using machine learning techniques have been proposed. In the technique described in Patent Literature 1, readable character strings included in executable files are learned beforehand, and the likelihood that an executable file to be determined is malware is determined based on how many words frequently used in malware are included in that file.
In machine learning, an executable file (teacher data) is first converted into sets of parameters, and learning is then performed based on a machine learning algorithm. Such a set of parameters is referred to as a “feature vector”, or simply as a “feature”, and the number of parameters included in the set is referred to as the “feature vector dimension”. Conversion of the executable file into the feature vector is referred to as “feature extraction”. As an example, in the technique described in Patent Literature 1, the feature vector is a set of pairs, each consisting of a word and the number of appearances of that word, and the number of word types is the feature vector dimension.
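A minimal sketch of word-count feature extraction of the kind described above is given below. It is not the method of Patent Literature 1 itself: the vocabulary, the minimum word length, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List

# Hypothetical vocabulary; the number of word types it contains is the
# feature vector dimension (4 in this sketch).
VOCABULARY = ["connect", "download", "keylog", "registry"]


def extract_words(data: bytes, min_len: int = 4) -> List[str]:
    """Return lowercased runs of ASCII letters of at least min_len characters."""
    pattern = re.compile(("[A-Za-z]{%d,}" % min_len).encode("ascii"))
    return [match.decode("ascii").lower() for match in pattern.findall(data)]


def extract_features(data: bytes) -> List[int]:
    """Feature extraction: count the appearances of each vocabulary word."""
    counts = Counter(extract_words(data))
    return [counts[word] for word in VOCABULARY]
```

Each position of the returned list corresponds to one word of `VOCABULARY`, so enlarging the vocabulary directly enlarges the feature vector dimension.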
The determination accuracy does not always increase as the feature vector dimension becomes larger; on the contrary, it may become worse. This phenomenon is known as “the curse of dimensionality” (Non Patent Literatures 1 and 2). The technique described in Patent Literature 3 attempts to perform malware determination by machine learning using the PE (Portable Executable) header information of an executable file. It is described that the feature vector dimension is decreased by a method referred to as “dimensional compression”, and that better determination accuracy can thereby be acquired. A method frequently used for dimensional compression is, for example, principal component analysis. This method automatically combines correlated features into one feature. For example, human body height and weight are generally roughly proportional to each other, and thus these two features can be combined into one feature, which in this example can be defined as body size. However, it is normally difficult to give such a definition to a combined feature.
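The height and weight example can be sketched as principal component analysis restricted to two features, where the leading eigenvector of the 2x2 covariance matrix has a closed form. This is an illustrative sketch only, not the procedure of Patent Literature 3; the function name and the sample measurements are made up.

```python
import math


def first_principal_component(points):
    """Combine two correlated features into one score per sample."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    lam = (a + c) / 2 + math.hypot((a - c) / 2, b)
    # Matching eigenvector; fall back to a coordinate axis when b == 0.
    if b != 0:
        vx, vy = b, lam - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # Project each centered sample onto the principal direction.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores


# Made-up (height in cm, weight in kg) samples, for illustration only.
samples = [(150, 45), (160, 55), (170, 65), (180, 75), (190, 85)]
direction, body_size = first_principal_component(samples)
```

Here `body_size` holds one value per sample in place of the original two, which reduces the feature vector dimension from 2 to 1. With real data and many features, the combined feature rarely has such an obvious interpretation.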