Malicious program files account for a large proportion of advanced persistent threat (APT) attacks on enterprises. Therefore, in defending against APTs, identifying and discovering a malicious program file in time is highly important. However, a large quantity of program files exists in the intranet of a large or medium-sized enterprise, and deep analysis of the program files one by one requires a huge workload and is almost impossible. To resolve this problem, other approaches propose a method for classifying the program files in an enterprise based on the behaviors of the program files at runtime. Compared with analysis of each individual program file, analysis of each category of program files helps reduce the workload to some extent.
Further, classification of program files in an enterprise uses behaviors of the program files at runtime as feature vectors for machine clustering and classification, so that a large quantity of program files in the enterprise is classified into several program file categories. Therefore, by randomly sampling several program files from each category for deep analysis, an analyst can determine whether the program files of that category are suspicious. In addition, minor program files, that is, program files that are unclassified, not easy to classify, or in a category including few program files, can be discovered in time such that analysis is focused on these minor program files and a malicious program file is discovered in time. In this way, the workload of analyzing a mass of program files can be effectively reduced, and analysis efficiency can be improved.
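The workflow described above can be illustrated with a minimal sketch. The greedy clustering rule, the Jaccard similarity measure, the threshold values, and all file and behavior names below are assumptions chosen for illustration only; the other approaches are not stated to use any of them.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two sets of behavior features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_behavior(behaviors, threshold=0.5):
    """Greedy single-pass clustering (an illustrative stand-in for
    machine clustering): a file joins the first category whose first
    member is similar enough, otherwise it starts a new category."""
    clusters = []
    for name, feats in behaviors.items():
        for members in clusters:
            if jaccard(feats, behaviors[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def triage(clusters, sample_size=2, min_size=2, seed=0):
    """Pick random samples from each sizable category and collect the
    files of small categories ("minor program files") for direct review."""
    rng = random.Random(seed)
    to_sample, minor = {}, []
    for index, members in enumerate(clusters):
        if len(members) < min_size:
            minor.extend(members)
        else:
            to_sample[index] = rng.sample(members, min(sample_size, len(members)))
    return to_sample, minor


# Hypothetical runtime behaviors observed for five program files.
behaviors = {
    "updater_a.exe": {"net_connect", "file_write", "reg_read"},
    "updater_b.exe": {"net_connect", "file_write", "reg_read", "proc_create"},
    "editor_a.exe": {"file_read", "file_write", "ui_window"},
    "editor_b.exe": {"file_read", "file_write", "ui_window", "print"},
    "odd_tool.exe": {"reg_write", "proc_inject"},
}

clusters = cluster_by_behavior(behaviors)
samples, minor_files = triage(clusters)
print(clusters)
print(minor_files)
```

In this toy data set, the two updaters and the two editors each form a category from which the analyst samples, while `odd_tool.exe` falls into a category of its own and is surfaced as a minor program file.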
In the other approaches, classification of the program files directly uses behavior information, such as string information, in behavior sequences of the program files to form feature vectors for calculation. Because parameter information, such as paths, in various types of behavior information is highly random, the feature vectors differ greatly even for similar program files, and the classification effect is poor. Consequently, the similarity between behavior information cannot be effectively used for clustering and classification.
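The effect of random parameter information can be seen in a small sketch. The behavior strings, the paths, and the use of Jaccard similarity over raw strings are hypothetical and serve only to illustrate the problem; they are not taken from any particular approach.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of raw behavior strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical runs that perform the same actions, but whose path
# parameters (user name, temporary file name, registry value) are random.
run_a = [
    r"CreateFile C:\Users\alice\AppData\Local\Temp\tmp3f9a.dat",
    r"WriteFile C:\Users\alice\AppData\Local\Temp\tmp3f9a.dat",
    r"RegSetValue HKCU\Software\Run\upd_7c21",
]
run_b = [
    r"CreateFile C:\Users\bob\AppData\Local\Temp\tmp81e2.dat",
    r"WriteFile C:\Users\bob\AppData\Local\Temp\tmp81e2.dat",
    r"RegSetValue HKCU\Software\Run\upd_02bd",
]

# Similarity computed directly on the raw behavior strings.
similarity = jaccard(run_a, run_b)

# The action names alone (the text before the first space) are identical.
actions_match = (
    [s.split(" ", 1)[0] for s in run_a] == [s.split(" ", 1)[0] for s in run_b]
)

print(similarity)     # 0.0
print(actions_match)  # True
```

Although both runs carry out the same sequence of actions, no raw string is shared between them, so a feature vector built directly from the strings treats the two runs as entirely dissimilar.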