Malicious software (malware) is software that is installed and run on a user's computer or other terminal without clear notice to the user or approval from the user, and that infringes the user's legitimate rights and interests. It is one of the main threats to information security. In recent years, the number of variants of malware families has increased dramatically. According to the Internet Security Threat Report issued by Symantec Corporation, 31.7 million new malware variants appeared in 2014 and 43.1 million in 2015, a year-on-year growth of 36%. Manual classification clearly cannot cope with data on this scale, and automatic malware classification has become a research hotspot.
Research on malware classification mainly covers four aspects: feature extraction, feature representation, the selection and optimization of the clustering algorithm, and the evaluation of clustering results. Yanfang et al. extracted the instruction sequences and instruction frequencies of samples through static analysis and combined tf-idf weighting with k-medoids clustering to perform the classification (Automatic malware categorization using cluster ensemble [A]. ACM, 2010: 95-104). Cesare et al. used information entropy to test whether a malware sample had been packed and unpacked the packed samples. They then extracted the control flow graph from the resulting assembly code as the sample feature and classified the malicious code with an approximate graph matching algorithm (An effective and efficient classification system for packed and polymorphic malware [J]. IEEE Transactions on Computers, 2013, 62(6): 1193-1206). Xiaolin Xu et al. implemented an online automatic analysis model for massive malicious code based on feature clustering. The model consists of three parts: feature space construction, automatic feature extraction and fast clustering analysis. The feature space construction part proposes a statistics-based heuristic method for building the code feature space. The automatic feature extraction part proposes a feature vector description of a sample composed of API behavior and code sections. The fast clustering analysis part proposes a fast nearest-neighbor clustering algorithm based on locality-sensitive hashing (LSH) (Online Analytical Model of Massive Malicious Code Based on Feature Clustering [J]. Journal on Communications, 2013, 34(8): 147-153).
Ahmad Azab et al. calculated fuzzy hash values of binary files and used the K-NN algorithm for clustering. Through experimental comparison, they found that the fuzzy hash values generated by TLSH (Trend Locality Sensitive Hash) performed better (Mining Malware To Detect Variants. IEEE Computer Society [J], 2014: 44-53). Guanghui Liang et al. divided program activities into six kinds: file operations, program behavior, registry behavior, network behavior, service behavior and acquisition of system information. They used a six-tuple (type, name, input parameter, output parameter, returned value, next call) to describe each behavior node and finally built a behavior dependency chain. They then calculated the Jaccard distance between samples as the similarity measure for clustering (A Behavior-Based Malware Variant Classification Technique. International Journal of Information and Education Technology [J], 2016, 6(4): 291-295).
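As an illustration of the Jaccard distance mentioned above, the following is a minimal sketch and not the implementation of the cited work. It assumes each sample is reduced to a set of simplified behavior records; the two-field record (type, name) and the sample contents are hypothetical, whereas the six-tuple described above carries more fields.

```python
from typing import Set, Tuple

# Hypothetical simplified behavior record: (type, name).
# The six-tuple in the text also holds parameters, return value and next call.
Behavior = Tuple[str, str]


def jaccard_distance(a: Set[Behavior], b: Set[Behavior]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Two made-up samples that share two of four distinct behaviors.
sample_a = {
    ("file", "CreateFileW"),
    ("registry", "RegSetValueExW"),
    ("network", "connect"),
}
sample_b = {
    ("file", "CreateFileW"),
    ("registry", "RegSetValueExW"),
    ("service", "CreateServiceW"),
}

print(jaccard_distance(sample_a, sample_b))  # 0.5
```

A distance of 0 means the two behavior sets are identical and a distance of 1 means they share nothing, so a clustering algorithm can use the value directly as a dissimilarity.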
Taken together, these methods have the following defects. First, feature extraction is not comprehensive enough: it does not combine dynamic and static analysis to exploit the advantages of each. Feature representation either relies too heavily on manual effort or prunes and reduces the features by statistical means. Meanwhile, because the feature dimension is too high, clustering is slow. Second, regarding the choice of clustering algorithm, partition-based K-MEANS clustering cannot identify noise and cannot find clusters of arbitrary shape, while the K-NN algorithm requires manually labeled training samples. Finally, regarding the evaluation of clustering quality, judging a clustering result by accuracy and purity alone is incomplete. The result should also be assessed in terms of the number of clusters, the number of individuals within each cluster, the degree of match with the actual samples, and so on.
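The last point can be illustrated with a small sketch under stated assumptions: the function name, the cluster assignments and the family labels below are made up for illustration and are not taken from any of the cited works. Purity is computed here as the fraction of samples that belong to the majority family of their cluster.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], family_labels: Sequence[str]) -> float:
    """Fraction of samples that fall in the majority family of their cluster."""
    if len(cluster_ids) != len(family_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster: Dict[Hashable, Counter] = {}
    for cid, family in zip(cluster_ids, family_labels):
        per_cluster.setdefault(cid, Counter())[family] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical ground truth: six samples from two families.
families = ["A", "A", "A", "B", "B", "B"]

# A plausible clustering with one misplaced sample.
mixed = [0, 0, 0, 0, 1, 1]
# A degenerate clustering that puts every sample in its own cluster.
singletons = [0, 1, 2, 3, 4, 5]

for name, ids in (("mixed", mixed), ("singletons", singletons)):
    print(name, "clusters:", len(set(ids)), "purity:", round(purity(ids, families), 3))
```

In this made-up case the degenerate clustering reaches a purity of 1.0 with six clusters for two families, while the more useful clustering scores lower. This is why the number of clusters and the cluster sizes need to be reported alongside purity.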