Malware, such as viruses, worms, and Trojans, is software that damages users' computers and infringes on their legitimate rights and interests without their permission. In recent years, malware has spread widely and largely unchecked, seriously affecting users' work and daily life. According to a research report by the domestic security vendor 360, a total of 324 million new malicious program samples appeared in 2014, an average daily increase of 888 thousand. Malicious program attacks were intercepted 57.27 billion times, an average of 157 million interceptions per day.
The Linux operating system is completely open: anyone can obtain the source code and carry out secondary development. After years of development, it has become a mature and complete system, and a growing number of individual users choose it for daily use and development. As Linux has become more widely used, more and more hackers have turned their attention to the platform, and malware for Linux has gradually increased. In the past, people believed that Linux was very safe and that Linux malware did not exist. This belief has gradually been overturned, and the security problem of the Linux platform has become increasingly serious.
Research on malware detection for the Linux platform remains limited and is mainly based on feature codes (signatures). The traditional feature-code-based detection method extracts the feature codes of known malware to build a feature code database, then scans the software under test and compares its content against the feature codes in the database to reach a detection conclusion. This method is feasible and effective for known malware, so it is widely used in existing anti-virus software, and its current development focuses on improving the accuracy of feature codes and the detection speed. However, given how malicious programs are now developing, its weaknesses are becoming more pronounced: it cannot detect new malware, detection lags behind the appearance of new samples, and the feature database must be updated constantly.
Some newer malware detection methods do not use feature codes and instead detect malware by comparing behavior features or the header information of the software. All of these methods build an index set by mining local information from malware and then use the index set to classify software, but they still have deficiencies. In behavior-feature-based detection, obfuscated and polymorphic malware has no fixed local features, so comparing it against the index set rarely yields an accurate result, and the accuracy for such malware is low. In header-information-based detection, the index consists of descriptive information about the software, which cannot accurately reflect the software's behavior. Experienced malware developers can easily modify and obfuscate this information, which significantly reduces the effectiveness of the method.
The Contents of the Invention
The technical problem to be solved by the present invention is to provide a detection method for Linux platform malware that uses machine learning to detect malicious software, thereby addressing the problems of the code-feature-based detection method: its inability to detect new or unknown malware, the size of the feature database, the growth of feature matching time, and the need for constant updates.
To solve the above technical problem, the present invention adopts the following technical solution:
A detection method for Linux platform malware includes the following steps:
Step 1: In the Linux operating system, use the objdump -D command to disassemble benign software samples and malware samples in ELF format to generate assembly files;
Step 2: Traverse the generated assembly files one by one, read the code segment of each ELF file, and at the same time identify whether the code segment contains a main( ) function;
Step 3: Analyze the code segment read in step 2. Starting from the entry address of the main( ) function if the code segment contains one, and otherwise from the entry address of the code segment, traverse all assembly instructions and divide the assembly code into basic blocks in ascending address order. Each basic block is identified by its lowest address and is added to the adjacency linked list as a vertex of the control flow graph;
Step 4: Analyze the code segment read in step 2 again. Starting from the entry address of the main( ) function if the code segment contains one, and otherwise from the entry address of the code segment, analyze each branch and jump instruction sequentially and recursively, ignoring indirect jump and branch instructions. Determine the target address of each branch and jump instruction, establish the relations between basic blocks, and add the edges of the control flow graph to the adjacency linked list while determining the type of each basic block, so as to generate a basic control flow graph in ascending address order according to the construction rules of the control flow graph;
Step 5: Extract the features of the control flow graph generated in step 4, and write all the features extracted from the samples into an ARFF file;
Step 6: Take the ARFF file generated in step 5 as the data set for the machine learning tool Weka, and carry out data mining using the decision-tree-based C4.5 algorithm, the RandomForest algorithm, the IBk lazy classification algorithm, or a NaiveBayes algorithm, with m-fold cross-validation used to generate the training sets and the decision tree; choose the algorithm with the best classification performance to construct a classifier, and use the constructed classifier to classify the samples to be tested;
Step 7: Construct a control flow graph for each ELF sample to be tested, extract the features of the control flow graph, and write them into an ARFF file. This file is used as the input of the classifier constructed in step 6, and the output of the classifier is the classification result.
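The ARFF output used in steps 5 and 7 can be illustrated with a minimal Python sketch. The function name to_arff, the class labels benign and malware, and the attribute names in the usage note are assumptions made for illustration; the source does not specify them. Unlabeled samples to be tested are written with the ARFF missing-value marker "?".

```python
def to_arff(relation, feature_names, rows):
    """Render numeric feature rows as ARFF text.

    rows is a list of (values, label) pairs. The labels 'benign' and
    'malware' are hypothetical class names; a label of None marks a
    sample to be tested and is written as the ARFF missing value '?'.
    """
    lines = ["@RELATION " + relation, ""]
    for name in feature_names:
        lines.append("@ATTRIBUTE " + name + " NUMERIC")
    lines.append("@ATTRIBUTE class {benign,malware}")
    lines.append("")
    lines.append("@DATA")
    for values, label in rows:
        if len(values) != len(feature_names):
            raise ValueError("feature count does not match attribute count")
        if label not in ("benign", "malware", None):
            raise ValueError("unknown class label: %r" % (label,))
        cells = [str(v) for v in values]
        cells.append("?" if label is None else label)
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"
```

For example, to_arff("cfg", ["num_vertices", "num_edges"], [([5, 6], "benign")]) yields a header with two numeric attributes and one nominal class attribute, followed by the data row 5,6,benign.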
According to the above proposal, step 4 further includes supplementing and repairing the generated control flow graph.
According to the above proposal, the partition rules of the basic blocks in step 3 are:
The program entry address starts a basic block;
The target address of a direct jump or branch instruction starts a basic block, and, in ascending address order, the address of the next non-null-operation (non-NOP) instruction after a jump or branch instruction also starts a basic block;
Indirect jump instructions are ignored, as are direct jump and branch instructions whose target address is their own address.
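The partition rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the listing is x86 output of objdump -D in the default AT&T syntax, any mnemonic starting with "j" or "call" is treated as a jump or branch instruction, and the function names parse_instructions, direct_target, and find_leaders are hypothetical.

```python
import re

# Matches objdump lines such as "  401005:\teb 05    \tjmp    40100c <main+0xc>".
LINE_RE = re.compile(r"^\s*([0-9a-f]+):\t[0-9a-f ]+\t(\S+)\s*(.*)$")


def parse_instructions(listing):
    """Return (address, mnemonic, operand) tuples from an objdump listing."""
    instrs = []
    for line in listing.splitlines():
        m = LINE_RE.match(line)
        if m:
            instrs.append((int(m.group(1), 16), m.group(2), m.group(3).strip()))
    return instrs


def direct_target(operand):
    """Return the target of a direct transfer, or None for an indirect one."""
    m = re.match(r"([0-9a-f]+)(\s|$)", operand)
    return int(m.group(1), 16) if m else None


def is_transfer(mnemonic):
    # Assumption: x86 mnemonics; jumps start with "j", calls with "call".
    return mnemonic.startswith("j") or mnemonic.startswith("call")


def find_leaders(instrs, entry):
    """Apply the partition rules and return sorted basic block start addresses."""
    instrs = sorted(instrs)
    leaders = {entry}  # the program entry address starts a block
    for i, (addr, mnemonic, operand) in enumerate(instrs):
        if not is_transfer(mnemonic):
            continue
        target = direct_target(operand)
        if target is None or target == addr:
            continue  # ignore indirect and self-targeting instructions
        leaders.add(target)  # the target address starts a block
        for next_addr, next_mnemonic, _ in instrs[i + 1:]:
            if not next_mnemonic.startswith("nop"):
                leaders.add(next_addr)  # next non-NOP instruction starts a block
                break
    return sorted(leaders)
```

The returned addresses are the lowest addresses of the basic blocks, which is how step 3 identifies each vertex.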
According to the above proposal, the construction rules of the control flow graph in step 4 are:
Each basic block is a vertex of the control flow graph and is identified by its entry address; the edges of the graph carry no weights;
Each direct jump and branch instruction is represented by directed edges in the control flow graph;
For an unconditional direct jump or branch instruction, a directed edge is established from the basic block where the instruction is located to the basic block identified by the target address. For a conditional jump or branch instruction, two directed edges are established;
The basic block where a return instruction is located has a directed edge pointing to the basic block that contains the instruction following the jump instruction corresponding to that return instruction;
For a recursive call, only one directed edge is added, pointing from the basic block to itself.
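A minimal Python sketch of these construction rules is given below. Several points are assumptions rather than part of the source: instructions are given as hypothetical (address, kind, target) tuples with kind in "other", "jmp", "jcc", "call", or "ret"; the two edges of a conditional instruction are taken to be the target and the following instruction; every direct call target is treated as a function entry; and a return instruction is attributed to the nearest function entry at or below its address. A dictionary of sets stands in for the adjacency linked list.

```python
from bisect import bisect_right


def containing(starts, addr):
    """Return the largest start address <= addr, or None if there is none."""
    i = bisect_right(starts, addr) - 1
    return starts[i] if i >= 0 else None


def build_cfg(instrs, leaders):
    """Build {block: sorted successor blocks} from the construction rules."""
    instrs = sorted(instrs)
    leaders = sorted(leaders)
    adj = {leader: set() for leader in leaders}  # sets avoid duplicate edges
    # Assumption: every direct call target is a function entry.
    entries = sorted({t for _, k, t in instrs if k == "call" and t is not None})
    return_sites = {}  # function entry -> addresses following its call sites

    for i, (addr, kind, target) in enumerate(instrs):
        src = containing(leaders, addr)
        if src is None or kind not in ("jmp", "jcc", "call"):
            continue
        if target is None or target == addr:
            continue  # indirect or self-targeting instructions are ignored
        dst = containing(leaders, target)
        if dst is not None:
            adj[src].add(dst)  # edge to the block identified by the target
        if i + 1 == len(instrs):
            continue
        next_addr = instrs[i + 1][0]
        if kind == "jcc":
            adj[src].add(containing(leaders, next_addr))  # second edge
        elif kind == "call" and containing(entries, addr) != target:
            # Non-recursive call: remember where its return should lead.
            return_sites.setdefault(target, []).append(next_addr)

    for addr, kind, _ in instrs:
        src = containing(leaders, addr)
        if kind != "ret" or src is None:
            continue
        func = containing(entries, addr)  # heuristic owner of this return
        for site in return_sites.get(func, []):
            adj[src].add(containing(leaders, site))
    return {block: sorted(succ) for block, succ in adj.items()}
```

In this sketch a recursive call contributes a single edge to the entry block of its own function and registers no return edge, which is one possible reading of the rule on recursive calls.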
According to the above proposal, the following 22 features are extracted from the control flow graph: the total number of vertices; the total number of edges; the number of vertices corresponding to functions in the import table; the maximum out-degree; the number of vertices whose function names are identified during disassembly; the number of vertices with an in-degree of zero; the number of vertices with an out-degree of zero; the maximum degree of the graph; the maximum in-degree; the number of vertices in the maximal connected subgraph; the number of edges pointing to import table vertices; the number of vertices with both out-degree and in-degree of zero; the number of edges pointing to vertices whose function names are identified during disassembly; the proportion of import table vertices among all vertices; the number of connected subgraphs; the proportion of vertices whose function names are identified; the proportion of vertices with an in-degree of zero; the proportion of vertices with an out-degree of zero; the proportion of vertices in the maximal connected subgraph; the proportion of vertices with both out-degree and in-degree of zero; the proportion of edges pointing to import table vertices among all edges; and the proportion of edges pointing to vertices whose function names are identified.
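As an illustration, the purely graph-based features in this list can be computed from an adjacency mapping with the Python sketch below. It makes several assumptions: the feature names are hypothetical, every successor also appears as a key of the mapping, connected subgraphs are interpreted as weakly connected components, and a self-loop counts once toward both the in-degree and the out-degree. Features that depend on the import table or on identified function names need symbol information and are not covered.

```python
def cfg_features(adj):
    """Compute graph-only features from {vertex: iterable of successors}."""
    vertices = list(adj)
    n = len(vertices)
    succ = {v: set(adj[v]) for v in vertices}
    out_deg = {v: len(succ[v]) for v in vertices}
    in_deg = {v: 0 for v in vertices}
    undirected = {v: set() for v in vertices}
    for v in vertices:
        for w in succ[v]:
            in_deg[w] += 1
            undirected[v].add(w)
            undirected[w].add(v)

    # Weakly connected components via an iterative depth-first search.
    seen, sizes = set(), []
    for v in vertices:
        if v in seen:
            continue
        seen.add(v)
        stack, size = [v], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in undirected[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)

    def ratio(count):
        return count / n if n else 0.0

    in_zero = sum(1 for v in vertices if in_deg[v] == 0)
    out_zero = sum(1 for v in vertices if out_deg[v] == 0)
    both_zero = sum(1 for v in vertices if in_deg[v] == 0 and out_deg[v] == 0)
    largest = max(sizes, default=0)
    return {
        "num_vertices": n,
        "num_edges": sum(out_deg.values()),
        "max_out_degree": max(out_deg.values(), default=0),
        "max_in_degree": max(in_deg.values(), default=0),
        "max_degree": max((in_deg[v] + out_deg[v] for v in vertices), default=0),
        "num_in_zero": in_zero,
        "num_out_zero": out_zero,
        "num_isolated": both_zero,
        "num_components": len(sizes),
        "largest_component": largest,
        "ratio_in_zero": ratio(in_zero),
        "ratio_out_zero": ratio(out_zero),
        "ratio_isolated": ratio(both_zero),
        "ratio_largest_component": ratio(largest),
    }
```

The resulting dictionary can be flattened into one numeric row per sample before it is written to the ARFF file.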
Compared with the existing technology, the invention has the following beneficial effects: 1) there is no need to compare directly against a huge feature database, so detection is faster and unknown malware can be detected; 2) the classifier is small and trains quickly, because it uses only the extracted set of 22 features; 3) updating the classifier only requires expanding and updating the data set and retraining, which takes little time, and the detection time does not increase significantly as the classifier is updated; 4) compared with methods based on software description information or local features, this method is more stable, and it is harder for malware developers to make targeted changes to evade detection.