Definitions:
Adaboost (Boosting): A method for combining multiple classifiers. Adaboost was introduced by [Freund and Schapire, 1997]. Given a set of examples and a base classifier, it generates a set of hypotheses that are combined by weighted majority voting. Learning is achieved in iterations. In each iteration, a new set of instances is selected by favoring instances that were misclassified in previous iterations. This is done using an iteratively updated distribution that assigns each instance a probability of being selected in the next iteration.
Artificial Neural Networks (ANN) [Bishop, 1995]: An information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element is the structure of the information processing system, which is a network composed of a large number of highly interconnected neurons working together in order to approximate a specific function. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process during which the individual weights of the different neuron inputs are updated by a training algorithm, such as back-propagation. The weights are updated according to the examples the network receives, so as to reduce the error function. The next equation presents the output computation of a two-layered ANN, where x is the input vector, v_i is a weight in the output neuron, g is the activation function, w_ij is the weight of a hidden neuron and b_i and b_0 are biases.
$$ f(x) = g\left[ \sum_{i} v_i \, g\left( \sum_{j} w_{ij} x_j + b_i \right) + b_0 \right] $$
Antivirus and antispyware: Software that searches for viruses and/or spyware and removes them. It can be updated in order to check for new viruses and/or spyware. In most cases, this type of software is signature-based.
Backdoor (or trapdoor): A way of allowing access to an IT system (a program, an online service or an entire system). Such code is written by the programmer who creates the program, or can be injected into the code by a malcode programmer.
Behavior blocker: A system which attempts to detect sequences of events in operating systems.
Binary code: Code consisting of logical zeros and ones.
Binary file: A computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes.
Boosting: An ML meta-algorithm for performing supervised learning.
Categorization: The process in which entities are identified, differentiated and classified. Categorization implies that objects are grouped into categories, usually for some specific purpose.
Class: A collection of objects that can be unambiguously defined by one or more properties that all its members share.
Classifier: A rule set which is learnt from a given training set that includes examples of the classes. In the context of the present invention, these classes are malicious and benign files.
Data Mining (DM): It has been described as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley, Piatetsky-Shapiro, Matheus 1992] and “the science of extracting useful information from large data sets or databases” [Hand, Mannila, Smyth 2001].
Decision Trees (DT): A well-established family of learning algorithms [Quinlan, 1993]. Classifiers are represented as trees whose internal nodes are tests of individual features and whose leaves are classification decisions (classes). Typically, a greedy heuristic search method is used to find a small decision tree, which is induced from the data set by splitting the variables based on the expected information gain. This method correctly classifies the training data. Modern implementations include pruning, which reduces overfitting. One of the well-known algorithms from this family is C4.5 [Quinlan, 1993]; an implementation of it is called J48 [Witten and Frank, 2005]. An important characteristic of decision trees is the explicit form of their knowledge, which can be easily represented as rules.
Feature Extraction (FE): A special form of dimensionality reduction based on the transformation of an input dataset into a set of features.
File: A collection of data or information identified by a filename. It can contain textual, numerical, graphical, audio and video data. Many files can be executable files (or can contain executable code).
    Benign file: A file which does not contain malcode.
    Malcode file: A file which contains at least a part of malcode.
Heuristic-based methods: Methods based on rules defined by experts, which characterize a behavior in order to enable its detection.
[Gryaznov, 1999]
Imbalance problem: In ML, a problem which can appear in the learning process when one class is represented by a large number of examples while the other is represented by only a few.
Information Retrieval (IR): The science of information searching. This information can be a full document or a part of one. A document in this field is any element which contains data. Here, a computer file is a document.
Integrity checkers: Systems which periodically check for changes in files and disks.
Network Services (and Internet Network Services): Software which provides a client computer with services (such as webmail) or shared resources (such as data sharing).
KNN (K-Nearest-Neighbor): An ML algorithm. It classifies an object according to a majority vote of its neighbors. An object is assigned to the class most common amongst its k nearest neighbors.
Machine learning (ML): A subfield of Artificial Intelligence related to the design and development of algorithms and techniques that allow computers to “learn”.
Malcode (Malicious Code): Commonly refers to pieces of code, not necessarily executable files, which are intended to harm, generally or in particular, the specific owner of the host. Malcodes are classified, mainly based on their transport mechanism, into five main categories: worms, viruses, Trojans and a new group that is becoming more common, which is comprised of remote access Trojans and backdoors.
Naïve Bayes (NB) classifier: It is based on the Bayes theorem, which in the context of classification states that the posterior probability of a class is proportional to its prior probability, as well as to the conditional likelihood of the features, given this class. If no independence assumptions are made, a Bayesian algorithm must estimate conditional probabilities for an exponential number of feature combinations.
NB simplifies this process by assuming that features are conditionally independent, given the class, and requires that only a linear number of parameters be estimated. The prior probability of each class and the probability of each feature, given each class, are easily estimated from the training data and used to determine the posterior probability of each class, given a set of features. Empirically, NB has been shown to accurately classify data across a variety of problem domains [Domingos and Pazzani, 1997].
n-gram: A sub-sequence of n items from a given sequence.
Network: A system which provides access paths for communication between users. Communication networks may be designed for data, voice and video. They may be private with limited access or public with open access.
Signature-based methods: Methods which rely on the identification of unique strings in the code; while very precise, they are useless against unknown malicious code.
Spyware: Any software application that, without the user's knowledge, collects information about the system it resides on and reports that information back to a third party. Like a virus, it can be loaded onto a computer without the user's knowledge.
Support Vector Machines (SVM): A binary classifier, known to handle large numbers of features, which finds a linear hyperplane that separates the given examples of two classes. Given a training set of labeled examples in a vector format, the SVM attempts to specify a linear hyperplane that has the maximal margin, defined by the maximal (perpendicular) distance between the examples of the two classes. The examples lying closest to the hyperplane are known as the support vectors. The normal vector of the hyperplane (denoted as w in the next equation, in which n is the number of training examples) is a linear combination of the support vectors multiplied by Lagrange multipliers (alphas).
Often the data set cannot be linearly separated, so a kernel function K is used. The SVM actually projects the examples into a higher dimensional space to create a linear separation of the examples.        
$$ f(x) = \operatorname{sign}\left( w \cdot \Phi(x) \right) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) \right) $$
Test Set: Data set used to test the quality of a classification algorithm.
Training Set: Data set used to allow a classification algorithm to learn to identify classes.
Text Categorization: Classification of documents into a predefined number of categories. Each document can be in zero, one or more categories. Using ML, the objective is to learn classifiers which perform the category assignments automatically.
Trojan (or Trojan Horse): A destructive program which is presented as a benign file. Unlike viruses, Trojans do not replicate themselves.
Virus: A program or piece of code that is loaded onto a computer without the user's knowledge and runs against the user's wishes. It can replicate itself. A virus is not always destructive, but it can disturb the computer's work by using up the computer's resources.
Worm: A program that replicates itself over a computer network and usually performs malicious actions, such as using up the computer's resources.
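As an illustration of the SVM decision function given in the definitions above, the following minimal sketch evaluates sign(Σ α_i y_i K(x_i, x)) over a handful of support vectors. The RBF kernel, the `gamma` value, the function names and the toy support vectors, labels and multipliers are all assumptions chosen for illustration; they do not come from a trained model, and no bias term is used, as in the equation above.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Assumed kernel K(a, b) = exp(-gamma * ||a - b||^2); other kernels may be used."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

def svm_decide(x, support_vectors, labels, alphas, kernel=rbf_kernel):
    """Return +1 or -1 according to sign(sum_i alpha_i * y_i * K(x_i, x))."""
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1

# Hypothetical toy values (not learned from data): one support vector per class.
SUPPORT_VECTORS = [(0.0, 0.0), (1.0, 1.0)]
LABELS = [-1, 1]       # y_i: -1 = benign, +1 = malicious (an assumed convention)
ALPHAS = [1.0, 1.0]    # Lagrange multipliers alpha_i
```

With these toy values, a point near (1, 1) falls on the positive side and a point near (0, 0) on the negative side, because the kernel gives more weight to the closer support vector.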
The recent growth in high-speed internet connections and in internet network services has led to an increase in the creation of new malicious codes for various purposes, based on economic, political, criminal or terrorist motives (among others). Some of these codes have been used to gather information, such as passwords and credit card numbers, as well as for behavior monitoring. Current antivirus and antispyware technologies are primarily based on two approaches: signature-based and heuristic. Other proposed methods include behavior blockers and integrity checkers. However, besides the fact that these methods can be bypassed by viruses or spyware, their main drawback is that, by definition, they can only detect the presence of malcode after the infected program has been executed. This is unlike the signature-based and heuristic-based methods; the latter, however, are very time-consuming and have a relatively high false alarm rate. Therefore, generalizing the detection methods to be able to detect unknown malcodes is crucial.
In previous approaches, classification algorithms were employed to automate and extend the idea of heuristic-based methods. The binary code of a file is represented by n-grams, and classifiers are applied to learn patterns in the code and classify large amounts of data. Studies have shown that this is a very successful strategy. However, these studies present evaluations based on test collections in which the proportion of malicious to benign files is similar in the training set and the test set (about 50% malicious files). This has two potential shortcomings: these conditions do not reflect real-life situations, in which malicious code commonly makes up significantly less than 50% of the files, and they may therefore yield optimistic results.
A recent survey by McAfee (McAfee Study Finds 4 Percent of Search Results Malicious, By Frederick Lane, Jun. 4, 2007: [http://www.newsfactor.com/story.xhtml?story_id=010000CEUEQO]) indicates that about 4% of search results from the major search engines on the web contain malicious code. Additionally, Shin et al. (Shin, J. Jung, H. Balakrishnan, Malware Prevalence in the KaZaA File-Sharing Network, Internet Measurement Conference (IMC), Brazil, October 2006) found that more than 15% of the files in a Peer-to-Peer network contained malicious code. According to these data, the proportion of malicious files in real life is about 10% or less.
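To illustrate why a test collection with 50% malicious files can report optimistic results, the following sketch computes, for one fixed classifier, the precision (the fraction of files flagged as malicious that really are malicious) at different malicious-file proportions. The true-positive rate of 0.95 and the false-positive rate of 0.05 are hypothetical values chosen for illustration, not results from any of the studies cited here.

```python
def precision(tpr, fpr, malicious_fraction):
    """Fraction of flagged files that are truly malicious, for given class proportions."""
    true_positives = tpr * malicious_fraction
    false_positives = fpr * (1.0 - malicious_fraction)
    flagged = true_positives + false_positives
    if flagged == 0:
        return 0.0  # nothing is flagged, so precision is undefined; 0.0 is a convention
    return true_positives / flagged

# Hypothetical classifier quality, held fixed across test collections.
TPR, FPR = 0.95, 0.05

if __name__ == "__main__":
    for fraction in (0.50, 0.10, 0.04):
        print(f"{fraction:.0%} malicious: precision = {precision(TPR, FPR, fraction):.3f}")
```

Under these assumed rates, the precision is 0.95 on a balanced collection but drops to roughly 0.68 at 10% malicious files and roughly 0.44 at 4%, even though the classifier itself is unchanged.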
In recent years, several studies have investigated the detection of unknown malcode based on its binary code. Shultz et al. [Shultz et al. 2001] were the first to introduce the idea of applying ML and DM methods for the detection of different malcodes based on their respective binary codes. They used three different FE approaches: program header, string features and byte sequence features, to which they applied four classifiers: a signature-based method (an antivirus), Ripper (a rule-based learner), NB and Multi-Naïve Bayes. This study found that all of the ML methods were more accurate than the signature-based algorithm. The ML methods were more than twice as accurate, with the best-performing methods being Naïve Bayes using strings and Multi-NB using byte sequences. Abou-Assaleh et al. [2004] introduced a framework that used the common n-gram (CNG) method and the KNN classifier for the detection of malcodes. For each class, malicious and benign, a representative profile was constructed. A new executable file was compared with the profiles and assigned to the class of the most similar profile. Two different data sets were used: the I-worm collection, which consisted of 292 Windows internet worms, and the win32 collection, which consisted of 493 Windows viruses. The best results were achieved by using 3- to 6-grams and a profile of 500 to 5000 features.
Kolter and Maloof [Kolter and Maloof 2004] presented a collection that included 1971 benign and 1651 malicious executable files. n-grams were extracted and 500 were selected using the Information Gain measure [Mitchell, 1997]. The vector of n-gram features was binary, representing the presence or absence of a feature in the file and ignoring the frequency of feature appearances. In their experiment, they trained several classifiers: IBK (a KNN classifier), a similarity-based classifier called the TFIDF classifier, NB, SMO (an SVM classifier) and J48 (a Decision Tree). The last three of these were also boosted. Two main experiments were conducted on two different data sets, a small collection and a large collection. The small collection included 476 malicious and 561 benign executables and the larger collection included 1651 malicious and 1971 benign executables. In both experiments, the four best-performing classifiers were Boosted J48, SVM, Boosted SVM and IBK. Boosted J48 out-performed the others. The authors indicated that the results of their n-gram study were better than those presented by Shultz and Eskin [Shultz and Eskin 2001].
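The representation described above (byte n-grams, selection by Information Gain, binary presence/absence vectors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the function names, the tie-breaking rule and the four toy byte strings are hypothetical, and only two features are selected here rather than 500.

```python
import math

def byte_ngrams(data: bytes, n: int) -> set:
    """Set of distinct byte n-grams occurring in a file's binary content."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def entropy(positives: int, negatives: int) -> float:
    """Binary class entropy in bits; an empty group has entropy 0."""
    total = positives + negatives
    h = 0.0
    for count in (positives, negatives):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(grams_per_file, labels, gram) -> float:
    """H(class) minus the expected class entropy after splitting on gram presence."""
    total = len(labels)
    base = entropy(sum(labels), total - sum(labels))
    remainder = 0.0
    for present in (True, False):
        subset = [y for grams, y in zip(grams_per_file, labels)
                  if (gram in grams) == present]
        if subset:
            remainder += len(subset) / total * entropy(sum(subset), len(subset) - sum(subset))
    return base - remainder

def select_features(samples, labels, n, k):
    """Top-k n-grams by information gain; ties broken by byte order (an assumption)."""
    grams_per_file = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*grams_per_file)
    ranked = sorted(vocabulary,
                    key=lambda g: (-information_gain(grams_per_file, labels, g), g))
    return ranked[:k]

def to_binary_vector(data: bytes, features, n):
    """1 if the feature n-gram is present in the file, else 0 (frequency is ignored)."""
    present = byte_ngrams(data, n)
    return [1 if f in present else 0 for f in features]

# Hypothetical toy collection: label 1 = malicious, 0 = benign.
SAMPLES = [b"\x90\x90\xeb\xfe", b"\x90\x90\xcc\xcc", b"\x00\x01\x02\x03", b"\x00\x01\x04\x05"]
LABELS = [1, 1, 0, 0]
FEATURES = select_features(SAMPLES, LABELS, n=2, k=2)
```

In this toy collection the 2-grams `\x00\x01` and `\x90\x90` each separate the two classes perfectly, so they receive the maximal information gain of 1 bit and are the two features selected.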
Recently, Kolter and Maloof [Kolter and Maloof 2006] reported an extension of their work, in which they classified malcodes into classes based on the functions in their respective payloads. In the multi-class categorization task, the best results were achieved for the classes mass mailer, backdoor and virus (no benign classes). In an attempt to estimate the ability to detect malicious codes based on their issue dates, the classifiers were trained on files issued before July 2003, and then tested on 291 files issued from that point in time through August 2004. The results were, as expected, lower than those of previous experiments. Those results indicate the importance of maintaining such a training set through the acquisition of new executables, in order to cope with unknown new executables.
Henchiri and Japkowicz [Henchiri and Japkowicz 2006] presented a hierarchical feature selection approach which enables the selection of n-gram features that appear at rates above a specified threshold in a specific virus family, as well as in more than a minimal number of virus classes (families). They applied several classifiers (ID3 (a Decision Tree), J48, NB, SVM and SMO) to the data set used by Shultz et al. [Shultz et al. 2001] and obtained results that were better than those obtained through traditional feature selection, as presented in [Shultz et al., 2001], which mainly focused on 5-grams. However, it is not clear whether these results reflect the feature selection method or the number of features that were used.
The class imbalance problem was introduced to the ML research community about a decade ago. Typically it occurs when there are significantly more instances from one class relative to the other classes. In such cases the classifier tends to misclassify the instances of the under-represented classes. More and more researchers have realized that the performance of their classifiers may be suboptimal because the datasets are not balanced. This problem is even more important in fields where the natural datasets are highly imbalanced in the first place [Chawla et al, 2004], like the problem we describe.
Salton presented the vector space model [Salton and Weng, 1975] to represent a textual file as a bag of words. After parsing the text and extracting the words, a vocabulary of the entire collection of words is constructed. Each of these words may appear zero to multiple times in a document. A vector of terms is created, such that each index in the vector represents the term frequency (TF) in the document. Equation 1 shows the definition of a normalized TF, in which the term frequency is divided by the frequency of the most frequent term in the document, giving values in the range [0, 1].
Another common representation is the TF Inverse Document Frequency (TFIDF), which combines the frequency of a term in the document (TF) and its frequency in the document collection, as shown in Equation 2, in which the term's (normalized) TF value is multiplied by IDF = log(N/n), where N is the number of documents in the entire file collection and n is the number of documents in which the term appears.
$$ TF = \frac{\text{term frequency}}{\max(\text{term frequency in document})} \qquad [\text{Eq. 1}] $$
$$ TFIDF = TF \times \log(DF), \quad \text{where } DF = \frac{N}{n} \qquad [\text{Eq. 2}] $$
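Equations 1 and 2 can be illustrated with the following minimal sketch, which computes the normalized TF of each term in a document and then the TFIDF weights over a small collection. The function names and the toy documents are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not specified above.

```python
import math
from collections import Counter

def normalized_tf(terms):
    """Eq. 1: each term's count divided by the count of the most frequent term."""
    counts = Counter(terms)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

def tfidf(documents):
    """Eq. 2: TF * log(N / n), with N documents and n documents containing the term."""
    total_docs = len(documents)
    docs_with_term = Counter()
    for doc in documents:
        docs_with_term.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        tf = normalized_tf(doc)
        weights.append({
            term: value * math.log(total_docs / docs_with_term[term])  # natural log assumed
            for term, value in tf.items()
        })
    return weights

# Hypothetical toy collection of three tokenized documents.
DOCUMENTS = [["a", "a", "b"], ["a", "c"], ["c", "c", "c", "d"]]
WEIGHTS = tfidf(DOCUMENTS)
```

In the first toy document, "a" has the maximal TF of 1 but appears in two of the three documents, so its weight is log(3/2), while the rarer "b" has a TF of 0.5 and a weight of 0.5 × log(3). A term that appears in every document receives a weight of zero.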
It is therefore an object of the present invention to provide a method for increasing the unknown malcode discovery and detection ratio.
It is another object of the present invention to provide a method based on IR concepts and text categorization.
It is a further object of the present invention to overcome the imbalance problem in the training step and to optimize the classifiers' test set to be in accordance with real-life scenarios.
It is still another object of the present invention to provide efficient malcode detection with commonly used classification algorithms.
It is yet another object of the present invention to provide efficient malcode detection with low levels of false alarms.
Further purposes and advantages of this invention will appear as the description proceeds.