1. Field of the Invention
The present invention relates to computers and computer networks. More particularly, the invention relates to profiling Internet traffic flows to identify network applications and/or security threats responsible for the traffic flows.
2. Background of the Related Art
In recent years, the number of cyber attacks has kept increasing, affecting millions of systems. The software behind such malicious activities, often termed malware (short for malicious software), includes worms, botnets, trojans, backdoors, spyware, etc. There is also a new trend of exploiting social networks and mobile devices. In addition, the sophistication and effectiveness of cyber attacks have steadily advanced. These attacks often take advantage of flaws in software code, use exploits that can circumvent the signature-based tools commonly used to identify and prevent known threats, and employ social engineering techniques designed to trick the unsuspecting user into divulging sensitive information or propagating attacks. These attacks are becoming increasingly automated with the use of botnets, i.e., networks of compromised computers that can be remotely controlled by attackers to automatically launch attacks. Bots (short for robots) have become a key automation tool to speed the infection of vulnerable systems and are extremely stealthy in the way they communicate and exfiltrate personal/proprietary information from the victims' machines/servers. The integration of such sophisticated computer attacks with well-established fraud mechanisms devised by organized crime has resulted in an underground economy that trades compromised hosts, personal information, and services in a way similar to legitimate economies. This expanding underground economy makes it possible to significantly increase the scale of the frauds carried out on the Internet and allows criminals to reach millions of potential victims.
Such continuous and ever-changing challenges in protecting users have made cyber-security a very active and bleeding-edge research area. This has become an arms race between security researchers and malicious users. Today's approach to information security can be broken down into two major classes of technologies: host security and network security.
A prevalent category of host-based security is malware prevention, comprising a broad group of agent-based solutions that look for particular signatures and behavioral signs of malicious code execution at the host level. This approach, known as blacklisting, focuses on matching specific aspects of application code and particular actions being attempted by applications for detection. Signature-based/blacklisting detection has been around for many years. In that same time, viruses, worms, sniffers, trojans, bots and other forms of malware have infiltrated e-mail, instant messaging, and later, social networking sites for the purpose of criminal financial gain. With improvements in correlation and centralized management, blacklisting still works very effectively in most distributed enterprises and is able to (i) pinpoint malicious activities with a high detection rate and very low false positive/false negative rates, (ii) reverse engineer the malware executable to highlight malware inner properties such as message structure and message passing (strengths and weaknesses of the malware), and (iii) assess the level of risk of the threat by analyzing its effects on the end-host (such as system calls, registries being touched, etc.). However, because these signature-based models depend on advance knowledge of malicious code and behaviors, some instances can be missed, leading to potential malicious execution.
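As a minimal sketch of the signature-matching idea described above, the following Python fragment scans a payload for known byte patterns. The signature names and byte patterns are hypothetical placeholders chosen for illustration only; they are not taken from any real malware database, and real agents match far richer properties than raw substrings.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "example-worm": b"\xde\xad\xbe\xef",
    "example-bot": b"JOIN #cmd",
}


def match_signatures(payload: bytes, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose pattern occurs in payload."""
    return sorted(name for name, pattern in signatures.items() if pattern in payload)


if __name__ == "__main__":
    print(match_signatures(b"NICK bot42\r\nJOIN #cmd\r\n"))  # known pattern present
    print(match_signatures(b"never-seen-before payload"))    # nothing matches
```

The second call illustrates the shortfall noted above: a payload for which no signature exists yet produces no match, even if it is malicious.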
On the network side, three prevalent approaches are blended together to offer network-based security: (i) firewall systems, (ii) intrusion detection/prevention systems (IDS/IPS), and (iii) network behavior anomaly detection (NBAD) systems. These three different approaches complement each other and are commonly adopted/deployed by enterprises to form a holistic network security strategy. Generally, the first two approaches tackle the network security problem in a similar fashion to host security (usage of threat signatures specialized at the network level), and are thus prone to similar benefits and shortfalls. The third approach attempts to discover threats without requiring a-priori knowledge of the malicious code and behavior by using algorithms to generate model(s) that retain(s) the properties of good traffic and raise an alarm for sessions that do not conform to the model. While effective in spotting threats never seen before, the third approach is still prone to a high rate of false positives/false negatives that the security analyst is forced to screen before making a decision. This shortfall is mostly due to the lack of a solid ground truth that the statistical tools can be trained on to produce precise statistical models emulating the threat activities.
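The third approach can be sketched, under strong simplifying assumptions, as a per-feature statistical baseline learned from known-good sessions. The feature layout (bytes and packets per session), the function names, and the three-standard-deviation threshold are all assumptions made for this illustration; deployed NBAD systems use considerably more elaborate models.

```python
from statistics import mean, pstdev


def fit_baseline(good_sessions):
    """Learn a (mean, standard deviation) pair per feature from known-good sessions."""
    columns = list(zip(*good_sessions))
    return [(mean(column), pstdev(column)) for column in columns]


def is_anomalous(session, baseline, threshold=3.0):
    """Alarm when any feature lies more than `threshold` std devs from its mean."""
    for value, (mu, sigma) in zip(session, baseline):
        if sigma == 0:
            # A feature that never varied in training: any change is a deviation.
            if value != mu:
                return True
            continue
        if abs(value - mu) / sigma > threshold:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical training data: (bytes, packets) for sessions assumed to be good.
    good = [(100, 10), (110, 12), (90, 11), (105, 9), (95, 10)]
    baseline = fit_baseline(good)
    print(is_anomalous((102, 11), baseline))   # close to the baseline
    print(is_anomalous((5000, 10), baseline))  # far outside the baseline
```

A tight threshold flags unusual but benign sessions and a loose one misses threats, which is one way the false positive/false negative trade-off mentioned above arises.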
A support vector machine (SVM) is a set of supervised learning methods that analyze statistically related data items and recognize patterns for classification and regression analysis. In particular, the SVM is a non-probabilistic binary linear classifier that receives a set of input data and predicts, for each given input, which of two possible classes the input belongs to. Given a set of training data items, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new data items into one class or the other. An SVM model is a representation of the data items as points in a hyperspace, mapped so that the data items of the separate classes are divided by a clear gap that is as wide as possible. New data items are then mapped into that same hyperspace and predicted to belong to a class based on which side of the gap they fall on.
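The training and prediction steps just described can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a sketch under stated assumptions, not a production solver: the learning rate, regularization strength, epoch count, and toy data are arbitrary choices made for the example.

```python
def decision_value(w, b, x):
    """Signed score of x; its sign says which side of the hyperplane x falls on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(points, labels, lam=0.01, eta=0.1, epochs=500):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss.

    labels must be +1 or -1; lam is the regularization strength, eta the step size.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * decision_value(w, b, x)
            # The regularization term shrinks w, which favors a wider gap.
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                # The point is inside the gap or misclassified: move toward it.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def predict(w, b, x):
    """Assign x to class +1 or -1 according to the side of the gap it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


if __name__ == "__main__":
    # Toy, linearly separable training set with two features per data item.
    points = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.8, 0.8), (0.9, 0.8), (0.8, 0.9)]
    labels = [-1, -1, -1, 1, 1, 1]
    w, b = train_linear_svm(points, labels)
    print([predict(w, b, x) for x in points])
    print(predict(w, b, (0.0, 0.0)), predict(w, b, (1.0, 1.0)))
```

New data items are classified only by the sign of the score, which reflects the non-probabilistic nature of the classifier described above.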
The statistically related data items may correspond to points in a finite dimensional space, where each coordinate corresponds to one feature of the data items. The two classes of the SVM are often not linearly separable in that space. This finite dimensional space may be mapped into a higher dimensional space to allow easier separation by using a kernel method. Kernel methods are a class of algorithms for pattern analysis to find general types of relations (e.g., clusters, rankings, principal components, correlations, classifications) in general types of data items (e.g., sequences, text documents, sets of points, vectors, images, etc.). Kernel methods use a function, referred to as a kernel, that computes the inner product (a measure of similarity) between pairs of data items as mapped into the higher dimensional space. The term kernel is also used for the weighting function in kernel density estimation, which estimates random variables' density functions; that is a related but distinct usage. In particular, the use of the kernel enables the kernel methods to operate in the higher dimensional space without computing the coordinates of the data items in the higher dimensional space.
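To make the last point concrete, the sketch below compares a degree-2 polynomial kernel on two-dimensional points with an explicit mapping into three dimensions. The choice of kernel and the function names are assumptions made for illustration; the point is only that both routes yield the same inner product, while the kernel never constructs the higher dimensional coordinates.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel: the squared inner product in the original space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def explicit_map(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    via_kernel = poly2_kernel(x, z)                      # stays in 2 dimensions
    via_mapping = dot(explicit_map(x), explicit_map(z))  # goes through 3 dimensions
    print(via_kernel, via_mapping)
```

For kernels whose implicit space is very large or infinite dimensional, only the kernel route is practical.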
Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. In a decision tree, leaves (i.e., leaf nodes) represent class labels and branches (i.e., edges) represent conjunctions of features that lead to those class labels.
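The structure of such a model can be sketched as nested nodes, where following the branches from the root applies a conjunction of feature tests and the leaf reached supplies the class label. The tree below is hand-written rather than learned, and its feature names, thresholds, and labels are hypothetical values chosen for illustration.

```python
# Hypothetical hand-built tree: internal nodes test one feature, leaves hold a label.
TREE = {
    "feature": "dst_port",
    "threshold": 1024,
    "left": {"label": "well-known-service"},
    "right": {
        "feature": "mean_packet_size",
        "threshold": 500,
        "left": {"label": "interactive"},
        "right": {"label": "bulk-transfer"},
    },
}


def predict(tree, sample):
    """Walk from the root to a leaf and return the class label stored there."""
    node = tree
    while "label" not in node:
        # Values below the threshold follow the left branch, others the right.
        branch = "left" if sample[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]


if __name__ == "__main__":
    print(predict(TREE, {"dst_port": 80, "mean_packet_size": 900}))
    print(predict(TREE, {"dst_port": 50000, "mean_packet_size": 1200}))
```

A learning algorithm would choose the features and thresholds automatically from labeled training data; only the resulting prediction step is shown here.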