1. Field of the Disclosure
The present disclosure is directed generally to classification of traffic in a packet network.
2. Description of the Related Art
Internet Protocol (IP) networks today carry a mixture of traffic for a diverse range of applications. The ability to accurately classify this traffic according to the types of applications that generate the traffic is vital to the workings of a wide swathe of IP network management functions including traffic engineering, capacity planning, traffic policing, traffic prioritization, monitoring service level agreements (SLAs) and security. For example, traffic classification is an essential first step for developing application workload characterizations and traffic models that in turn serve as inputs to efforts to optimize the network design, plan for future growth and adjust to changing trends in application usage.
Given the importance of the traffic classification problem, much effort has been devoted to developing traffic classifier systems and methods. The simplest classification method is to use port numbers, mapping the TCP or UDP server port of a connection to an application using the IANA (Internet Assigned Numbers Authority) list of registered or well-known ports. Such fixed port-based classification is known to be inherently unreliable for various reasons, all tied to the fact that an application has full discretion to determine its server port. Reasons why applications use non-standard ports include: (i) traversing firewalls, circumventing operating system restrictions, and evading detection; (ii) dynamic allocation of server ports, such as that used by FTP for data transfer; and (iii) avoiding interference when the same standard port is used by multiple applications. For example, the SSH (secure shell) protocol, which runs on TCP port 22, is used both for interactive operations and for data downloads by the SCP (secure copy) file transfer protocol. As another example, many non-web applications are known to use ports 80, 8000, and 8080 (normally assumed to be “web ports”) to cross firewalls, which often have these ports open. These limitations have fueled efforts to devise alternative approaches that use specific features present in the application-generated traffic to guide the classification.
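The following minimal sketch illustrates fixed port-based classification and the failure modes described above. The table is a small hypothetical excerpt of well-known ports, not the full IANA registry, and the function name `classify_by_port` is an assumption introduced only for illustration.

```python
# Hypothetical excerpt of a well-known-port table; the real IANA
# registry is far larger. Keys are (transport protocol, server port).
WELL_KNOWN_PORTS = {
    ("tcp", 21): "ftp",
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def classify_by_port(protocol, server_port):
    """Return the application registered for the server port, or 'unknown'."""
    return WELL_KNOWN_PORTS.get((protocol.lower(), server_port), "unknown")


# An interactive SSH session and an SCP bulk transfer both use TCP
# port 22, so the lookup cannot tell them apart.
ssh_or_scp = classify_by_port("tcp", 22)

# A non-web application that uses port 80 to cross a firewall receives
# the same label as genuine web traffic.
tunneled = classify_by_port("tcp", 80)

# A dynamically allocated server port (for example, an FTP data
# connection) has no table entry at all.
dynamic = classify_by_port("tcp", 50312)
```

The lookup depends only on the server port, which the application is free to choose, so none of the three cases above can be resolved without additional features of the traffic.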
Another approach to traffic classification develops content-based application signatures based on deep packet inspection (DPI) of application layer (layer 7) features, looking deeper into a packet than just the network and transport layer headers. While very accurate, the approach necessitates the use of traffic capture devices that can scale for use with high-speed links. This is an expensive proposition, which limits the ability to deploy it on a wide scale. Using application layer (layer 7) signatures is also expensive in terms of the computational resources needed to process large data volumes (e.g., signatures requiring evaluation of regular expressions and variable offset signatures). Furthermore, specific policy environments may limit how application layer (layer 7) information is collected or utilized. Lastly, this approach does not work for encrypted content (e.g., when all the application-level information is hidden by IP-level encryption techniques used by security protocols like IPsec).
A different approach to traffic classification has been to use traffic classifiers with flow-level statistic inputs to identify network applications associated with particular traffic flows. Classifiers in general are software modules that use algorithms to provide a classification of an object based on an input set of features describing the object. A flow-based traffic classifier provides a classification of a traffic flow based on flow level statistics of a traffic flow. The use of flow-based traffic classifiers overcomes many of the problems of application layer (layer 7) approaches. Flow-based traffic classifiers are scalable. As flow reporting has been widely deployed in commercial routers, obtaining flow reports network wide does not put extra requirements on deployment or require development of new router features. In fact, many network providers already perform flow records collection as a daily routine operation. Furthermore, this approach also avoids the potential limitations of port and application layer approaches mentioned above.
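As a rough illustration of what flow-level statistics may look like, the sketch below groups packets by a 5-tuple key and summarizes each flow. The `Packet` type, the choice of key, and the particular statistics (packet count, byte count, duration, mean packet size) are assumptions made for this example; flow records exported by commercial routers may define flows and fields differently.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet summary; only header-level fields are used."""
    timestamp: float  # seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    length: int  # bytes


def flow_statistics(packets):
    """Group packets by 5-tuple and compute simple per-flow statistics."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key].append(p)

    stats = {}
    for key, pkts in flows.items():
        total_bytes = sum(p.length for p in pkts)
        first = min(p.timestamp for p in pkts)
        last = max(p.timestamp for p in pkts)
        stats[key] = {
            "packets": len(pkts),
            "bytes": total_bytes,
            "duration": last - first,
            "mean_packet_size": total_bytes / len(pkts),
        }
    return stats


# Illustrative input: three packets of one flow and one packet of another.
sample = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 40000, 22, "tcp", 100),
    Packet(1.0, "10.0.0.1", "10.0.0.2", 40000, 22, "tcp", 200),
    Packet(3.0, "10.0.0.1", "10.0.0.2", 40000, 22, "tcp", 300),
    Packet(2.0, "10.0.0.3", "10.0.0.4", 40001, 80, "tcp", 1500),
]
sample_stats = flow_statistics(sample)
```

Each per-flow dictionary is the kind of feature set a flow-based traffic classifier could take as input; no packet payload is inspected.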
Classifiers must be generated or trained with a set of training data (i.e., inputs for which the class is known) before they can accurately classify live data (i.e., inputs for which the class is not known). Two machine learning algorithms that may be used for classifier training are support vector machines (SVM) and AdaBoost. Successful operation of SVM and AdaBoost relies on two characteristics. First, uniform convergence bounds predict that the classification error observed on the test data diverges from the training error only within predictable bounds that depend on the number of examples, not on the number of features. The key underlying assumption is that test examples are “independent and identically distributed (IID).” That is, the test examples are picked randomly from the same distribution as the training data. Second, training is a convex optimization problem with guaranteed convergence in a time that is superlinear in the number of training examples. These characteristics encourage a “black box” approach: one collects every possible feature for a representative set of training examples and trains an off-the-shelf classifier. Prior work on application classification using machine learning has focused exclusively on such a black box approach. In reality, many of the above assumptions do not hold for network traffic, and a straightforward “black box” application of traditional machine learning is not well suited to the IP traffic classification problem and can fail spectacularly. Even though the traffic classification problem follows the definition of a typical multi-class classification problem, it presents many unique challenges.
A first challenge for traffic classification is that the IID assumption does not hold. The composition of applications and their relative traffic contributions have natural spatial and temporal variations. Even at the same monitoring point, the amount of traffic contributed by an application can vary over time (e.g., different applications can have different time-of-day or time-of-week effects), and hence the training and test sets can have different distributions.
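The difference between training and test distributions can be made concrete with a small sketch. The application labels and counts below are invented for illustration, and the total variation distance is merely one convenient measure chosen for this example; it is not drawn from the approaches discussed here.

```python
from collections import Counter


def application_mix(labels):
    """Return the fraction of flows contributed by each application."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {app: n / total for app, n in counts.items()}


def total_variation_distance(mix_a, mix_b):
    """Half the sum of absolute differences between two mixes (0 = identical)."""
    apps = set(mix_a) | set(mix_b)
    return 0.5 * sum(abs(mix_a.get(a, 0.0) - mix_b.get(a, 0.0)) for a in apps)


# Invented flow labels observed at one monitoring point at two times of day.
daytime = ["web"] * 6 + ["mail"] * 3 + ["p2p"] * 1
nighttime = ["web"] * 2 + ["mail"] * 1 + ["p2p"] * 7

shift = total_variation_distance(
    application_mix(daytime), application_mix(nighttime)
)
```

A classifier trained on the daytime sample and applied to the nighttime sample would face a markedly different application mix, which is the situation the IID assumption rules out.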
A second challenge for traffic classification is that typical networks carry an extremely large amount of traffic. How to make the best use of the potentially large training data set is a key issue, since most machine learning algorithms will experience scalability problems.
A third challenge for traffic classification is to achieve accuracy and stability. To be applicable to high-speed networks, a classifier should exhibit high classification accuracy and, in addition, must be fast enough to keep up with high traffic volumes.
A fourth challenge for traffic classification is to provide versatility. Different scenarios impose different requirements on traffic classification. For example, for the purpose of Internet accounting and billing, it is desirable to achieve high byte accuracy instead of high flow accuracy. As another example, in application identification and anomaly detection and prevention, a fast detection method is preferred, where a decision must be made before the entire flow is observed.
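The distinction between flow accuracy and byte accuracy can be sketched as follows. The record layout (`predicted`, `actual`, `bytes`) and the sample values are assumptions made for illustration only.

```python
def flow_accuracy(flows):
    """Fraction of flows whose predicted class matches the actual class."""
    correct = sum(1 for f in flows if f["predicted"] == f["actual"])
    return correct / len(flows)


def byte_accuracy(flows):
    """Fraction of bytes carried by correctly classified flows."""
    total = sum(f["bytes"] for f in flows)
    correct = sum(f["bytes"] for f in flows if f["predicted"] == f["actual"])
    return correct / total


# Invented example: nine small flows classified correctly and one
# large flow classified incorrectly.
flows = [
    {"predicted": "web", "actual": "web", "bytes": 1000} for _ in range(9)
]
flows.append({"predicted": "web", "actual": "p2p", "bytes": 91000})

fa = flow_accuracy(flows)  # 9 of 10 flows are correct
ba = byte_accuracy(flows)  # 9000 of 100000 bytes are correct
```

In this example the flow accuracy is 0.9 while the byte accuracy is 0.09, which is why a classifier tuned for one criterion may be unsuitable for a billing application that depends on the other.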
Several approaches have been proposed for traffic classification using machine learning with flow statistics, including approaches that use a Naive Bayes classifier. Bonfiglio et al. develop two approaches based on Naive Bayes classifiers and Pearson's Chi-Square tests to detect Skype traffic. They use flow-level statistics such as the packet length and arrival rate as features to detect this traffic. Bernaille et al. propose an approach using unsupervised learning of application classes by clustering of flow features and a derivation of heuristics for packet-based identification. Similarly, Crotti et al. use packet sizes, inter-arrival times, and arrival order of the first N packets as features for their classifier. This approach constructs protocol fingerprints, which are histograms of the observed variables for a flow. Erman et al. propose a semi-supervised machine learning approach based on clustering flow statistics. In addition to machine learning based approaches, Karagiannis et al. propose a classification approach based on behavioral analysis of the communication patterns of hosts.
However, these approaches do not point to a robust and scalable solution that addresses many of the practical challenges that need to be solved before such machine learning based classification can be deployed in commercial networks.