1. Field of the Invention
The present invention relates to computers and computer networks. More particularly, the invention relates to managing network traffic in the Internet.
2. Background of the Related Art
Managing large networks involves several critical aspects such as traffic engineering, network planning and provisioning, security, billing and Quality of service (QoS), fault management, and reliability. The ability of a network operator to accurately classify network traffic into different applications (both known and unknown) directly determines the success of many of the above network management tasks. For example, identifying non-profitable P2P traffic could help an Internet Service Provider (ISP) in providing better quality of service to other revenue-generating delay/loss sensitive applications. Hence it is imperative to develop traffic classification techniques that are fast, accurate, robust, and scalable in order to meet current and future needs of ISPs.
The popularity of the new generation of smart applications (e.g., peer-to-peer (hereinafter “P2P”) applications) has created several new challenges for accurately classifying network traffic in the Internet. Traditionally, ISPs have used port numbers to identify and classify network traffic. For example, HTTP traffic is usually carried over TCP port 80, SSH over TCP port 22, SMTP over TCP port 25, DNS over UDP port 53, and so on. Hence ISPs could detect, block, and/or shape any unwanted or unimportant traffic in their network. This approach is extremely easy to implement and introduces very little overhead on the traffic classifier. However, in order to circumvent detection (and subsequent blocking or shaping), recently developed applications (e.g., P2P file sharing networks) have started using non-standard ports for communication. For example, peers in a P2P network can choose random ports to communicate with each other. Furthermore, they can also use other standard ports, such as TCP port 80 traditionally used by HTTP, and tunnel their traffic from the source to the destination. These application-level strategies have made port-number-based traffic classification inaccurate and hence ineffective.
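The port-based approach described above can be illustrated with a minimal sketch. The table holds only the four port assignments named in the text; the function name `classify_by_port` and the label strings are assumptions made for illustration and do not describe any particular ISP's classifier.

```python
# Illustrative port table built only from the examples in the text.
# A deployed classifier would use a far larger mapping (assumption).
WELL_KNOWN_PORTS = {
    ("tcp", 80): "HTTP",
    ("tcp", 22): "SSH",
    ("tcp", 25): "SMTP",
    ("udp", 53): "DNS",
}


def classify_by_port(protocol, src_port, dst_port):
    """Return an application label from the port numbers, or 'unknown'.

    The destination port is checked first, then the source port, so that
    both directions of a flow receive the same label.
    """
    proto = protocol.lower()
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get((proto, port))
        if label is not None:
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_by_port("tcp", 51234, 80))    # labeled HTTP
    print(classify_by_port("tcp", 40000, 6881))  # no table entry
```

Note that a P2P flow tunneled over TCP port 80 would also be labeled HTTP by this lookup, which is exactly the weakness the paragraph describes.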
To address the problems with port-based traffic classification, techniques that rely on application payload were developed. The strategy here is to first develop a signature for a given application by analyzing and/or reverse engineering the application layer protocol, a laborious and time-consuming process. Using this signature, subsequent flows (i.e., traffic flows or network traffic flows) that belong to the application can be accurately identified using a straightforward pattern matching technique. Although this technique is fast, accurate, robust, reusable in different contexts (e.g., firewalls, routers, network address translators (NATs), etc.), and has been the de facto industry standard, it faces the problem of scalability for two reasons. First, keeping up with the number of applications that come up every day is impractical. For example, several hundred new P2P and gaming protocols have been introduced over the last 5 years. Second, reverse engineering these applications becomes increasingly hard when applications start using encryption, a common strategy adopted by applications to avoid detection. Consequently, keeping an up-to-date list of application signatures has become a challenging task for engineers.
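As a rough sketch of the payload-based pattern matching described above, the following example tests the first bytes of a flow against a small signature list. The two signatures (the BitTorrent handshake prefix and an HTTP request line) are chosen for illustration only, and the names `SIGNATURES` and `classify_by_payload` are hypothetical; a production signature set would be much larger and would not cover encrypted protocols.

```python
import re

# Illustrative signatures only (assumption): each entry pairs a label with
# a byte pattern expected at the start of the first payload of a flow.
SIGNATURES = [
    # BitTorrent handshake: length byte 19 followed by the protocol string.
    ("BitTorrent", re.compile(rb"^\x13BitTorrent protocol")),
    # HTTP request line for a few common methods.
    ("HTTP", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]\r\n")),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the label of the first matching signature, or 'unknown'."""
    for label, pattern in SIGNATURES:
        if pattern.search(payload):
            return label
    return "unknown"


if __name__ == "__main__":
    print(classify_by_payload(b"\x13BitTorrent protocol" + bytes(8)))
    print(classify_by_payload(b"GET /index.html HTTP/1.1\r\nHost: example\r\n\r\n"))
```

Because the match depends on payload rather than port, the BitTorrent handshake is recognized even when it is carried over TCP port 80; an encrypted payload, by contrast, would fall through to "unknown".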
Given the shortcomings of port-based and signature-based approaches, pattern classification techniques based on layer-3/layer-4 information have been developed. These techniques are less dependent on individual applications and instead focus on capturing and extracting commonalities in the behavior of families of applications (i.e., application classes, e.g., gaming, voice, video, peer-to-peer, etc.). Some approaches examine the connection patterns in layer-3/layer-4 network traffic and classify traffic into different application classes using machine learning and/or clustering algorithms; other approaches examine various specific attributes of flows to group them into different application classes.
Despite the development and advancement of pattern classification techniques, there remains a need for techniques defining processes (or methods) for classifying application classes of network traffic. It would be desirable to address several open questions about their applicability in the real world. First, because these techniques depend on statistical methods that need multiple flows and multiple packets from each flow, the time required to detect and report the discovery of an application class is much longer than with traditional layer-7 signature matching techniques. Second, most of the prior art techniques are incapable of differentiating individual applications that behave in a similar fashion at the macroscopic level. For instance, many of these techniques detect P2P traffic but cannot identify individual protocols (e.g., eDonkey, BitTorrent, or Gnutella), which is an important requirement for network operators to prioritize traffic. Third, these techniques are not as accurate and reliable as signature-based techniques since they are heavily dependent on the point of observation and network conditions (e.g., traffic asymmetry). Fourth, even though pattern classification, by monitoring only layer-3/layer-4 data, appears to consume fewer resources than the signature matching approaches, this is in fact not the case. Pattern classification requires maintaining a considerably larger number of states in memory for processing, which severely limits its effectiveness in operating at very high speeds.
Techniques have also been developed in the self-learning paradigm for traffic classification. The main goal of this paradigm is to minimize the manual intervention in detecting both known and unknown applications in networks. It has been shown that applications can be distinguished by just looking at the size and direction of the first N packets in every TCP connection. These packets are then clustered into application groups using techniques like K-Means, Gaussian Mixture Model, and spectral clustering. Despite the development and advancement of the self-learning techniques, there remains a need to address several open questions. First, the rate of detection of unknown applications is low (e.g., about 60%). Second, all packets of a TCP connection in both directions are required to classify traffic. Given that most ISP networks employ asymmetric routing (i.e., the path taken by a packet from the source to the destination can be completely different from that taken from the destination to the source), gaining access to both directions of traffic may not always be feasible. Third, both the complexity of the algorithms used (e.g., statistical clustering, machine learning, etc.) and the fact that these algorithms need to be executed on every flow seen by the classifier make these techniques infeasible for real-time traffic classification in high speed networks.
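The size-and-direction representation mentioned above can be sketched as a simple feature extraction step. The encoding below (positive sizes for client-to-server packets, negative for server-to-client, zero padding for short connections) and the names `first_n_features`, `"c2s"`, and `"s2c"` are assumptions made for illustration; the text does not specify an encoding. The clustering step (e.g., K-Means) is deliberately omitted.

```python
def first_n_features(packets, n=4):
    """Build a fixed-length feature vector from the first n packets.

    packets: iterable of (direction, size) tuples, where direction is
             "c2s" (client to server) or "s2c" (server to client).
    Returns a list of n signed sizes: positive for "c2s", negative for
    "s2c", padded with zeros when the connection has fewer than n packets.
    """
    features = []
    for direction, size in packets:
        if len(features) == n:
            break
        features.append(size if direction == "c2s" else -size)
    features.extend([0] * (n - len(features)))
    return features


if __name__ == "__main__":
    connection = [("c2s", 120), ("s2c", 1460), ("c2s", 40)]
    print(first_n_features(connection, n=4))
```

The sketch also makes the asymmetric-routing problem concrete: if only one direction of a connection is visible at the observation point, half of the entries in the vector are simply missing.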
In this paper, examples are given relating to a tier-1 ISP network, which supports the TCP/IP network data model known within the art. The TCP/IP network data model represents the network data in five layers including the Application layer (carrying application data), the Transport layer (carrying e.g., UDP datagram consisting of UDP header and UDP data), the Network layer (carrying e.g., IP packet consisting of IP header and IP data), the Data link layer (carrying e.g., frame header, frame data, and frame footer), and the Physical layer. Those skilled in the art also use the OSI model to represent the network data in seven layers including the Application layer (or layer 7), the Presentation layer (or layer 6), the Session layer (or layer 5), the Transport layer (or layer 4), the Network layer (or layer 3), the Data link layer (or layer 2), and the Physical layer (or layer 1). The Application layer (or layer 7), the Presentation layer (or layer 6), and the Session layer (or layer 5) of the OSI model roughly correspond to the Application layer of the TCP/IP model. Many other variations of layered network data model may also be implemented in a high speed network. In this paper, the layer-7 refers to the Application layer and the layer-3/layer-4 refers to the Network layer and/or the Transport layer.
P2P networks are categorized by those skilled in the art into structured P2P networks (e.g., CHORD, PASTRY, TAPESTRY, etc.) and unstructured P2P networks. In an unstructured P2P network, different peers join and leave the network as and when they please. Unstructured P2P networks are inherently distributed and provide an infrastructure for all users to exchange files, music, video, and other information without relying on any centralized servers. Many popular P2P networks have several million users at any time. This completely distributed approach to finding and exchanging information can lead to network meltdown. Most of the successful P2P networks that exist today adopt the strategy of constructing hybrid networks, where the P2P network elects a few nodes as leaders for groups of nodes based on the nodes' computing/network resources. These leaders are usually referred to as superpeers or ultrapeers. In this paper, P2P traffic refers to the traffic originating from these dynamic and hybrid P2P networks. Examples of such P2P networks include eDonkey, Gnutella, KaZaa, BitTorrent, Skype, etc.
Superpeers are typically connected to several other superpeers and the main objective here is to ensure that these superpeers (and hence the peers connected to them) are connected to the rest of the network. This architecture of P2P networks has a two-level hierarchy. The first level contains all the superpeers connected to several other superpeers in the same level. The second level contains peers connected to one or more superpeers in the first level. Note that these peers at the second level may or may not be connected to other peers in the same level. This architecture ensures that when peers join or leave a network, the impact on the network (in terms of connectivity of other peers) is minimal. However the impact is higher when superpeers leave the network. Hence nodes that have significantly higher uptime values are chosen to be superpeers. Although the actual functionality of a superpeer varies depending on the particular P2P application, in general, a superpeer acts as a gateway to the rest of the network for the group of peers that are connected to it.
Although peer-to-peer networks are application layer networks built on top of the IP layer, traffic from these networks behaves very similarly to the rest of the Internet traffic and is virtually indistinguishable from it. Hence, most of the strategies that have been proposed in the past for classifying P2P traffic based on only layer-3/layer-4 information rely on first detecting nodes that are running P2P applications, and then identifying P2P traffic based on these P2P nodes. A common strategy adopted by most P2P networks to get around the connectivity problem introduced by firewalls is to use both TCP and UDP protocols on any of the open ports. Furthermore, to optimize their performance, P2P nodes typically use both TCP and UDP protocols for control, signaling, and/or data flows. For example, a Skype peer initially talks to its superpeer using UDP, but later establishes TCP connections as well to acquire the address of the login server. Taking advantage of this property of P2P networks, one of the heuristics that has been proposed is to identify all source-destination node pairs that use both TCP and UDP transport protocols to communicate with each other and flag them as P2P nodes.
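A minimal sketch of the TCP/UDP heuristic described above follows. It assumes flow records reduced to hypothetical `(src_ip, dst_ip, protocol)` tuples and treats a node pair as unordered, so that traffic in either direction counts toward the same pair; both choices are assumptions for illustration rather than details taken from the text.

```python
from collections import defaultdict


def pairs_using_tcp_and_udp(flows):
    """Return the set of node pairs observed using both TCP and UDP.

    flows: iterable of (src_ip, dst_ip, protocol) tuples, with protocol
           given as "tcp" or "udp" (case-insensitive).
    Each pair is a frozenset of the two addresses, so direction is ignored.
    """
    protocols_by_pair = defaultdict(set)
    for src, dst, proto in flows:
        protocols_by_pair[frozenset((src, dst))].add(proto.lower())
    return {
        pair
        for pair, protos in protocols_by_pair.items()
        if {"tcp", "udp"} <= protos
    }


if __name__ == "__main__":
    flows = [
        ("10.0.0.1", "10.0.0.2", "UDP"),
        ("10.0.0.2", "10.0.0.1", "TCP"),
        ("10.0.0.1", "10.0.0.3", "TCP"),
    ]
    for pair in pairs_using_tcp_and_udp(flows):
        print(sorted(pair))
```

Only the pair that was seen with both transport protocols is flagged; the pair that used TCP alone is not.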
Another characteristic that distinguishes a P2P node from a node that does not run any P2P applications is the P2P node's ability to act as both a client and a server. For instance, a web server typically receives connections but does not initiate connections. Similarly, a web client typically opens one or more connections to a web server, but does not accept connections. However, a P2P node has the ability to both accept and open connections (to upload/download data/files) at the same time. Several techniques proposed in literature try to utilize this property of P2P nodes as a heuristic to detect them.
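The client-and-server heuristic described above can be sketched in a few lines. The input format, a list of hypothetical `(initiator, responder)` address pairs, is an assumption, and the sketch ignores timing: it flags any node that appears in both roles anywhere in the trace, whereas the text speaks of acting in both roles at the same time.

```python
def dual_role_nodes(connections):
    """Return nodes that both opened and accepted at least one connection.

    connections: iterable of (initiator_ip, responder_ip) tuples.
    Simplification (assumption): roles are compared over the whole trace,
    without checking that the connections overlap in time.
    """
    initiators = set()
    responders = set()
    for initiator, responder in connections:
        initiators.add(initiator)
        responders.add(responder)
    return initiators & responders


if __name__ == "__main__":
    trace = [
        ("client-a", "web"),  # web clients only open connections
        ("client-b", "web"),  # the web server only accepts them
        ("p1", "p2"),         # the peers below do both
        ("p2", "p3"),
        ("p3", "p1"),
    ]
    print(sorted(dual_role_nodes(trace)))
```

In this trace the web server and its clients each appear in a single role and are not flagged, while the three peers are.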
However, there are several problems with using these heuristics to detect P2P nodes: (i) False Positives: Several other protocols in the Internet, such as DNS, gaming, streaming, IRC, etc., also exhibit these properties. In other words, these non-P2P applications also use both TCP and UDP protocols to communicate between node pairs. (ii) False Negatives: Not all P2P nodes (or P2P node pairs) satisfy the above heuristics. For example, not all P2P node pairs use both TCP and UDP protocols to talk to each other. Several P2P protocols use TCP port 80 (a port most likely to be open in almost every firewall) as a way to bypass firewalls and hence may not use both TCP and UDP protocols.