1. Field of the Invention
The present invention relates to computers and computer networks. More particularly, the invention relates to profiling Internet traffic flows to identify network applications responsible for the traffic flows.
2. Background of the Related Art
The evolution of the Internet in the last few years has been characterized by dramatic changes to the way users behave, interact and utilize the network. When coupled with the explosion of new applications sitting on the wire and the rising number of political, economic, and legal struggles over appropriate use of network bandwidth, it is easy to understand why now, more than ever, network operators are eager to posses a more precise and broader-in-scope information on which network applications are using their networks. The commercial world answered to this growing demand providing high-speed packet inspection appliances able to process up to 40 Gbps (gigabits per second) of traffic and supporting hundreds of packet content signatures. Still they appear to struggle in keeping up with the exponential rate at which new applications appear in the network. As a result, the attention of the research community has diverted to flow-based behavioral analysis techniques by applying sophisticated data mining algorithms that work on traffic flows (i.e., ignore packet content) to extract and analyze hidden properties of the traffic either in the forms of “social interaction” of hosts engaged in the communication or in the forms of “spatial-temporal analysis” of features such as flow duration, number and size of packets per flow, inter-packet arrival time. Apart from problems such as false positive and false negatives, these techniques are principally aimed at classifying a traffic flow with a broader application class (e.g., “P2P” (peer-to-peer) application class) rather than revealing the specific application (e.g., “P2P-KaZaA” of the many applications in the P2P application class) responsible for the traffic flow.
The demand for bandwidth management tools that optimize network performance and provide quality-of-service guarantees has increased substantially in recent years, in part, due to the phenomenal growth of bandwidth-hungry P2P applications. It is, therefore, not surprising that many network operators are interested in tools to manage traffic such that traffic critical to business or traffic with real-time constraints is given higher priority service on their network. Furthermore, security is becoming a challenge. Networks and institutions of any size are constantly being targeted with more and more sophisticated attacks. Critical for the success of any such tool is its ability to accurately, and in real-time, identify and categorize each network flow by the application responsible for the flow. Identifying network traffic using port numbers and protocol (e.g., layer-four protocols, such as TCP, UDP, etc.) was the norm in the recent past. This approach was successful because many traditional applications (e.g., layer-seven applications, such as HTTP, SMTP, etc.) use port numbers (e.g., port 80, port 25, etc.) assigned by or registered with the Internet Assigned Numbers Authority (IANA). For example, this technique labels all traffic on TCP port 80 to be HTTP traffic, all traffic on TCP port 25 to be SMTP, and so on. This approach is extremely simple to implement and introduces very little overhead on the classifier. The accuracy of this approach, however, has been seriously reduced because of the evolution of applications that do not communicate on standardized ports. Many current generation P2P applications use ephemeral ports, and in some cases, use ports of well-known services such as Web and FTP to make them indistinguishable to the port-based classifier. For example, BitTorrent® (a registered trademark of BitTorrent, Inc., San Francisco, Calif.) can run on TCP port 80 if all the other ports are blocked. In addition, applications can use or abuse random ports for communication. For example, BitTorrent® can communicate on any TCP or UDP network port that is configured by the user. Furthermore, applications can tunnel traffic inside other applications to prevent detection and/or for ease of implementation. For example, BitTorrent® can send all its data inside a HTTP session. These strategies at the application-level have essentially made port number based traffic classification inaccurate and hence ineffective.
To overcome these issues with port-based approach, techniques that rely on application payload have been developed. Typically, a payload content based signature is developed for a given application by reverse engineering the application/protocol. These signatures are agnostic to the application port usage and are usually accurate (i.e., low false positive and false negative rates). However, this approach faces the problem of scalability. In other words, keeping up with the number of applications that come up everyday is impractical due to the laborious manual reverse engineering process. For example, several hundred new P2P and gaming protocols have been introduced over the last several years. Reverse engineering all these applications in a timely manner requires a huge manual effort. In addition, reverse engineering these applications becomes increasingly difficult when applications use encryption to avoid detection. As a consequence, keeping a comprehensive and up-to-date list of application signatures is infeasible.
As is known to those skilled in the art, the web (or “World Wide Web”) is a system of interlinked hypertext documents (i.e., web pages) accessed via the Internet using URLs (i.e., Universal Resource Locators) and IP-addresses. The Internet is composed of machines (e.g., computers or other devices with Internet access) associated with IP-addresses for identifying and communicating with each other on the Internet. The Internet, URL, and IP-addresses are well known to those skilled in the art. The machines composing the Internet are called endpoints on the Internet. Internet endpoints may act as a server, a client, or a peer in the communication activity on the Internet. The endpoints may also be referred to as hosts (e.g., network hosts or Internet hosts) that host information as well as client and/or server software. Network nodes such as modems, printers, routers, and switches may not be considered as hosts. In vast majority of scenarios, information about servers such as the IP-address is publicly available for user to access. In peer-to-peer based communication, in which all endpoints can act both as clients or servers, the association between an end point and the P2P application becomes publicly visible. Even in the classical client-server communication scenario, information about clients such as website user access logs, forums, proxy logs, etc. also stay publicly available. Given that many forms of communication and various endpoint behaviors do get captured and archived, enormous amount of information valuable for profiling or characterizing endpoint behavior at a global scale is publicly available but has not been systematically utilized for such purpose.