1. Field of the Invention
The present invention relates to computers and computer networks. More particularly, the invention relates to profiling Internet traffic flows to identify network applications responsible for the traffic flows.
2. Background of the Related Art
The evolution of the Internet in the last few years has been characterized by dramatic changes in the way users behave, interact, and utilize the network. When coupled with the explosion of new applications sitting on the wire and the rising number of political, economic, and legal struggles over appropriate use of network bandwidth, it is easy to understand why now, more than ever, network operators are eager to possess more precise and broader-in-scope information on which network applications are using their networks. The commercial world answered this growing demand by providing high-speed packet inspection appliances able to process up to 40 Gbps (gigabits per second) of traffic and supporting hundreds of packet content signatures. Still, these appliances appear to struggle to keep up with the exponential rate at which new applications appear in the network. As a result, the attention of the research community has shifted to flow-based behavioral analysis techniques, which apply sophisticated data mining algorithms that work on traffic flows (i.e., ignore packet content) to extract and analyze hidden properties of the traffic, either in the form of “social interaction” of hosts engaged in the communication or in the form of “spatial-temporal analysis” of features such as flow duration, number and size of packets per flow, and inter-packet arrival time. Apart from problems such as false positives and false negatives, these techniques are principally aimed at classifying a traffic flow into a broader application class (e.g., “P2P” (peer-to-peer) application class) rather than revealing the specific application (e.g., “P2P-KaZaA” of the many applications in the P2P application class) responsible for the traffic flow.
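By way of illustration only, the flow-level features named above (flow duration, number and size of packets per flow, and inter-packet arrival time) may be computed from packet timestamps and sizes alone, without reference to packet content. The following is a minimal sketch under stated assumptions: the `Packet` record, the `flow_features` function, and the sample values are hypothetical and are not taken from any particular behavioral analysis technique.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Packet:
    """Hypothetical per-packet record: arrival time and size only, no payload."""
    timestamp: float  # seconds
    size: int         # bytes


def flow_features(packets):
    """Compute simple flow-level features without looking at packet content."""
    if not packets:
        raise ValueError("a flow must contain at least one packet")
    ordered = sorted(packets, key=lambda p: p.timestamp)
    times = [p.timestamp for p in ordered]
    sizes = [p.size for p in ordered]
    # Gaps between consecutive packets (empty for a single-packet flow).
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "duration": times[-1] - times[0],
        "packet_count": len(ordered),
        "total_bytes": sum(sizes),
        "mean_packet_size": mean(sizes),
        "mean_inter_arrival": mean(gaps) if gaps else 0.0,
    }


# Illustrative values only.
example_flow = [Packet(0.0, 100), Packet(0.5, 300), Packet(2.0, 200)]
features = flow_features(example_flow)
```

A feature vector of this kind is the sort of input a data mining algorithm may operate on; by itself it carries no application name, which is consistent with the observation that such techniques tend to yield an application class rather than a specific application.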
The demand for bandwidth management tools that optimize network performance and provide quality-of-service guarantees has increased substantially in recent years, in part due to the phenomenal growth of bandwidth-hungry P2P applications. It is, therefore, not surprising that many network operators are interested in tools to manage traffic such that traffic critical to business or traffic with real-time constraints is given higher priority service on their network. Furthermore, security is becoming increasingly challenging. Networks and institutions of any size are constantly being targeted with more and more sophisticated attacks. Critical to the success of any such tool is its ability to identify and categorize each network flow, accurately and in real time, by the application responsible for the flow. Identifying network traffic using port numbers was the norm in the recent past. This approach was successful because many traditional applications use port numbers assigned by or registered with the Internet Assigned Numbers Authority (IANA). The accuracy of this approach, however, has been seriously reduced by the evolution of applications that do not communicate on standardized ports. Many current-generation P2P applications use ephemeral ports and, in some cases, use the ports of well-known services such as Web and FTP, which makes them indistinguishable from those services to a port-based classifier.
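The limitation of port-based identification described above may be illustrated with a minimal sketch. The table below lists a handful of IANA-assigned ports; the `classify_by_port` function and the example flows are hypothetical and serve only to show why the approach loses accuracy.

```python
# A few IANA-assigned ports, for illustration; a real classifier would
# consult the full registry.
WELL_KNOWN_PORTS = {
    21: "FTP",
    25: "SMTP",
    53: "DNS",
    80: "Web",
    443: "Web",
}


def classify_by_port(src_port, dst_port):
    """Label a flow by the first registered port found, else 'unknown'."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"


# Genuine Web traffic to port 80.
web_flow = classify_by_port(51234, 80)

# A P2P application that deliberately uses port 80 presents the same ports,
# so the classifier cannot tell it apart from Web traffic.
p2p_on_web_port = classify_by_port(51234, 80)

# A P2P application using ephemeral ports on both sides matches nothing.
p2p_ephemeral = classify_by_port(49152, 50001)
```

In this sketch the two flows on port 80 receive the same label regardless of which application generated them, and the ephemeral-port flow receives no label at all.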
Techniques that rely on inspection of packet contents have been proposed to address the diminished effectiveness of port-based classification. These approaches attempt to determine whether or not a flow contains a characteristic signature of a known application. However, packet-inspection approaches face two severe limitations. First, these techniques only identify traffic for which signatures are available. Maintaining an up-to-date list of signatures is a daunting task, because the information needed to do so is rarely available, up to date, or complete. Furthermore, the traditional ad hoc growth of IP (i.e., Internet Protocol) networks, the continuing rapid proliferation of applications of different kinds, and the relative ease with which almost any user can add a new application to the traffic mix in the network with no centralized registration are some of the factors contributing to this “knowledge gap”. Second, packet inspection techniques work only if full packets (i.e., header and payload) are available as input and are completely ineffective when only coarser information at the traffic flow level is available. Unfortunately, only a few service providers today have instrumented their networks with packet inspection appliances, while the majority have access only to traffic flows extracted directly from the routers.
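The two limitations noted above may be illustrated with a minimal sketch of signature matching. The signature table is a deliberately simplified assumption (commercial appliances support hundreds of far richer signatures), and the `classify_by_payload` function is hypothetical.

```python
# Simplified payload prefixes, for illustration only.
SIGNATURES = {
    "HTTP": b"GET ",
    "SSH": b"SSH-",
    "BitTorrent": b"\x13BitTorrent protocol",
}


def classify_by_payload(payload):
    """Match the start of a payload against known application signatures.

    `payload` is None when only flow-level records are available.
    """
    if payload is None:
        # Second limitation: without packet content there is nothing to inspect.
        return "unclassifiable"
    for application, prefix in SIGNATURES.items():
        if payload.startswith(prefix):
            return application
    # First limitation: traffic with no matching signature stays unidentified.
    return "unknown"


http_label = classify_by_payload(b"GET /index.html HTTP/1.1\r\n")
new_app_label = classify_by_payload(b"\x00\x01new-application-hello")
flow_only_label = classify_by_payload(None)
```

In this sketch a new application remains "unknown" until a signature is added to the table, and a flow record that carries no payload cannot be classified at all.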
The web (or “World Wide Web”) is a system of interlinked hypertext documents (i.e., web pages) accessed via the Internet using URLs (i.e., Uniform Resource Locators) and IP addresses. The Internet is composed of machines (e.g., computers or other devices with Internet access) associated with IP addresses for identifying and communicating with each other on the Internet. The Internet, URLs, and IP addresses are well known to those skilled in the art. The machines composing the Internet are called endpoints on the Internet. Internet endpoints may act as a server, a client, or a peer in the communication activity on the Internet. The endpoints may also be referred to as hosts (e.g., network hosts or Internet hosts) that host information as well as client and/or server software. Network nodes such as modems, printers, routers, and switches may not be considered hosts. In the vast majority of scenarios, information about servers, such as their IP addresses, is publicly available for users to access. In peer-to-peer based communication, in which all endpoints can act as both clients and servers, the association between an endpoint and the P2P application becomes publicly visible. Even in the classical client-server communication scenario, information about clients, such as website user access logs, forums, and proxy logs, also remains publicly available. Given that many forms of communication and various endpoint behaviors do get captured and archived, an enormous amount of information valuable for profiling or characterizing endpoint behavior at a global scale is publicly available but has not been systematically utilized for such a purpose.