Evaluating the performance of network protocols and mechanisms conventionally requires careful experimentation in simulators and testbed environments. As with any other performance analysis task, a critical element of these experiments is the availability of a realistic workload or set of workloads that can stress the technology in a manner that is representative of the deployment conditions. Despite the increasing availability of measurement results and packet traces from real networks, there is currently no accepted method for constructing realistic workloads. Consequently, networking researchers and performance analysts have relied on simplistic or incomplete models of network traffic for their experiments.
One essential difference between network workloads and other workloads, such as those of storage systems, is the closed feedback loop created by the ubiquitous Transmission Control Protocol (TCP). This protocol, responsible for the end-to-end delivery of the vast majority of the traffic on the Internet, reacts to network conditions by retransmitting lost packets and adjusting sending rates to the perceived level of congestion in the network. As a consequence, it is not enough to simply collect a trace of packets traversing a network element and replay the trace to conduct experiments, since the new conditions in the experimental environment would affect the behavior of TCP in ways that are not reflected in the packet trace. In other words, replaying a packet trace breaks the feedback loop of TCP. For example, it is incorrect to collect a packet trace on a 1-Gbps link with a mean load of 650 Mbps and use it to evaluate a router servicing an optical carrier (OC) link transmitting at 622 Mbps (OC-12), because the replay would not capture the back-off effect of TCP sources as they detect congestion in the overloaded OC link. The analysis of the results of such an experiment would be completely misleading, because the traffic represents a set of behaviors of TCP sources that can never occur in practice. For example, the rate of queue overflow would be much larger in the experiment than in a real deployment, where TCP sources would react to congestion and reduce the aggregate sending rate below the original 650 Mbps (thereby quickly reducing the number of drops). In such a scenario, an experiment that included estimating a metric related to response time, for example by looking at the duration of each TCP connection, would detect virtually no difference between the original trace and the replay. In reality, however, the decrease in sending rate by the TCP sources in the congested scenario would result in much longer response times.
Thus, valid experiments must preserve the feedback loop in TCP. Traffic generation must be based on some form of closed-loop process, and not on simple open-loop packet-level replays.
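The 650 Mbps versus 622 Mbps example above can be illustrated with a toy fluid model of a single bottleneck queue. The following is only a minimal sketch under stated assumptions: the buffer size, the time step, and the crude aggregate back-off rule (reduce the rate by 10% on overflow, otherwise probe upward) are hypothetical choices made for illustration and are not a faithful model of TCP congestion control.

```python
def simulate(closed_loop, capacity_mbps=622.0, offered_mbps=650.0,
             buffer_mbit=5.0, step_s=0.001, steps=60_000):
    """Toy fluid model of one bottleneck queue; returns megabits dropped.

    closed_loop=False replays the offered rate unchanged (open loop).
    closed_loop=True applies an assumed aggregate back-off when the queue
    overflows, standing in for the reaction of TCP sources to loss.
    """
    rate = offered_mbps   # current aggregate sending rate (Mbps)
    queue = 0.0           # queue occupancy (Mbit)
    dropped = 0.0         # total traffic dropped (Mbit)
    for _ in range(steps):
        # The queue grows when the sending rate exceeds link capacity.
        queue = max(0.0, queue + (rate - capacity_mbps) * step_s)
        overflow = max(0.0, queue - buffer_mbit)
        queue -= overflow
        dropped += overflow
        if closed_loop:
            if overflow > 0.0:
                rate *= 0.9                             # assumed back-off
            else:
                rate = min(offered_mbps, rate + 0.05)   # assumed probing
    return dropped


if __name__ == "__main__":
    print("open-loop replay, Mbit dropped: ", simulate(closed_loop=False))
    print("closed-loop sources, Mbit dropped:", simulate(closed_loop=True))
```

In this sketch the open-loop replay keeps dropping the 28 Mbps excess for as long as the experiment runs, whereas the closed-loop variant drops far less because the sources reduce their rate after each overflow.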
The fundamental idea of closed-loop traffic generation is to characterize the sources of traffic that drive the behavior of TCP. In this approach, experimentation generally proceeds by simulating the use of the (simulated or real) network by a given population of users running applications, such as File Transfer Protocol (FTP) clients or web browsers. Synthetic workload generators are therefore used to inject data into the network according to a model of how the applications or users behave. This paradigm of simulation follows the philosophy of using source-level descriptions of applications advocated in Paxson and Floyd, “Wide area traffic: the failure of Poisson modeling,” IEEE/ACM Transactions on Networking, 3(3):226-244, 1995. The critical problem in doing network simulations is then generating application-dependent, network-independent workloads that correspond to contemporary models of application or user behavior.
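To make the notion of a source-level, network-independent workload concrete, the sketch below generates a synthetic user session as a sequence of request sizes, response sizes, and think times. The function name and the distributions and parameters (log-normal requests, Pareto responses, exponential think times) are placeholders assumed for illustration only; they are not taken from any measured model.

```python
import random


def generate_user_session(rng, n_exchanges):
    """Return a list of (request_bytes, response_bytes, think_time_s) tuples.

    The description says what the application wants to send and when the
    user pauses; it says nothing about packets, rates, or losses, which are
    left to TCP and the network under test.
    """
    session = []
    for _ in range(n_exchanges):
        request_bytes = int(rng.lognormvariate(5.5, 0.5))     # assumed shape
        response_bytes = int(rng.paretovariate(1.2) * 1000)   # assumed heavy tail
        think_time_s = rng.expovariate(1.0 / 5.0)             # assumed 5 s mean
        session.append((request_bytes, response_bytes, think_time_s))
    return session


if __name__ == "__main__":
    rng = random.Random(1)   # fixed seed so a run can be repeated
    for exchange in generate_user_session(rng, 3):
        print(exchange)
```

A traffic generator driven by such a session would open a real or simulated TCP connection, send each request, wait for the response to be delivered, and then pause for the think time, so that transfer durations emerge from the network conditions rather than from the model.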
The networking community, however, lacks contemporary models of application workloads. More precisely, it lacks validated tools and methods for going from measurements of network traffic to the generation of synthetic workloads that are statistically representative of the applications using the network. Current workload modeling efforts tend to focus on one or a few specific applications. One example is the modeling of web browsing. The status quo for modeling web workloads is a set of generators based on web-browsing measurements that were conducted several years ago on a limited set of users. The measurements have not been maintained and updated as uses of the web have evolved. Thus, even in the case of the most widely studied application, there remains no contemporary model of HTTP workloads and no model that accounts for protocol improvements (e.g., the use of persistent connections in HTTP/1.1) or newer uses of the web for peer-to-peer file sharing and remote email access.
A major limitation of current source-level modeling approaches is that they construct application-specific workload models. Given the complexity inherent in this approach (e.g., the effort involved in understanding, measuring, and modeling specific application-layer protocols), it is understandable that workload models usually consider only one or a small number of applications. However, few (if any) networks today carry traffic from only one or two applications or application classes. Most links carry traffic from hundreds or perhaps thousands of applications in proportions that vary widely from link to link.
This issue of application mixes is a serious concern for networking researchers. For example, in order to evaluate the amount of buffering required in a router under real conditions or the effect of a TCP protocol enhancement, one of the factors to be considered is the impact on and from the applications that consume the majority of bandwidth on the Internet today and that are projected to do so in the future. It would be natural to consider the performance implications of the scheme on web usage (e.g., the impact on throughput or request-response times), on peer-to-peer applications, on streaming media, on other non-interactive applications such as mail and news, and on the ensemble of all applications mixed together. The majority of previous work in workload modeling has focused on the development of source-level models of single applications. Because of this, there are no models for mixes of networked applications. Worse, the use of analytic (distribution-based) models of specific TCP applications does not scale to developing workload models of application mixes composed of hundreds of applications. Typically, when constructing workload models, the only means of identifying application-specific traffic in a network is to classify connections by port numbers. For connections that use common reserved ports (e.g., port 80) one can, in theory, infer the application-level protocol in use (HTTP) and, with knowledge of the operation of the application-level protocol, construct a source-level model of the workload generated by the application. However, one problem with this approach for HTTP is that a number of applications, such as those using the Simple Object Access Protocol (SOAP), are essentially using port 80 as an access mechanism to penetrate firewalls and middleboxes.
A deeper problem with this approach is that a growing number of applications use port numbers that are not readily known, e.g., they have not been registered with the Internet Assigned Numbers Authority (IANA). Worse, many applications are configured to use port numbers assigned to other applications (allegedly) as a means of hiding their traffic from detection by network administrators or for passing through firewalls. For example, in a study of traffic received from two broadband Internet service providers by AT&T in 2003, the source (application) of 32-48% of the bytes could not be identified. Similarly, analyses of backbone traffic in Sprint and Internet2 networks did not identify the source of 25-40% of bytes, depending on the studied link. However, even if all connections observed on a network could be uniquely associated with an application, constructing workload models requires knowledge of the (sometimes proprietary or hidden) application-level protocol to deconstruct a connection and understand its behavior. This is a very time-consuming process, and doing it for hundreds of applications (or even the top twenty) in network traffic is a daunting task.
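The port-based classification discussed above, and its limits, can be sketched as follows. This is only an illustration: the port table lists a handful of well-known assignments, and the sample connections are invented data, not the measurements cited above.

```python
# A few well-known port assignments; real traffic involves far more ports.
WELL_KNOWN_PORTS = {21: "ftp", 25: "smtp", 80: "http", 119: "nntp", 443: "https"}


def classify(connections):
    """Sum bytes per inferred application.

    Each connection is a (src_port, dst_port, nbytes) tuple. A connection
    is labeled "unknown" when neither port is in the table.
    """
    totals = {}
    for src_port, dst_port, nbytes in connections:
        app = (WELL_KNOWN_PORTS.get(dst_port)
               or WELL_KNOWN_PORTS.get(src_port)
               or "unknown")
        totals[app] = totals.get(app, 0) + nbytes
    return totals


def unknown_fraction(totals):
    """Fraction of all bytes whose application could not be inferred."""
    total = sum(totals.values())
    return totals.get("unknown", 0) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample: the last connection stands for a non-browser
    # application tunneled over port 80, which is silently counted as "http".
    sample = [(51000, 80, 4000), (51001, 25, 1000),
              (40000, 51002, 3000), (51003, 80, 2000)]
    totals = classify(sample)
    print(totals, unknown_fraction(totals))
```

The sketch shows both failure modes named in the text: bytes on unregistered ports end up in the "unknown" bucket, and bytes from applications that borrow port 80 are attributed to HTTP without any indication that the label is wrong.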
Conventional methods have been unable to construct statistically sound workload models from network packet traces that capture the richness in the mix of applications using a given link without requiring knowledge of the associated application-level protocols. Accordingly, there exists a need for methods, systems, and computer program products for modeling and simulating application-level traffic characteristics in a network based on transport and network layer header information, which is application-neutral information.