Collection and summarization of network traffic data is necessary for many applications, including billing, provisioning, anomaly detection, inferring traffic demands, and configuring packet filters and routing protocols. Traffic consists of interleaved packets of multiple flows, but the summaries should support queries on statistics of subpopulations of IP flows, such as the amount of traffic that belongs to a particular protocol, originates from a particular AS, or both. These queries are posed after the sketch is produced. Therefore, it is critical to retain sufficient metadata and to provide estimators that facilitate such queries.
IP packet streams are processed in real time at the routers by systems such as Cisco's sampled NetFlow (NF), or by software tools such as Gigascope [8]. Two critical resources in the collection of the data are the high-speed memory (usually expensive, fast SRAM) and the CPU power used to process the incoming packets. The available memory limits the number of cached flows that can be actively counted. The processing power limits the level of per-packet processing and the fraction of packets that can undergo higher-level processing.
The common practice is to obtain periodic summaries (sketches) of the traffic by applying a data stream algorithm to the raw packet stream. NF samples packets at random at a fixed rate. Once a flow is sampled, it is cached, and a counter is created that counts subsequent sampled packets of the same flow. The number of counters is therefore the number of distinct sampled flows. The packet-level sampling that NF performs serves two purposes. First, it addresses the memory constraint by reducing the number of distinct flows that are cached (the bulk of small flows is not sampled); without sampling, a counter would be needed for each distinct flow in the original stream. Second, the sampling reduces the processing power needed for the aggregation, since only sampled packets require the higher-level processing needed to determine whether they belong to a cached flow.
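The NF counting scheme described above can be sketched as follows. This is an illustrative simulation, not Cisco's implementation: `packets` is assumed to be an iterable of flow keys, one per packet, and `p` is the fixed sampling rate.

```python
import random
from collections import Counter

def sampled_netflow(packets, p, seed=0):
    """Sketch of NF-style fixed-rate packet sampling (illustrative).
    Each packet is sampled independently with probability p. A flow is
    cached when its first packet is sampled, and its counter counts only
    subsequently *sampled* packets of that flow."""
    rng = random.Random(seed)
    counters = Counter()
    for flow in packets:
        if rng.random() < p:     # per-packet coin flip at the fixed rate p
            counters[flow] += 1  # creates the counter on the first sampled packet
    return counters
```

Since each packet is counted with probability p, dividing a flow's counter by p gives an unbiased estimate of the flow's true packet count.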
An algorithm that is able to count more packets than NF using the same number of statistics counters (memory) is sample-and-hold (SH) [13, 12]. With SH, as with NF, packets are sampled at a fixed rate, and once a packet from a particular flow is sampled, the flow is cached. The difference is that with SH, once a flow is actively counted, all subsequent packets that belong to the same flow are counted (with NF, only sampled packets are counted). SH sketches are considerably more accurate than NF sketches [13, 12]. A disadvantage of SH over NF, however, is that the summarization module must process every packet in order to determine whether it belongs to a cached flow. This additional processing makes it less practical for high-volume routers.
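The difference from NF is a single branch in the per-packet loop: cached flows are counted unconditionally, and the coin flip applies only to uncached flows. A hedged sketch under the same assumptions as above (`packets` as an iterable of flow keys, `p` the fixed sampling rate):

```python
import random
from collections import Counter

def sample_and_hold(packets, p, seed=0):
    """Illustrative sketch of sample-and-hold. Every packet is inspected;
    if its flow is already cached, the counter is incremented. Otherwise
    the packet is sampled with probability p, and on success a counter is
    created. Hence all packets after the first sampled one are counted."""
    rng = random.Random(seed)
    counters = Counter()
    for flow in packets:
        if flow in counters:        # cached flow: count every packet
            counters[flow] += 1
        elif rng.random() < p:      # uncached flow: fixed-rate coin flip
            counters[flow] = 1
    return counters
```

The `flow in counters` lookup on every packet is exactly the per-packet processing cost that makes SH harder to deploy on high-volume routers.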
NF and SH use a fixed packet sampling rate; as a result, the number of distinct flows that are sampled, and therefore the number of statistics counters required, is variable. When conditions are stable, the number of distinct flows sampled at a given rate has small variance, so one can manually adjust the sampling rate so that the number of counters does not exceed the memory limit while most counters are utilized [12]. Anomalies such as DDoS attacks, however, can greatly increase the number of distinct flows. A fixed-sampling-rate scheme cannot react to such anomalies, as its memory requirement would exceed the available memory; anomalies would therefore disrupt the measurement or degrade router performance. These issues are addressed by adaptive variants, which include adaptive sampled NetFlow (ANF) [13, 11, 16] and adaptive SH (ASH) [13, 12]. These variants adaptively decrease the sampling rate and adjust the values of the statistics counters so as to emulate sampling with the lower rate.
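One way the counter adjustment in an ANF-style rate decrease can be emulated is by binomial thinning: each counted packet, originally sampled at rate `p_old`, is retained with probability `p_new / p_old`, so the surviving counters are distributed as if packets had been sampled at the lower rate all along. The sketch below illustrates this idea only; it is not the exact adaptation procedure of [13, 11, 16], and the function name and interface are assumptions.

```python
import random
from collections import Counter

def anf_decrease_rate(counters, p_old, p_new, seed=0):
    """Illustrative sketch of one ANF-style adaptation step. Each counted
    packet survives independently with probability p_new / p_old, emulating
    sampling at the lower rate p_new. Flows whose counters drop to zero
    are evicted, freeing memory for new flows."""
    assert 0.0 <= p_new <= p_old
    rng = random.Random(seed)
    keep = p_new / p_old
    new_counters = Counter()
    for flow, count in counters.items():
        # Thin the counter: flip a coin for each previously counted packet.
        survived = sum(rng.random() < keep for _ in range(count))
        if survived > 0:
            new_counters[flow] = survived
    return new_counters
```

In practice such a step would be triggered when the number of cached flows approaches the memory limit, repeatedly lowering the rate until the counters fit.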