This disclosure relates generally to the field of network traffic analysis, and more specifically to determining heavy hitters in a stream of elements.
Determining the largest traffic flows in a network is important for many network management applications; this determination is known as the heavy-hitter problem. Heavy-hitter information is useful for applications such as identifying denial of service (DoS) attacks, monitoring traffic growth trends, provisioning network resources and link capacities, and identifying heavy network users that may need to reduce usage. In addition, determination of heavy hitters has applications for search engines that may compute heavy-hitter queries in order to optimize caching for such queries, and dynamic content providers that may keep track of frequently-clicked advertisements.
The heavy-hitter problem involves finding the elements of a stream whose frequency exceeds a user-selected threshold. Each element may represent a flow, and a sequence of identical elements may represent bytes or packets of that flow. A flow is typically defined as the set of packets that have common values in one or more packet-header fields. The most common flow definition is a five-tuple of the following packet-header fields: source and destination IP addresses, source and destination port numbers, and protocol number. In a simple solution, an element identifier is stored for each traffic flow together with a counter that records the number of occurrences of that flow; sorting the elements by their counters then yields a list of heavy-hitting flows. However, this solution may not be feasible in some situations. Data streams may have a very large number of distinct elements, which may result in overwhelming and unpredictable memory requirements for storing element identifiers and counters. Consider a NetFlow collector that computes the traffic flows that have generated the most traffic over a period of a month. In a small enterprise network, the number of unique five-tuple flows over a month may be close to 100 million, which corresponds to 2.5 GBytes of memory for storing 136-bit flow identifiers and 64-bit counters (25 bytes per flow). Such large memory requirements prohibit the use of the simple solution in NetFlow collectors and in other systems that compute heavy hitters of data streams with large numbers of distinct elements. Using a large amount of disk space to store flow identifiers and counters may also severely degrade system performance by slowing processing.
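The simple solution and its memory cost can be illustrated with a minimal Python sketch. The function names `exact_heavy_hitters` and `table_size_bytes` are hypothetical, and expressing the threshold as a fraction of the total stream length is an assumption made here for illustration; the estimate also ignores any per-entry overhead of a real table.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def exact_heavy_hitters(stream: Iterable[Hashable],
                        threshold: float) -> List[Tuple[Hashable, int]]:
    """Keep one exact counter per distinct element, then report the
    elements whose count exceeds threshold * (stream length).

    Assumption: the threshold is a fraction of the total element count.
    """
    counts: Counter = Counter()
    total = 0
    for element in stream:
        counts[element] += 1  # one table entry per distinct element
        total += 1
    # most_common() returns entries sorted by descending count
    return [(e, c) for e, c in counts.most_common() if c > threshold * total]


def table_size_bytes(num_flows: int,
                     id_bits: int = 136,
                     counter_bits: int = 64) -> int:
    """Raw storage for identifiers and counters, with no table overhead."""
    return num_flows * (id_bits + counter_bits) // 8


if __name__ == "__main__":
    print(exact_heavy_hitters(["a", "b", "a", "c", "a", "b"], 0.3))
    print(table_size_bytes(100_000_000))  # 2.5 GBytes for 100 million flows
```

Because the table grows with the number of distinct elements rather than with the number of heavy hitters, memory use in this sketch is unbounded for streams with many distinct flows.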
Alternative techniques compute heavy hitters using fixed or bounded memory resources. Lossy counting approximates the heavy hitters of a data stream by estimating the frequencies of its elements. Lossy counting may operate as follows: an input stream of elements is split into fixed-size windows, and each window is processed sequentially. For each element in a window, an entry is inserted into a table, or, if the element is already in the table, the element's frequency counter is incremented. At the end of each window, elements of low frequency are removed from the table. The table therefore maintains a relatively small number of entries. A deterministic error bound is also stored for each element in the table; the error bound is set when the element is inserted and is equal to the index of the current window minus 1. The error bound reflects the potential error in the element's estimated frequency, because the element may have been removed from the table at the end of one or more prior windows. An element with a small error bound is more likely to be removed from the table than an equal-frequency element having a large error bound. However, lossy counting may still require a large amount of memory and processing power, and the computed heavy hitters may include false positives.
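The lossy counting procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the names `lossy_count` and `heavy_hitters` and the parameters `epsilon` and `support` are hypothetical, the window size is assumed to be the ceiling of 1/epsilon, and the removal rule (count plus error bound not exceeding the current window index) is the rule used in the commonly published form of lossy counting rather than a detail stated in the text.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple


def lossy_count(stream: Iterable[Hashable],
                epsilon: float) -> Tuple[Dict[Hashable, List[int]], int]:
    """Return (table, n), where table maps element -> [count, error_bound]."""
    width = math.ceil(1 / epsilon)        # assumed fixed window size
    table: Dict[Hashable, List[int]] = {}
    n = 0
    for element in stream:
        n += 1
        window = math.ceil(n / width)     # 1-based index of the current window
        if element in table:
            table[element][0] += 1        # update the frequency counter
        else:
            # error bound = index of the current window minus 1
            table[element] = [1, window - 1]
        if n % width == 0:                # end of the current window
            # assumed rule: drop entries with count + error_bound <= window
            stale = [k for k, (f, d) in table.items() if f + d <= window]
            for key in stale:
                del table[key]
    return table, n


def heavy_hitters(table: Dict[Hashable, List[int]], n: int,
                  support: float, epsilon: float) -> List[Tuple[Hashable, int]]:
    """Report elements whose estimated count is at least (support - epsilon) * n."""
    hits = [(e, f) for e, (f, _) in table.items() if f >= (support - epsilon) * n]
    return sorted(hits, key=lambda pair: -pair[1])


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "d", "a", "e"]
    table, n = lossy_count(stream, epsilon=0.25)
    print(heavy_hitters(table, n, support=0.5, epsilon=0.25))
```

With equal counts, an entry inserted in an early window (small error bound) fails the retention test sooner than one inserted later, which matches the removal behavior described above. Lowering the reporting threshold by `epsilon` avoids missing true heavy hitters but is also what can admit false positives.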
There exists a need for a method of determining heavy hitters that is accurate while requiring relatively low amounts of memory and processing power.