As networks have gotten faster and network traffic has exploded, network traffic measurement has become increasingly important for allocating network resources and ensuring network security. At the same time, this increased throughput has made these types of measurements, which are preferably taken at the line speed of the network, more challenging. Today, extremely efficient algorithms and data structures are needed to effectively measure such traffic.
For the purposes of many network traffic measurement problems, a network contact can be defined as a source and destination pair, for which the source has sent a network message, for example a network packet, to the destination. The source and destination can each be identified by a network address, such as an Internet Protocol (IP) address, a port number, a MAC address, or other addressing scheme; other fields in a packet header; or any combination thereof.
Spread estimation is an exemplary network traffic measurement problem with many practical applications. Spread estimation can refer to the estimation of the number of distinct destinations to which a source has sent messages during a measurement period (called the spread of the source or “fan-out”) or the estimation of the number of distinct sources which have sent messages to a particular destination during a measurement period (called the spread of the destination or “fan-in”). Intrusion detection systems typically use fan-out to detect port scans, in which an external host attempts to establish too many connections to different internal hosts or different ports of the same host. Fan-out can also be used to predict the infection rate of a worm by estimating the spread of each of the infected hosts. Fan-in can be used to detect distributed denial-of-service attacks, in which too many hosts send traffic to a receiver, i.e., the spread of a destination is abnormally high. A large server farm may use fan-in to estimate the spread of each server (as a destination) in order to assess how popular the server's content is, which provides guidance for resource allocation. An institutional gateway may use fan-in to monitor outbound traffic and determine the spread of each external web server that has been accessed recently. This information can also be used as an indication of the server's popularity, which helps the local proxy to determine the cache priority of the web content.
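By way of illustration only, the definitions of fan-out and fan-in given above can be expressed as an exact computation over a list of contacts. The following is a minimal sketch; the function name `exact_spreads` and the sample addresses are hypothetical and are not drawn from any particular system. It keeps a full set of addresses per source and per destination, which is exactly the kind of storage that does not fit in a small high-speed memory.

```python
from collections import defaultdict


def exact_spreads(contacts):
    """Return (fan_out, fan_in) for an iterable of (source, destination) pairs."""
    destinations_of = defaultdict(set)  # source -> distinct destinations contacted
    sources_of = defaultdict(set)       # destination -> distinct sources seen
    for source, destination in contacts:
        destinations_of[source].add(destination)
        sources_of[destination].add(source)
    fan_out = {s: len(dests) for s, dests in destinations_of.items()}
    fan_in = {d: len(srcs) for d, srcs in sources_of.items()}
    return fan_out, fan_in


# Hypothetical contacts using documentation-only address ranges.
contacts = [
    ("192.0.2.1", "198.51.100.10"),
    ("192.0.2.1", "198.51.100.11"),
    ("192.0.2.1", "198.51.100.10"),  # repeated contact, counted once
    ("192.0.2.2", "198.51.100.10"),
]
fan_out, fan_in = exact_spreads(contacts)
print(fan_out["192.0.2.1"])     # 2 distinct destinations
print(fan_in["198.51.100.10"])  # 2 distinct sources
```

Because the sets grow with every new distinct contact, this exact approach serves only as a reference point for the estimation techniques discussed in this section.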
A spread estimator may be a software, hardware, or firmware module on a network router (or firewall) that inspects network messages as they arrive and estimates the spread of each source or destination. A spread estimator typically implements two functions. The first function is to store contact information extracted from arriving messages or packets. The second function is to estimate the spread of each source based on the collected information. By exchanging the roles of source and destination, the same spread estimator can also be used to measure the spread of a given destination.
A major technical challenge for spread estimation and other network traffic measurement problems is how to fit the spread estimator or other measurement module in a small high-speed memory. Today's core routers forward most network packets on a fast forwarding path between network interfaces that bypasses the CPU and main memory. To keep up with the line speed, it is desirable to operate the measurement module in fast but expensive, size-limited memory, such as SRAM. Because many other essential routing, security, and performance functions may also run from such memory, it is expected that the amount of high-speed memory allocated for each measurement module will be small. Moreover, depending on the application, the measurement period can be quite long, which requires the module to store an enormous number of contacts or other information. For example, to measure the popularity of web servers, the measurement period is likely to be hours or even days. Hence, each measurement module's data structure must be as compact as possible.
Returning to the example of spread estimation, consider the following scenario. An Internet traffic trace collected from the main gateway router at the University of Florida on a day in 2005 contained around 10 million distinct contacts from 3.5 million distinct external sources. Assuming a network router can allocate only 1 MB of high-speed memory for a spread estimator, an average of only 2.3 bits can be allocated for tracking the contacts from each distinct source over a day-long measurement period. Today's traffic likely far exceeds these figures, and therefore would require an even more compact storage solution.
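The per-source memory budget in this scenario follows from simple arithmetic, which can be checked as shown below. This sketch assumes that 1 MB is taken as 10^6 bytes; if 2^20 bytes were used instead, the result would be about 2.4 bits per source. The variable names are illustrative only.

```python
MEMORY_BYTES = 1_000_000        # assumption: 1 MB taken as 10**6 bytes
DISTINCT_SOURCES = 3_500_000    # distinct external sources in the trace
DISTINCT_CONTACTS = 10_000_000  # approximate distinct contacts in the trace

bits_per_source = MEMORY_BYTES * 8 / DISTINCT_SOURCES
contacts_per_source = DISTINCT_CONTACTS / DISTINCT_SOURCES
bits_per_contact = MEMORY_BYTES * 8 / DISTINCT_CONTACTS

print(f"{bits_per_source:.1f} bits per source")          # 2.3
print(f"{contacts_per_source:.1f} contacts per source")  # 2.9
print(f"{bits_per_contact:.1f} bits per contact")        # 0.8
```

On these figures, less than one bit is available per distinct contact, so no scheme that stores even a single address per contact can fit.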
Existing estimators can be classified into several categories based on how they store contact information: 1) storing per-flow information, as in Snort and FlowScan; 2) storing per-source information, as in Bitmap Algorithms and One-Level/Two-Level Algorithms; and 3) mapping sources to the columns of a bit matrix, where each column stores contacts from all sources that are mapped to it, as in the online streaming module proposed by Zhao et al. in “Detection of Super Sources and Destinations in High-Speed Networks: Algorithms, Analysis and Evaluation,” (IEEE JSAC, vol. 24, no. 10, October 2006) (referred to hereinafter merely as “OSM”). In the above-described scenario, the first two categories will fail because 2.3 bits are not enough to store the contacts of each of 3.5 million distinct sources. Indeed, Snort maintains a record for each active connection and a connection counter for each source IP. Thus, keeping the per-flow state tends to be too memory-intensive for a high-speed router, particularly when the fast memory allocated to the function of spread estimation is small. In addition, the One-Level/Two-Level Algorithms maintain two hash tables, where one hash table stores all distinct contacts that occurred during the measurement period, including the source and destination addresses of each contact, and the other hash table stores the source addresses and a contact counter for each source address. As discussed below, OSM is also ineffective because mapping multiple sources to one column introduces significant, irremovable errors in spread estimation.
For the One-Level/Two-Level Algorithms, a probabilistic sampling technique is often used to reduce the number of contacts to be stored. In addition, instead of storing the actual source/destination addresses in each sampled contact, bitmaps may be used to save space. For this technique, each source is assigned a bitmap where a bit is set for each destination that the source contacts. The number of contacts stored in a bitmap can be estimated based on the number of bits set. An index structure is used to map a source to its bitmap. The index structure is typically a hash table where each entry stores a source address and a pointer to the corresponding bitmap. However, such a spread estimator cannot fit in a tight memory space where only a few bits are available for each source. If each bitmap is sufficiently long, the number of bitmaps will have to be reduced and there will not be enough bitmaps for all sources.
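The per-source bitmap technique described above can be sketched as follows. This is an illustrative sketch only, not the implementation of any cited algorithm. The names `SourceBitmap`, `bit_index`, `index`, and `store_contact` are hypothetical, SHA-256 stands in for whatever hash function a router would actually use, and the probabilistic sampling step is omitted. The text above states only that the count is estimated from the number of bits set; the sketch assumes the well-known linear-counting formula, n ≈ -m ln(z/m), where m is the bitmap length and z is the number of zero bits.

```python
import hashlib
import math


def bit_index(destination, num_bits):
    """Map a destination address to a bit position (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256(destination.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


class SourceBitmap:
    """One bitmap per source; one bit is set per contacted destination."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = [0] * num_bits

    def record(self, destination):
        self.bits[bit_index(destination, self.num_bits)] = 1

    def estimate(self):
        """Linear-counting estimate (assumed formula): -m * ln(zero_bits / m)."""
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            return float("inf")  # bitmap saturated; no finite estimate
        return -self.num_bits * math.log(zeros / self.num_bits)


# Index structure: a hash table from source address to that source's bitmap.
index = {}


def store_contact(source, destination, num_bits=1024):
    if source not in index:
        index[source] = SourceBitmap(num_bits)
    index[source].record(destination)
```

In this sketch every source carries a 1024-bit bitmap plus an index entry, which illustrates why the approach cannot fit when only a few bits are available per source.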
One solution to the problem of not having enough bitmaps for all sources is to share each bitmap among multiple sources. For example, a simple spread estimator may use a bit matrix whose columns are bitmaps. Sources are assigned to columns through a hash function. For each contact, the source address is used to locate the column and, through another hash function, the destination address is used to determine a bit in the column to be set. The number of contacts stored in a column can be estimated based on the number of bits set. However, the estimation is for contacts made by all sources that are assigned to the column, not for the contacts of a specific source under query.
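The shared bit matrix described above, and the limitation that its estimate covers every source assigned to a column, can be sketched as follows. The class `SharedBitMatrix` and the helper `salted_hash` are hypothetical names; salted SHA-256 is an assumed stand-in for the two hash functions, and the linear-counting formula is again an assumption, since the text states only that the estimate is based on the number of bits set. The demonstration uses a single column purely to force two sources to share it.

```python
import hashlib
import math


def salted_hash(value, salt, modulus):
    """Derive independent-looking hash functions from SHA-256 by salting (assumption)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class SharedBitMatrix:
    """Bit matrix whose columns are bitmaps shared by all sources hashed to them."""

    def __init__(self, num_columns, bits_per_column):
        self.num_columns = num_columns
        self.bits_per_column = bits_per_column
        self.columns = [[0] * bits_per_column for _ in range(num_columns)]

    def column_of(self, source):
        return salted_hash(source, "column", self.num_columns)

    def store(self, source, destination):
        row = salted_hash(destination, "bit", self.bits_per_column)
        self.columns[self.column_of(source)][row] = 1

    def estimate(self, source):
        """Estimate for the whole column, not for the queried source alone."""
        column = self.columns[self.column_of(source)]
        zeros = self.bits_per_column - sum(column)
        if zeros == 0:
            return float("inf")
        return -self.bits_per_column * math.log(zeros / self.bits_per_column)


# A single column forces sources "A" and "B" to share the same bitmap.
matrix = SharedBitMatrix(num_columns=1, bits_per_column=1024)
for i in range(50):
    matrix.store("A", f"a-{i}")
for i in range(200):
    matrix.store("B", f"b-{i}")
print(round(matrix.estimate("A")), round(matrix.estimate("B")))
```

Both queries return the same value, which reflects roughly the 250 contacts stored in the column rather than the 50 contacts made by source A.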
The information stored for one source in a column is noise for the other sources that are assigned to the same column. This noise must be removed in order to estimate the spread correctly. To solve this problem, OSM assigns each source randomly to l (typically three) columns through l hash functions, and sets one bit in each column when storing a contact. A source will share each of its columns with a different set of other sources. Consequently, the noise (i.e., the bits set by other sources) in each column will be different. Based on these differences, OSM removes the noise and estimates the spread of the source.
However, OSM also has problems. Not only does it increase the overhead by performing l+1 hash operations, making l memory accesses and using l bits for storing each contact, but the noise can be too much to be removed in a compact memory space where a significant fraction of all bits (e.g., above 50%) are set. The columns that high-spread sources are assigned to have mostly ones; they are called dense columns, which present a high level of noise for other sources. The columns that only low-spread sources are assigned to are likely to have mostly zeros; they are called sparse columns. In OSM, each high-spread source will create l dense columns. In a tight space, dense columns account for a significant fraction of all columns. The probability for a low-spread source to be assigned to l dense columns is not negligible. Since these dense columns have many bits set at common positions, the difference-based noise removal will not work well, and the spread estimation will be inaccurate. The experimental results discussed below confirm this analysis.
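The per-contact storage cost attributed to OSM above (l+1 hash operations, l memory accesses, and up to l bits) can be illustrated with the following sketch of the storage step alone. It is not the OSM implementation. The names `MultiColumnStore` and `salted_hash` are hypothetical, salted SHA-256 is an assumed stand-in for the hash functions, and the use of one destination hash shared across all l columns is an assumption inferred from the l+1 count. The difference-based noise removal and the estimation step are not reproduced, because their details are not given here.

```python
import hashlib


def salted_hash(value, salt, modulus):
    """Derive independent-looking hash functions from SHA-256 by salting (assumption)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class MultiColumnStore:
    """Storage step only: each source is hashed to l columns of a bit matrix."""

    def __init__(self, num_columns, bits_per_column, l_columns=3):
        self.num_columns = num_columns
        self.bits_per_column = bits_per_column
        self.l_columns = l_columns  # "l" in the text, typically three
        self.columns = [[0] * bits_per_column for _ in range(num_columns)]
        self.hash_operations = 0
        self.memory_accesses = 0

    def columns_of(self, source):
        return [salted_hash(source, f"column-{k}", self.num_columns)
                for k in range(self.l_columns)]

    def store(self, source, destination):
        row = salted_hash(destination, "bit", self.bits_per_column)  # 1 hash
        for column in self.columns_of(source):                       # l hashes
            self.columns[column][row] = 1                            # 1 access each
            self.memory_accesses += 1
        self.hash_operations += self.l_columns + 1

    def fraction_set(self):
        """Fraction of all bits in the matrix that are set."""
        total = self.num_columns * self.bits_per_column
        return sum(sum(column) for column in self.columns) / total
```

Calling `fraction_set()` after loading a trace gives the overall bit density, which the discussion above identifies as the quantity that makes noise removal unreliable once it exceeds roughly one half.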
Also related is the detection of stealthy spreaders using online outdegree histograms as proposed by Gao et al. in “Detecting Stealthy Spreaders Using Online Outdegree Histograms,” (Proc. of IEEE International Workshop on Quality of Service '07, pp. 145-153, June 2007). This solution detects collaborative address scans performed by a large number of sources, each scanning at a low rate. It is able to estimate the number of participating sources and the average scanning rate, but it cannot estimate the spread of each individual source in the arriving packets.
Existing estimators divide a memory space into bitmaps and then allocate the bitmaps to sources. If per-source bitmaps are used, and each bitmap has a sufficient number of bits, then the total memory requirement will be too large. On the other hand, if bitmaps are shared between sources, it is hard to remove the noise caused by sources that are assigned to the same bitmap.
Accordingly, there is a need for a data structure, method, and system for spread estimation that provides accurate estimates while using a very small memory space. Spread estimation is highlighted here as an illustrative example of a network traffic measurement problem that may be solved with embodiments of the subject invention. The subject invention can also be applied to obtain, store, and analyze other network traffic data.