This disclosure relates generally to the field of traffic monitoring in a computing network, and more specifically to determining heavy distinct hitters in a data stream transmitted over the computing network.
Today's computer infrastructures are highly distributed systems where data traffic is generated at many different locations. Metering or monitoring the data traffic in such a network may be performed for such purposes as troubleshooting, planning and billing. To facilitate metering and monitoring, network routers collect flow information that may be analyzed by processing units. A processing unit may perform tasks such as flow information collection, filtering, analysis, or aggregation. Traffic metering and monitoring may also be performed for security reasons. Anomalies that may indicate security issues may be detected by monitoring a data stream. For example, a processing unit may discover a distributed denial of service (DDoS) attack by observing that a large number of different machines are sending data packets to a small number of destinations. Another network anomaly is a single machine sending data packets to a large number of different destinations, indicating that the single machine may have been compromised and is being used to disseminate a worm.
Security problems such as a DDoS attack or worm dissemination may be detected by determining heavy distinct hitters (HDH) in the data stream. If each packet in a data stream is considered as an element-value (e,v) pair, where each element is a destination and each value is a source address, then the attacked machines in the DDoS scenario are those elements for which the number of distinct values in the observed data stream is large. Alternately, if an element is defined as a source address and a value is defined as a destination address, then the elements with the largest number of distinct values may correspond to compromised machines that are distributing a worm. The elements that occur in the data stream together with a large number of distinct values are heavy distinct hitters. It is desirable to identify the heavy distinct hitters as efficiently as possible.
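As an illustration of the two views described above, the following minimal sketch turns packets into (e,v) pairs. The `Packet` type, its `src` and `dst` fields, and the `to_pairs` helper are hypothetical names introduced only for this example; they are not part of the disclosure.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet record; only the two addresses matter here."""
    src: str
    dst: str


def to_pairs(packets: Iterable[Packet], element_field: str = "dst") -> Iterator[Tuple[str, str]]:
    """Yield (element, value) pairs from packets.

    element_field="dst": element is the destination, value is the source
                         (DDoS view: many sources per destination).
    element_field="src": element is the source, value is the destination
                         (worm view: many destinations per source).
    """
    if element_field not in ("src", "dst"):
        raise ValueError("element_field must be 'src' or 'dst'")
    for p in packets:
        if element_field == "dst":
            yield (p.dst, p.src)
        else:
            yield (p.src, p.dst)


packets = [Packet("10.0.0.1", "10.0.0.9"), Packet("10.0.0.2", "10.0.0.9")]
ddos_view = list(to_pairs(packets, "dst"))
worm_view = list(to_pairs(packets, "src"))
```

In the DDoS view both packets share the element 10.0.0.9 with two distinct values, whereas in the worm view each source is an element with a single distinct value.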
The HDH problem may be approached by finding all elements that occur in the data stream paired with a number of distinct values that is greater than or equal to a particular threshold. The number of distinct values that occur together with an element may also be determined. However, finding the exact set of HDH elements and the exact number of distinct values paired with each of those elements requires a processing unit to store all distinct (e,v) pairs that are received in the data stream, and to check for each arriving (e,v) pair whether or not it has already been received. This may require a large amount of memory and processing power, especially at high traffic rates.
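The cost of the exact approach can be seen in a minimal sketch of it. The function name `exact_heavy_distinct_hitters` and the sample stream are assumptions made for illustration; the point is that the per-element sets together hold every distinct (e,v) pair, so memory grows with the number of distinct pairs.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def exact_heavy_distinct_hitters(
    pairs: Iterable[Tuple[Hashable, Hashable]], threshold: int
) -> Dict[Hashable, int]:
    """Return {element: distinct value count} for every element whose
    number of distinct values is greater than or equal to `threshold`."""
    # One set per element; together these sets store every distinct (e,v) pair.
    values_seen: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for element, value in pairs:
        # Adding to a set is the "has this pair already been received?" check.
        values_seen[element].add(value)
    return {e: len(vs) for e, vs in values_seen.items() if len(vs) >= threshold}


stream = [("x", "a"), ("x", "b"), ("x", "a"), ("y", "a"), ("x", "c")]
heavy = exact_heavy_distinct_hitters(stream, threshold=3)
```

Here the repeated pair ("x", "a") is counted once, so element "x" has three distinct values and is the only element reported at a threshold of 3.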
To lower memory and processing requirements, an HDH approximation may be determined instead. For example, two parameters epsilon and delta may be defined in the range (0,1), epsilon being the allowed relative error in the estimates, and delta being the failure probability. A threshold T may also be defined as the minimum number of distinct values required for an element to be considered a heavy distinct hitter. After processing a portion of a data stream, an output set of elements that are heavy distinct hitters (i.e., that occur with a number of distinct values that is at least threshold T) may be determined and, for the elements in the output set, the total number of distinct values that occurred with each element may be estimated. If an element is in the output set, then the true number of distinct values that occur with this element is at least (1-epsilon)T; if an element is not in the output set, then the true number of distinct values that occur with this element is lower than (1+epsilon)T; and the error in the estimated number of distinct values is at most epsilon*T. The approximation output must satisfy these conditions with probability at least 1-delta. Thus, the error is at most an epsilon fraction of the threshold T, and the whole process succeeds with probability at least 1-delta. Because the process is expected to succeed most of the time, delta may be set to a much smaller value than epsilon. Delta also has a smaller impact than epsilon on the space required to compute the approximate solution. An anomaly may also be indicated when one or a few elements occur with a larger number of distinct values than all other elements, or in other words, when a few elements each occur with more than a certain fraction of all distinct (e,v) pairs. For this situation, if d is the total number of distinct (e,v) pairs, the threshold T may be set to phi*d, where phi is another parameter in the range (0,1).
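The conditions on the approximate output can be written as a small checker. This is only a sketch under stated assumptions: the function name `satisfies_guarantee` and its arguments are hypothetical, the true counts are assumed to be known for testing purposes, and the error bound is checked only for elements that were actually reported with an estimate. The probabilistic part (success with probability at least 1-delta) concerns how often such a check would pass over repeated runs and is not modeled here.

```python
from typing import Dict, Hashable


def satisfies_guarantee(
    true_counts: Dict[Hashable, int],
    reported: Dict[Hashable, float],
    threshold: float,
    epsilon: float,
) -> bool:
    """Check one output against the epsilon conditions for threshold T.

    true_counts: exact number of distinct values per element (ground truth).
    reported:    output set, mapping each reported element to its estimate.
    """
    for element, estimate in reported.items():
        true = true_counts.get(element, 0)
        # Reported elements must truly have at least (1 - epsilon) * T values.
        if true < (1 - epsilon) * threshold:
            return False
        # The estimate may be off by at most epsilon * T.
        if abs(estimate - true) > epsilon * threshold:
            return False
    for element, true in true_counts.items():
        # Unreported elements must have fewer than (1 + epsilon) * T values.
        if element not in reported and true >= (1 + epsilon) * threshold:
            return False
    return True


truth = {"x": 95, "y": 50, "z": 120}
ok = satisfies_guarantee(truth, {"x": 100, "z": 115}, threshold=100, epsilon=0.1)
bad = satisfies_guarantee(truth, {"y": 100}, threshold=100, epsilon=0.1)
```

With T = 100 and epsilon = 0.1, reporting "x" is acceptable even though its true count of 95 is below T, because 95 is at least 90. Reporting only "y" is not acceptable, since its true count of 50 is below 90. For the relative variant, the same check could be called with `threshold` set to phi*d.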
However, approximating HDHs in a data stream with a low epsilon and a low delta, while keeping memory and processing requirements relatively low, presents challenges.