1. Field of the Invention
The present invention relates to data stream management systems and, more specifically, to a system and method of sampling data streams.
2. Introduction
Data stream management systems (DSMS) have found applications in network monitoring and financial monitoring in which large volumes of data require sophisticated processing in real time. Commercial examples include Gigascope for network monitoring, and Aleri Streaming Analytics, Gemfire Real-time Events, and Streambase for financial monitoring.
High-speed data streams can be bursty. For example, there are flash events on the network when legitimate traffic spikes sharply. During a Distributed Denial of Service (DDoS) attack, the load on a link can increase from 100,000 packets/sec to 500,000 packets/sec. Bursts in trading volume on individual securities are common, and can even occur across entire markets during financial panics. Two examples from the New York Stock Exchange are Oct. 19, 1987 and Oct. 28, 1997. Even if the DSMS is configured to handle a high-volume data stream under normal circumstances, during a burst period the DSMS might exhaust available resources such as CPU cycles, memory, and link capacity.
It is precisely during highly loaded instants, such as a DDoS attack, that the DSMS is most useful and that analysts rely on it to identify the attackers and protect the network. Similarly, it is during a financial spike or a period of market volatility that analysts rely on a DSMS to identify price trends and protect market positions. Therefore, it is critical to build DSMSs that degrade gracefully and still provide useful results in highly loaded instants. That is, DSMSs often have to target instantaneous, rather than average, data rates.
The widely accepted solution for handling overload conditions in a DSMS is load shedding. In particular, all published systems employ per-tuple sampling: uniform random sampling of tuples at different levels of the query hierarchy to reduce the load on processing nodes. A tuple is a finite sequence of objects, each of a specified type. However, for a large class of queries, uniform random sampling violates the query semantics and leads to meaningless or even incorrect output.
As an example, consider a query that computes flows from packet data, where a flow is a summary of the packets exchanged between a source and a destination during a period of time. The group-by attributes are the source and destination IP address, the source and destination port, and the protocol, while the aggregates include the number of packets, the number of bytes transferred, and so on. Consider one particular aggregate, namely the OR of the TCP flags in the packets that comprise the flow. This information is vital for distinguishing between regular flows and attack flows (attack flows do not follow proper TCP protocols).
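The flow query described above can be sketched in a few lines of Python. This is only an illustration, not the query language of any particular DSMS: the dictionary field names (src_ip, dst_ip, src_port, dst_port, proto, len, flags), the helper pkt, and the sample addresses are all hypothetical, and the time window of the flow is omitted for brevity. The flag constants are the standard TCP header bit values.

```python
from collections import defaultdict

# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10

def compute_flows(packets):
    """Group packets by the 5-tuple; aggregate packet count, bytes, and OR of flags."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "flags": 0})
    for p in packets:
        key = (p["src_ip"], p["dst_ip"], p["src_port"], p["dst_port"], p["proto"])
        agg = flows[key]
        agg["packets"] += 1
        agg["bytes"] += p["len"]
        agg["flags"] |= p["flags"]  # the loss-sensitive aggregate
    return dict(flows)

def pkt(src_port, length, flags):
    """Hypothetical helper: build a packet record between two fixed hosts."""
    return {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": src_port,
            "dst_port": 80, "proto": "tcp", "len": length, "flags": flags}

packets = [
    # A regular flow: handshake, data, and teardown flags all appear.
    pkt(40000, 60, SYN), pkt(40000, 52, ACK),
    pkt(40000, 500, PSH | ACK), pkt(40000, 52, FIN | ACK),
    # A suspicious flow: only SYN packets are ever seen.
    pkt(40001, 60, SYN), pkt(40001, 60, SYN),
]

flows = compute_flows(packets)
regular = flows[("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")]
suspicious = flows[("10.0.0.1", "10.0.0.2", 40001, 80, "tcp")]
```

In this sketch the regular flow ends up with the SYN, ACK, PSH, and FIN bits set, while the suspicious flow has only SYN set, which is the kind of distinction the OR aggregate is meant to expose.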
If one randomly drops packets, one cannot compute the aggregate on the flags properly, and therefore cannot distinguish between valid traffic and attack traffic. Thus, a natural stream query written by an analyst to detect attack traffic will result in incorrect output in existing data stream systems that drop tuples randomly without analyzing the query semantics.
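To make the failure concrete, the following sketch enumerates every keep/drop pattern that a per-tuple sampler could produce on a single small flow and counts how many of them preserve the OR of the TCP flags. The four-packet flow is a hypothetical example. At a sampling rate of 1/2, each pattern is equally likely, so the fraction computed here is the probability that the sampled aggregate is correct.

```python
from functools import reduce
from itertools import product
from operator import or_

# Standard TCP header flag bits.
FIN, SYN, ACK = 0x01, 0x02, 0x10

# Hypothetical flow: one flags value per packet.
flow_flags = [SYN, ACK, ACK, FIN | ACK]
true_or = reduce(or_, flow_flags, 0)

correct = 0
total = 0
# Each tuple in the product is one possible outcome of per-packet sampling.
for keep in product([False, True], repeat=len(flow_flags)):
    sampled = [f for f, k in zip(flow_flags, keep) if k]
    total += 1
    if reduce(or_, sampled, 0) == true_or:
        correct += 1

fraction_correct = correct / total
```

Only the patterns that retain both the SYN packet and the FIN packet reproduce the true aggregate, so most outcomes report a flag set that the flow never had on its own.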
In principle, there is a different sampling strategy that will work in the example above, namely, to drop all packets that belong to randomly chosen flows. For all flows that are not dropped, the query will correctly compute the OR aggregate of the TCP flags and the output will be correct, albeit a subset of the correct output.
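One common way to realize this strategy, shown here as a minimal sketch rather than as the method of any specific system, is to hash the group key and keep a flow only if its hash falls below a threshold. Every packet of a flow then shares the same fate, and no per-flow state is needed to remember the decision. The function names keep_group and or_flags_by_flow, the use of SHA-256, and the sample traffic are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

# Standard TCP header flag bits.
FIN, SYN, ACK = 0x01, 0x02, 0x10

def keep_group(key, rate):
    """Decide once per group: map the key to a 64-bit hash and compare to a threshold."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    return h < int(rate * 2**64)

def or_flags_by_flow(packets, rate=1.0):
    """OR the TCP flags per flow, dropping all packets of unsampled flows."""
    flags = defaultdict(int)
    for key, f in packets:
        if keep_group(key, rate):
            flags[key] |= f
    return dict(flags)

# Hypothetical traffic: 20 flows; odd ports send only SYN packets.
packets = []
for port in range(1000, 1020):
    key = ("10.0.0.1", "10.0.0.2", port, 80, "tcp")
    if port % 2:
        packets += [(key, SYN), (key, SYN)]
    else:
        packets += [(key, SYN), (key, ACK), (key, FIN | ACK)]

exact = or_flags_by_flow(packets)              # no load shedding
sampled = or_flags_by_flow(packets, rate=0.5)  # per-group sampling

# Every flow that survives sampling carries exactly the same aggregate as in
# the unsampled result; only the set of reported flows shrinks.
all_match = all(exact[k] == v for k, v in sampled.items())
```

Because the decision depends only on the group key, the sampled output is a subset of the exact output, which matches the behavior described above.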
This type of sampling is referred to as per-group sampling, where the random choice is over the groups (in this case, the group is defined by the attributes that comprise the flow, but in general, it may be any subset of attributes). Per-group sampling is known to be necessary for computing loss-sensitive aggregates such as OR, Min, Max, count of duplicates, and so on. Join queries are also sensitive to random sampling, so variants of group sampling have been proposed for approximate query systems based on samples of large data sets.
What is needed in the art is a principled mechanism for determining a suitable sampling strategy for any query in a general-purpose DSMS.