Processing a massive data set presented as a large data stream is challenging. The data set may be, for example, network traffic data or data streamed from external memory. The central difficulty is that the data set is too large to store in cache for analysis. Because of its size, processing requires fast update time, very limited storage, and possibly only a single pass over the data.
One problem that arises in computing statistics for certain types of data set analysis is the correlated aggregate query problem. Here, unlike in traditional data streams, the stream consists of two-dimensional data items (i, y), where i is an item identifier and y is a numerical attribute. A correlated aggregate query first applies a selection predicate along the y dimension and then performs an aggregation along the i dimension. An example of a correlated query is: “On a stream of IP packets, compute the k-th frequency moment of all the source IP address fields among packets whose length was more than 100 bytes”. Answering such a query may, for example, allow a network administrator to identify IP addresses using an excessive amount of network bandwidth.
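
To make the semantics of such a query concrete, the following is a minimal sketch of an exact, non-streaming baseline. It assumes the usual definition of the k-th frequency moment as the sum of f_i^k over the distinct identifiers i, where f_i is the number of selected items carrying identifier i. The function name `correlated_frequency_moment`, the `threshold` parameter, and the sample `packets` list are hypothetical and chosen only for illustration. Because it keeps one counter per distinct identifier, this baseline does not meet the limited-storage requirement described above; it only shows what answer an efficient method would need to produce or approximate.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def correlated_frequency_moment(
    stream: Iterable[Tuple[Hashable, float]], k: int, threshold: float
) -> int:
    """Exact k-th frequency moment of identifiers i over items (i, y) with y > threshold.

    Illustrative baseline only: memory grows with the number of distinct
    identifiers, so it is not suitable for a true streaming setting.
    """
    counts = Counter()
    for item_id, y in stream:
        # Selection predicate along the y dimension.
        if y > threshold:
            counts[item_id] += 1
    # Aggregation along the i dimension: sum of f_i^k over selected identifiers.
    return sum(f ** k for f in counts.values())


# Hypothetical sample stream of (source IP address, packet length in bytes).
packets = [
    ("10.0.0.1", 1500),
    ("10.0.0.1", 40),
    ("10.0.0.2", 120),
    ("10.0.0.1", 900),
    ("10.0.0.3", 64),
    ("10.0.0.2", 101),
]

# Second frequency moment of source addresses among packets longer than 100 bytes.
f2 = correlated_frequency_moment(packets, k=2, threshold=100)
```

In this sample, two source addresses each send two packets longer than 100 bytes, so the second frequency moment is 2^2 + 2^2 = 8. Changing `threshold` changes which packets are selected, which is why a single precomputed summary of the identifier dimension alone is not enough to answer the query.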