For many data analysis tasks, it is impractical to collect all the data at a single site and process it in a centralized manner. For example, data arrives at multiple network routers at extremely high rates, and queries are often posed on the union of the data observed at all the routers. Since the data set is changing, the query results may also change continuously over time. This has motivated the continuous, distributed, streaming model. In this model there are k physically distributed sites, each receiving a high-volume local stream of data. These sites communicate with a central coordinator, which must continuously respond to queries over the union of all streams observed so far. The challenge is to minimize the communication between the sites and the coordinator while providing an accurate answer to queries at the coordinator at all times.
A problem in this setting is to obtain a random sample drawn from the union of all distributed streams. This generalizes the classic reservoir sampling problem to the setting of multiple distributed streams, and has applications to approximate query answering, selectivity estimation, and query planning. For example, in the case of network routers, maintaining a random sample from the union of the streams is valuable for network monitoring tasks involving the detection of global properties. Other problems on distributed stream processing, including the estimation of the number of distinct elements and heavy hitters, use random sampling as a primitive.
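For reference, the classic single-stream reservoir sampling problem mentioned above can be solved along the following lines. This is a minimal sketch of the textbook approach, not code from the work discussed here; the function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of up to s items, without replacement,
    from a single stream, in one pass and with O(s) memory.

    The name and signature are illustrative assumptions.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < s:
            # The first s items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability s / (i + 1):
            # draw j uniformly from {0, ..., i} and keep the item if j < s.
            j = rng.randint(0, i)
            if j < s:
                reservoir[j] = item
    return reservoir
```

A single process sees every item here, so no communication is needed; the distributed setting is harder precisely because no one site observes the whole stream.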
The study of sampling in distributed streams was initiated by prior work. Consider a set of k different streams observed by the k sites, with the total number of current items in the union of all streams equal to n. Prior work has shown how k sites can maintain a random sample of s items without replacement from the union of their streams using an expected O((k + s) log n) messages between the sites and the central coordinator. The memory requirement of the central coordinator is s machine words, and the time requirement is O((k + s) log n). The memory requirement of each remote site is a single machine word, with constant time per stream update. Prior work has also proven that the expected number of messages sent in any scheme is Ω(k + s log(n/s)). Each message is assumed to be a single machine word, which can hold an integer of magnitude poly(kns).
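To make the model concrete, the following sketch simulates k sites and a coordinator that maintain a sample of s items without replacement while counting messages. It uses a simple random-weight threshold scheme chosen for illustration. It is an assumption, not the protocol of the prior work cited above, and no claim is made that it attains the stated bounds. All class and function names (`Coordinator`, `Site`, `simulate`) are hypothetical.

```python
import heapq
import random


class Coordinator:
    """Keeps the s items with the smallest random weights seen so far."""

    def __init__(self, s):
        self.s = s
        self.heap = []      # max-heap via negated weights: (-weight, item)
        self.messages = 0   # messages exchanged in both directions

    def threshold(self):
        # Until s items are held, every weight in [0, 1) is accepted.
        if len(self.heap) < self.s:
            return 1.0
        return -self.heap[0][0]  # largest weight currently in the sample

    def receive(self, weight, item):
        self.messages += 1  # site -> coordinator
        if len(self.heap) < self.s:
            heapq.heappush(self.heap, (-weight, item))
        elif weight < -self.heap[0][0]:
            heapq.heapreplace(self.heap, (-weight, item))
        self.messages += 1  # coordinator -> site (current threshold)
        return self.threshold()

    def sample(self):
        return [item for _, item in self.heap]


class Site:
    """Stores only a (possibly stale) copy of the coordinator's threshold."""

    def __init__(self, coordinator, rng):
        self.coordinator = coordinator
        self.rng = rng
        self.threshold = 1.0

    def observe(self, item):
        weight = self.rng.random()
        # A stale threshold is never below the true one, so no item that
        # belongs in the sample is withheld from the coordinator.
        if weight < self.threshold:
            self.threshold = self.coordinator.receive(weight, item)


def simulate(k, n, s, seed=0):
    """Feed n items to k sites chosen at random; return (sample, messages)."""
    rng = random.Random(seed)
    coordinator = Coordinator(s)
    sites = [Site(coordinator, rng) for _ in range(k)]
    for t in range(n):
        site_id = rng.randrange(k)
        sites[site_id].observe((site_id, t))
    return coordinator.sample(), coordinator.messages
```

Because every item receives an independent uniform weight, the s items of smallest weight form a uniform sample without replacement from the union of the streams. Calling `simulate(k, n, s)` for growing n gives a rough empirical view of how the message count of this particular sketch behaves.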