Uniform random sampling is known to be a useful and efficient way of generating consistent and unbiased estimates of an underlying population. In this sampling scheme, every possible sample of a given size has the same chance of being selected. Uniform random sampling has been heavily used in a wide range of application domains such as statistical analysis, computational geometry, graph optimization, knowledge discovery, approximate query processing, and data stream processing.
When the data subject to sampling arrive as a data stream (e.g., in stock price analysis or sensor network monitoring), sampling encounters two major challenges. First, the size of the stream is usually unknown a priori and, therefore, the sampling fraction (i.e., the sampling probability) cannot be determined before sampling starts. Second, in most cases the data arriving in a stream cannot be stored and, therefore, have to be processed sequentially in a single pass. A technique commonly used in this scenario is reservoir sampling, which selects a uniform random sample of a given size from an input stream of unknown size. Reservoir sampling has been used in many database applications including clustering, data warehousing, spatial data management, and approximate query processing.
Conventional reservoir sampling selects a uniform random sample of a fixed size, without replacement, from an input stream of unknown size (see Algorithm 1, below). Initially, the algorithm places every arriving tuple in the reservoir until the reservoir, which holds r tuples, becomes full. After that, the kth tuple (k > r) is sampled with probability r/k. A sampled tuple replaces a randomly selected tuple in the reservoir. It is easy to verify that the reservoir always holds a uniform random sample of the k tuples seen so far. Conventional reservoir sampling assumes a fixed-size reservoir (i.e., the size of the sample is fixed).
Algorithm 1: Conventional Reservoir Sampling
Inputs: r {reservoir size}
k = 0
for each tuple arriving from the input stream do
    k = k + 1
    if k ≤ r then
        add the tuple to the reservoir
    else
        sample the tuple with the probability r/k and replace a
        randomly selected tuple in the reservoir with the
        sampled tuple
    end if
end for
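Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `reservoir_sample`, the parameter `rng`, and the use of Python's standard `random` module are assumptions made for the example and do not appear in the algorithm as stated.

```python
import random


def reservoir_sample(stream, r, rng=None):
    """Return a uniform random sample of up to r items from an iterable.

    Hypothetical helper illustrating Algorithm 1; the stream length
    does not need to be known in advance.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    k = 0  # number of tuples seen so far
    for item in stream:
        k += 1
        if k <= r:
            # Fill phase: the first r tuples always enter the reservoir.
            reservoir.append(item)
        elif rng.random() < r / k:
            # The kth tuple is accepted with probability r/k and
            # overwrites a uniformly chosen slot in the reservoir.
            reservoir[rng.randrange(r)] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5))
```

If the stream holds fewer than r tuples, the sketch simply returns all of them. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.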
In addition to its usefulness for sampling data streams, uniform random sampling has been extensively used in the database community for evaluating queries approximately. Such approximate query evaluation may be necessary when system resources such as memory space or computation power are limited. Two types of queries have mainly been considered: 1) aggregation queries and 2) join queries. Of the two, join queries are far more challenging because uniform random sampling of the join inputs does not guarantee a uniform random sample of the join output.
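The difficulty with join queries can be seen on a toy case by exhaustive enumeration. The sketch below uses two hypothetical relations `R` and `S` of two tuples each, all sharing one join key. The relations, the sample sizes, and the helper names `join` and `join_of_samples` are assumptions chosen for illustration only. It joins every size-2 sample of `R` with every size-1 sample of `S` and checks which size-2 subsets of the full join output can be produced that way.

```python
from collections import Counter
from itertools import combinations


def join(left, right):
    """Equi-join on the first field of each tuple."""
    return frozenset((l, r) for l in left for r in right if l[0] == r[0])


def join_of_samples(R, S, m, n):
    """Join every size-m subset of R with every size-n subset of S
    and count how often each join result occurs."""
    return Counter(
        join(rs, ss)
        for rs in combinations(R, m)
        for ss in combinations(S, n)
    )


# Hypothetical toy relations: (join key, tuple id).
R = [(1, "r1"), (1, "r2")]
S = [(1, "s1"), (1, "s2")]

full = join(R, S)                        # 4 output tuples
outcomes = join_of_samples(R, S, m=2, n=1)

# Every size-2 subset of the join output should be equally likely
# under uniform sampling of the output.
possible = {frozenset(c) for c in combinations(full, 2)}
reachable = {s for s in outcomes if len(s) == 2}

print(len(possible), len(reachable))     # prints "6 2"
```

Only two of the six possible size-2 samples of the join output can ever be obtained, because both output tuples in a joined sample necessarily share the same `S` tuple. The joined samples are therefore not a uniform random sample of the join output, even though each input was sampled uniformly.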
In the context of data stream processing, others have addressed that challenge with a focus on streaming out (without retaining) a uniform random sample of the result of a sliding-window join query under limited memory. There are, however, many data stream applications for which such continuous streaming out is not practical. One example is applications that need a block of tuples (instead of a stream of tuples) to perform a statistical analysis such as computing a median or a variance. For these applications, there should be a way of retaining a uniform random sample of the join output stream.
Another example is applications that collect the results of join queries from wireless sensor networks using a mobile sink. Data collection applications have been extensively addressed in the research literature. In these applications, a mobile sink traverses the network and collects data from the sensors. Thus, each sensor needs to retain a uniform random sample of the join output, instead of streaming out the sample tuples toward the sink.
A natural way to keep a uniform random sample of the join output stream is to use reservoir sampling. However, keeping a reservoir sample over stream joins is not trivial because the memory available to streaming applications can be limited.