The present invention relates generally to the field of providing synopses for databases and, more specifically, to maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions.
One means for providing a synopsis of a database is to maintain a random sample of the data. Such samples may be used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration.
Uniform random sampling, in which all samples of the same size are equally likely, is a fundamental database sampling scheme. Uniform sampling is typically used in applications because most statistical estimators—as well as the formulas for confidence bounds for these estimators—assume an underlying uniform sample. Thus, sample uniformity is desirable if it is not known in advance how the sample will be used. Uniform sampling may also be used as a building block for more complex sampling schemes, such as stratified sampling. Methods for producing uniform samples are, therefore, important to modern database systems.
To provide a database synopsis, a uniform sample may be computed from a dataset that is stored on disk, such as a table in a relational database management system (RDBMS) or a repository of XML documents. Such a sample may be computed as it is needed (i.e., on the fly) or, alternatively, an initial sample may be incrementally maintained by updating the sample as the dataset changes. Because each access to the database may incur, for example, time or processing costs, incremental maintenance of a synopsis can have significant cost advantages, for example, by amortizing the cost of maintaining the sample over multiple uses of the sample. Challenges in sample maintenance are (1) to enforce statistical uniformity in the presence of arbitrary insertions into and deletions from the dataset, (2) to avoid accesses to the base data (the dataset) to the extent possible, because such accesses are typically expensive, and (3) to keep the sample size as stable as possible, avoiding samples that are oversized or undersized relative to the size of the dataset.
Datasets may be distinguished as either “stable” datasets whose size (but not necessarily composition) remains roughly constant over time or “growing” datasets in which insertions occur more frequently than deletions over the long run. The former type of dataset generally is typical of transactional database systems and databases of moving objects; the latter type of dataset generally is typical of data warehouses in which historical data accumulates.
For stable datasets, it is highly desirable from a systems point of view to ensure that the sample size stays below a specified upper bound, so that memory for the sample can be allocated initially, with no unexpected memory overruns occurring later on. Moreover, once memory has been allocated for the sample, the sample size should be kept as close to the upper bound as possible in order to maximize the statistical precision of applications that use the sample. In other words, it is desirable to use the allotted space efficiently.
For growing datasets, maintaining a bounded sample (i.e., a sample whose size stays below a fixed upper bound) generally is of limited practical interest. Over time, such a sample represents an increasingly small fraction of the dataset as the dataset grows. Although a diminishing sampling fraction may not be a problem for tasks such as estimating a population sum, many other tasks, such as estimating the number of distinct values of a specified population attribute, require the sampling fraction to be bounded from below. The goal for a growing dataset is therefore to grow the sample in a stable and efficient manner, while also guaranteeing an upper bound on the sample size at all times and using the allotted space efficiently.
A well-known method for incrementally maintaining a sample in the presence of a stream of insertions to the dataset is the classical “reservoir sampling” algorithm, a uniform scheme that maintains a simple random sample of a specified fixed size M. The reservoir sampling procedure initially includes the first M items in the sample. For each successive insertion into the dataset, reservoir sampling includes the inserted item in the sample with probability M/|R|, where |R| is the size of the dataset R just after the insertion; an included item replaces a randomly selected item in the sample.
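The reservoir sampling procedure described above can be sketched as follows. This is a minimal illustration only, assuming the insertions arrive as a Python iterable; the function name `reservoir_sample` and the optional `rng` parameter are hypothetical and are not part of any described system.

```python
import random


def reservoir_sample(stream, M, rng=None):
    """Maintain a uniform sample of at most M items over a stream of insertions."""
    rng = rng or random.Random()
    sample = []
    # size is |R|, the dataset size just after the current insertion.
    for size, item in enumerate(stream, start=1):
        if size <= M:
            # The first M items are always included.
            sample.append(item)
        elif rng.random() < M / size:
            # Include with probability M/|R|; the new item replaces
            # a randomly selected item already in the sample.
            sample[rng.randrange(M)] = item
    return sample
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct values drawn from 0 through 999, while a stream shorter than M is returned in full.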
It is known in the art to reduce the computational costs of reservoir sampling by devising a method to directly generate the (random) number of arriving items to skip between consecutive sample inclusions, thereby avoiding the need to “flip a coin” (e.g., generate an include/exclude decision using a pseudo-random number generator) for each item. One deficiency of the reservoir sampling method is that it cannot handle deletions, and the most obvious modifications for handling deletions yield procedures that either cause the sample size to shrink systematically to zero over time or require expensive base-data accesses, i.e., accesses to the dataset R. Another deficiency is that streams of insertions (and no deletions) to the dataset, for which reservoir sampling is designed, result in growing datasets as discussed above, so that the usefulness of the bounded reservoir sample tends to diminish over time.
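The skip-generation idea can be illustrated with a simple sketch. It draws one uniform random number per inclusion and searches sequentially for the skip length, using the fact that, after t items have been seen, the probability of skipping more than s further items is the product of (t+i−M)/(t+i) for i = 1, . . . , s+1. The names `draw_skip` and `reservoir_sample_with_skips` are hypothetical, the items are assumed to be held in an indexable sequence, and faster skip generators than this sequential search are known.

```python
import random


def draw_skip(t, M, rng):
    """Number of arriving items to skip after t items have been seen (t >= M)."""
    u = rng.random()
    skip = 0
    t += 1
    prob_more_skipped = (t - M) / t  # P(skip count > 0)
    # Inverse-transform search: smallest s with P(skip count > s) <= u.
    while prob_more_skipped > u:
        skip += 1
        t += 1
        prob_more_skipped *= (t - M) / t
    return skip


def reservoir_sample_with_skips(items, M, rng=None):
    """Reservoir sampling that jumps directly to the next included item."""
    rng = rng or random.Random()
    sample = list(items[:M])
    t = len(sample)  # number of items seen so far
    if t < M:
        return sample
    while True:
        t += draw_skip(t, M, rng) + 1  # 1-based position of next inclusion
        if t > len(items):
            break
        sample[rng.randrange(M)] = items[t - 1]
    return sample
```

In this sketch, the number of random draws is proportional to the number of sample inclusions rather than to the number of arriving items.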
Another well-known method for incrementally maintaining a sample in the presence of a stream of insertions to the dataset is the Bernoulli sampling scheme with sampling rate q, denoted BERN(q). Using BERN(q), each inserted item is included in the sample with probability q and excluded with probability 1−q, independent of the other items. For a dataset R, the sample size |S| follows the binomial distribution BINOM(|R|, q), so that the probability that the size of the sample S is k, for k = 0, 1, . . . , |R|, may be calculated as
$$P\{|S| = k\} = \binom{|R|}{k}\, q^{k} (1-q)^{|R|-k}.$$
Although the sample size is random, samples having the same size are equally likely, so that the BERN(q) scheme is indeed uniform as described above. However, Bernoulli sampling may exhibit uncontrollable variability of the sample size. Indeed, the sample can be as large as |R|, so there is no effective upper bound on sample size.
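The BERN(q) scheme and its binomial sample-size distribution can be sketched as follows. This is an illustration only; the function names `bernoulli_sample` and `sample_size_pmf` and the optional `rng` parameter are hypothetical.

```python
import random
from math import comb


def bernoulli_sample(stream, q, rng=None):
    """Include each inserted item independently with probability q."""
    rng = rng or random.Random()
    return [item for item in stream if rng.random() < q]


def sample_size_pmf(n, q, k):
    """P{|S| = k} for a dataset of size n under BERN(q): the BINOM(n, q) pmf."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)
```

Because `sample_size_pmf(n, q, n)` equals q raised to the power n, which is positive whenever q > 0, a sample as large as the whole dataset always has nonzero probability, which illustrates the lack of an effective upper bound on sample size.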