1. Field of Invention
The present invention relates generally to the field of sampling. More specifically, the present invention is related to systems and methods to create uniform samples.
2. Discussion of Prior Art
Random sampling has been recognized as an invaluable tool for efficiently analyzing large data sets. A uniform random sample of values can be used for deriving quick approximate answers to analytical queries, for auditing data, and for exploring data interactively, in a setting of large-scale data repositories and warehouses. A uniform random sample from an input population is a subset of population elements, chosen in such a way that the chance of any of the possible samples of the same size being chosen is equal. Sampling has also received attention as a useful tool for data integration tasks such as automated metadata discovery.
One approach to exploiting random sampling is to sample data on an as-needed, ad hoc basis. This approach can work well within a single database management system, but can be difficult to implement in more complex warehousing and information integration contexts. FIG. 1 describes a prior art system to perform sampling in a warehousing scenario. Full-scale data warehouse 102 stores a large amount of data which is sampled as a whole by a sampler 104. Sampler 104 runs a sampling algorithm to sample the data. The sampled data is then stored in a sample data warehouse 106. This sampling architecture does not support scalable and flexible sampling, because sampler 104 samples all the data in the full-scale data warehouse in response to any query for a portion of that data. For example, if only the data stored for the month of January is desired, all the data stored in the full-scale warehouse is nevertheless sampled in order to extract the portion of the data corresponding to the month of January. Thus, a flexible and scalable sampling infrastructure is desired.
A sampling scheme may be defined by a probability mass function P(·; D) on subsets of a population D={1, 2, . . . , |D|} of distinct data elements. For a subset S ⊂ D, the quantity P(S; D) is the probability that the sampling scheme, when applied to D, produces the sample S. A sampling scheme is uniform if, for any population D, the associated probability function P satisfies P(S; D)=P(S′; D) whenever S, S′ ⊂ D with |S|=|S′|. That is, all samples of equal size are equally likely.
Bernoulli sampling, simple random sampling (reservoir sampling), and concise sampling are examples of sampling schemes known in the art. A Bernoulli sampling scheme Bern(q) with sampling rate q ∈ [0, 1] includes each population data element in the sample with probability q and excludes the element with probability 1−q, independently of the other data elements. The associated probability function P is given by P(S; D) = q^{|S|} (1−q)^{|D|−|S|} (wherein D is an input population and S is a sample). Because this probability depends on S only through its size |S|, Bernoulli sampling is uniform. An advantage of Bernoulli sampling is that collecting samples is simple and computationally inexpensive and merging Bernoulli samples is a relatively straightforward process. A disadvantage of Bernoulli sampling is that the size of the sample is random, and hence cannot be controlled. The size of a Bern(q) sample from a population of size N is binomially distributed with parameters N and q, so that the standard deviation of the sample size is √(Nq(1−q)). Hence the variability of the sample size grows without bound as the population size increases. Not knowing the size of the population in advance makes the selection of an appropriate sampling rate difficult. Too small a sampling rate in anticipation of a large population may yield an unacceptably small sample if the population is smaller than expected. Conversely, too large a sampling rate may result in a sample that exceeds memory bound targets.
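The Bernoulli scheme described above can be illustrated with a minimal Python sketch. The function names (bernoulli_sample, sample_size_std, bernoulli_probability) and the optional rng parameter are assumptions introduced here for illustration only; they do not appear in the cited art.

```python
import math
import random


def bernoulli_sample(population, q, rng=None):
    """Include each element independently with probability q (illustrative helper)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so q = 0 keeps nothing and q = 1 keeps everything.
    return [x for x in population if rng.random() < q]


def sample_size_std(n, q):
    """Standard deviation of the Binomial(n, q) sample size: sqrt(n q (1 - q))."""
    return math.sqrt(n * q * (1.0 - q))


def bernoulli_probability(sample_size, population_size, q):
    """P(S; D) = q^|S| (1 - q)^(|D| - |S|); depends on S only through |S|."""
    return q ** sample_size * (1.0 - q) ** (population_size - sample_size)
```

For instance, with N = 1,000,000 and q = 0.01 the expected sample size is 10,000 and sample_size_std gives a standard deviation of roughly 99.5 elements, which shows why the footprint of a Bernoulli sample cannot be fixed in advance.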
Simple random sampling (without replacement) with sample size k ≥ 1 is defined as the unique uniform sampling scheme that produces a sample of the specified size:
\[
P(S; D) =
\begin{cases}
1 \big/ \binom{|D|}{k} & \text{if } |S| = k; \\
0 & \text{otherwise.}
\end{cases}
\]
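The definition of simple random sampling above may be checked on a small population. The following Python sketch assumes a hypothetical five-element population and k = 2; the helper name srs_probability is an illustrative assumption.

```python
from itertools import combinations
from math import comb


def srs_probability(sample, population_size, k):
    """P(S; D) for simple random sampling without replacement (illustrative)."""
    return 1.0 / comb(population_size, k) if len(sample) == k else 0.0


population = [1, 2, 3, 4, 5]  # hypothetical population D
k = 2                         # hypothetical sample size

# Enumerate every subset of size k; each receives probability 1 / C(|D|, k).
all_samples = list(combinations(population, k))
total = sum(srs_probability(s, len(population), k) for s in all_samples)
```

Here there are C(5, 2) = 10 possible samples of size 2, each assigned probability 1/10, so the probabilities sum to one, while any subset of a different size receives probability zero.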
Reservoir sampling is an algorithm for obtaining a simple random sample based on a single sequential scan of data. The idea behind reservoir sampling is to maintain the invariant that the current reservoir constitutes a simple random sample of all data elements seen so far. Thus, the first k scanned data elements are inserted into the reservoir, so that the invariant property holds trivially. When the nth data element is scanned (n>k), this element is included in the sample with probability k/n, replacing a randomly and uniformly selected victim, and not included in the sample with probability 1−(k/n). The article entitled “Random sampling with a reservoir” by Vitter describes generating random skips between successive inclusions using acceptance-rejection techniques to speed up the basic reservoir algorithm. An advantage of reservoir sampling is that the sample footprint is bounded a priori. A disadvantage of the reservoir sampling algorithm is that it provides no means of merging reservoir samples. The terms “reservoir sample” and “simple random sample” are used interchangeably throughout the application.
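The basic reservoir algorithm described above can be sketched in Python as follows. This is a minimal illustration of the basic scheme only; it does not implement the skip-based speed-up of Vitter, and the function name reservoir_sample and the optional rng parameter are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Single-pass simple random sample of size k (basic reservoir scheme)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k elements fill the reservoir; the invariant holds trivially.
            reservoir.append(item)
        elif rng.random() < k / n:
            # Include the nth element with probability k/n,
            # replacing a victim chosen uniformly at random.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

If the stream contains fewer than k elements, this sketch simply returns all of them. In every case the reservoir never holds more than k elements, which reflects the a priori bound on the sample footprint.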
Concise sampling, as described in the article titled “New sampling-based summary statistics for improving approximate query answers” by Gibbons et al., provides a sampling method having both an a priori bounded footprint and, unlike basic Bernoulli and reservoir sampling, a compact representation of a sample. A sample is stored in a compact, bounded histogram representation, i.e., as a set of pairs (v_i, n_i) whose footprint does not exceed F bytes, where v_i is the ith distinct data element value in the sample and n_i is the number of data elements in the sample that have value v_i. An advantage of concise sampling is that if the parent population contains few enough distinct values so that the sample footprint never exceeds F during processing, then the concise sample contains complete statistical information about the entire population in the form of an exact histogram. However, a disadvantage of concise sampling is that the samples produced are not uniform. Because concise sampling is biased toward samples with fewer distinct values, data-element values that appear infrequently in the population will be underrepresented in a sample.
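The compact histogram representation described above, i.e., a set of (value, count) pairs, can be illustrated with a short Python sketch. This shows only the representation; it is not the concise sampling algorithm of Gibbons et al., and it does not enforce the F-byte footprint bound. The helper names to_compact, from_compact, and sample_size are assumptions introduced for illustration.

```python
from collections import Counter


def to_compact(sample_values):
    """Collapse a bag of sampled values into sorted (value, count) pairs."""
    return sorted(Counter(sample_values).items())


def from_compact(pairs):
    """Expand (value, count) pairs back into the bag of values."""
    return [value for value, count in pairs for _ in range(count)]


def sample_size(pairs):
    """Number of data elements represented by the compact form."""
    return sum(count for _, count in pairs)
```

For example, the bag ["a", "b", "a", "a"] is stored as the two pairs ("a", 3) and ("b", 1) rather than as four separate elements, so the savings grow with the number of repeated values.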
A common drawback of the prior art sampling schemes is that the samples created using these schemes do not have all of the following properties: compactness (packing a complete distribution into the sample's memory if possible), uniformity, a bounded footprint (an upper bound on memory usage), and flexibility in combining/merging.
The following references provide a general teaching of random sampling schemes in database systems; however, none of these references teaches the creation of uniform samples which are compact, have a bounded footprint, and can be flexibly combined/merged.
A U.S. patent assigned to Lucent Technologies Inc. (U.S. Pat. No. 6,012,064) discusses the maintenance of a single random sample of a ‘set of tuples’ in a relation, such that the sample is kept up-to-date in the presence of updates to the relation. However, warehousing multiple samples and merging such samples is not disclosed. Also, maintaining concise storage during sampling is not discussed.
A U.S. patent assigned to NCR Corporation (U.S. Pat. No. 6,564,221 B1) proposes the use of stratified sampling to sample a database in parallel; however, it does not discuss warehousing (i.e., merging) issues. Also, maintaining concise storage during sampling is not discussed.
Another U.S. patent assigned to NCR Corporation (U.S. Pat. No. 6,889,221 B1) proposes a specific mechanism for parallel sampling, involving careful management of seeds for pseudorandom number generators. Again, warehousing multiple samples and merging such samples is not disclosed. Also, maintaining concise storage during sampling is not discussed.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.