1. Field of the Invention
This invention generally relates to the field of data sampling and more specifically to data sampling of relational databases using Bernoulli sampling.
2. Description of Related Art
The vast amount of data in modern warehouses and on the web poses a major challenge for analytical, mining, and statistical applications. Many data analysis algorithms simply do not scale to the hundreds of terabytes of data often found in modern repositories. At the same time, users are demanding that decision support systems be increasingly fast, flexible, and responsive. This pressure by users arises both from the ever-increasing pace of e-business and from the development of applications that support real-time interaction with data, such as spreadsheets and On-Line Analytical Processing (OLAP) tools. Although increases in CPU and disk speeds are helpful in dealing with massive data, hardware improvements alone do not suffice. Indeed, there is evidence that computer systems are getting slower in that the volume of online data is growing at a rate faster than Moore's law.
In light of this situation, it has become increasingly apparent that sampling techniques are a crucial component of a modern database management system (DBMS). These techniques permit the computation of approximate query results—which often suffice in practice—in a fraction of the time required to compute an exact answer.
Recognizing the importance of sampling, the International Organization for Standardization (ISO) has been developing a standard for extending the Structured Query Language (SQL) query language to handle sampling in user queries. Under the proposed ISO sampling standard, each table reference in an SQL query may be followed by a sampling clause.
The current sampling methods supported by the standard are “row-level Bernoulli sampling” and a vendor-defined sampling method. For row-level Bernoulli sampling with sampling rate Q ∈ (0, 1], each row is included in the sample with probability Q and excluded with probability 1-Q, independently of the other rows. In current commercial systems, the vendor-defined sampling method is typically implemented as a page-level Bernoulli sampling scheme. That is, for each page, all of the rows on the page are included in the sample with probability Q and all of the rows are excluded with probability 1-Q, independently of the other pages. Bernoulli schemes are particularly well suited to database sampling because (1) Bernoulli sampling is easy to parallelize and (2) in the case of row-level sampling, the sampling operator can be treated essentially as just another predicate by the query optimizer and hence can be costed, pushed down to the lower levels of the query execution tree, and so forth.
This new database sampling functionality, while potentially quite effective for dealing with massive datasets, is not always easy to exploit. In particular, choosing the appropriate sampling method for a particular query-processing task can be decidedly nontrivial. Row-level sampling yields precise results but is often very expensive to execute, whereas page-level sampling is much faster, but yields results that are much less precise. Current commercial DBMS products do not provide guidance on which sampling method to choose for a specified problem. If the user chooses row-level Bernoulli sampling, then chances are that the query will take too long to complete. If, on the other hand, the user chooses page-level Bernoulli sampling, then the query will execute quickly but the result may well be too imprecise to be useful—in this case, the user will need to re-execute the query using a higher sampling rate, in a painful process of trial and error. Therefore a need exists to overcome the problems with the prior art as discussed above, and particularly for a way to automatically and more efficiently sample data from large relational databases.