In recent years, decision support applications such as On-Line Analytical Processing (OLAP) and data mining tools for analyzing large databases have become popular. A common characteristic of these applications is that they require execution of queries involving aggregation on large databases, which can often be expensive and resource intensive. Therefore, the ability to obtain approximate answers to such queries accurately and efficiently can greatly benefit these applications. One approach used to address this problem is to answer the queries from precomputed samples of the data instead of the complete data. While this approach can give approximate answers very efficiently, it can be shown that identifying an appropriate precomputed sample that avoids large errors on any arbitrary query is virtually impossible, particularly when queries involve selections, GROUP BY and join operations. To minimize the effects of this problem, previous studies have proposed using the workload to guide the process of selecting samples. The goal is to pick a sample that is tuned to the given workload and thereby ensure acceptable error at least for queries in the workload.
Previous methods of identifying an appropriate precomputed sample suffer from three drawbacks. First, the proposed solutions use ad-hoc schemes for picking samples from the data, thereby resulting in degraded quality of answers. Second, they do not attempt to formally deal with uncertainty in the expected workload, i.e., when incoming queries are similar but not identical to the given workload. Third, previous methods ignore the variance in the data distribution of the aggregated column(s).
One type of method for selecting a sample is based on weighted sampling of the database. Each record t in the relation R to be sampled is tagged with a frequency f_t corresponding to the number of queries in the workload that select that record. Once the tagging is done, an expected number of k records are selected in the sample, where the probability of selecting a record t (with frequency f_t) is k * (f_t / Σ_u f_u), where the denominator is the sum of the frequencies of all records u in R. Thus, records that are accessed more frequently have a greater chance of being included in the sample. In the case of a workload that references disjoint partitions of records in R, with a few queries that reference large partitions and many queries that reference small partitions, most of the sampled records will come from the large partitions. Therefore, there is a high probability that no records will be selected from the small partitions, and the relative error in using the sample to answer most of the queries will be large.
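The weakness of this weighted scheme can be illustrated with a minimal Python sketch. It is not taken from any of the prior methods: the function names, the cap of each probability at 1, the assumption that records are selected independently of one another, and the example workload (one query over a 10,000-record partition and ten queries over 10-record partitions, with k = 100) are all hypothetical choices made for illustration.

```python
from collections import Counter


def tag_frequencies(workload):
    """Count, for each record id, how many workload queries select it (f_t)."""
    freq = Counter()
    for selected_ids in workload:
        freq.update(selected_ids)
    return freq


def inclusion_probabilities(freq, k):
    """Return p_t = k * f_t / sum_u f_u, capped at 1 (the cap is an assumption)."""
    total = sum(freq.values())
    if total == 0:
        raise ValueError("the workload selects no records")
    return {t: min(1.0, k * f / total) for t, f in freq.items()}


def expected_records(partition, probs):
    """Expected number of sampled records that fall inside one partition."""
    return sum(probs[t] for t in partition)


def miss_probability(partition, probs):
    """Probability that no record of the partition is sampled,
    assuming each record is selected independently."""
    miss = 1.0
    for t in partition:
        miss *= 1.0 - probs[t]
    return miss


# Hypothetical workload: each query is given as the record ids it selects.
large = range(0, 10_000)
small = [range(10_000 + 10 * i, 10_010 + 10 * i) for i in range(10)]
workload = [large] + small

probs = inclusion_probabilities(tag_frequencies(workload), k=100)
print("expected records from the large partition:", expected_records(large, probs))
print("expected records from one small partition:", expected_records(small[0], probs))
print("chance a small partition is missed:", miss_probability(small[0], probs))
```

In this made-up setting roughly 99 of the 100 expected records come from the large partition, and each small partition is missed entirely with probability of about 0.9, even though ten of the eleven workload queries reference a small partition.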
Another sampling technique that attempts to address the problem of internal variance of data in an aggregate column focuses on special treatment for “outliers,” records that contribute to high variance in the aggregate column. Outliers are collected in a separate index, while the remaining data is sampled using a weighted sampling technique. Queries are answered by running them against both the outlier index and the weighted sample. A sampling technique called “Congress” tries to simultaneously satisfy a set of GROUP BY queries. This approach, while attempting to reduce error, does not minimize any well-known error metric.
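The outlier-based idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the technique as published: the rule for choosing outliers (the records farthest from the mean), the use of uniform rather than weighted sampling for the remaining records, and the names split_outliers and estimate_sum are all hypothetical.

```python
import random


def split_outliers(values, num_outliers):
    """Separate the num_outliers values farthest from the mean from the rest.
    The selection rule is an assumption made for this sketch."""
    mean = sum(values) / len(values)
    by_deviation = sorted(range(len(values)),
                          key=lambda i: abs(values[i] - mean),
                          reverse=True)
    outlier_ids = set(by_deviation[:num_outliers])
    outliers = [v for i, v in enumerate(values) if i in outlier_ids]
    rest = [v for i, v in enumerate(values) if i not in outlier_ids]
    return outliers, rest


def estimate_sum(outliers, rest, sample_size, rng):
    """Estimate SUM as the exact outlier total plus a scaled-up sample total.
    Uniform sampling stands in here for the weighted sampling in the text."""
    exact_part = sum(outliers)
    if not rest or sample_size <= 0:
        return float(exact_part)
    sample = rng.sample(rest, min(sample_size, len(rest)))
    scale = len(rest) / len(sample)
    return exact_part + scale * sum(sample)


# Hypothetical aggregate column: mostly small values plus two very large ones.
values = [10] * 98 + [5000, 7000]
outliers, rest = split_outliers(values, num_outliers=2)
estimate = estimate_sum(outliers, rest, sample_size=10, rng=random.Random(0))
print("outliers:", outliers)
print("estimated SUM:", estimate, "exact SUM:", sum(values))
```

Because the two large values are summed exactly rather than sampled, the variance of the estimate comes only from the remaining, more homogeneous records.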