Estimating the number of distinct values for some target attribute in a data set is a well-studied problem. The statistics literature refers to this as the problem of estimating the number of species or classes in a population. Estimates of the number of distinct values in a column are commonly used in query optimizers to select good query plans. In addition, histograms within the query optimizer commonly store the number of distinct values in each bucket to improve their estimation accuracy. Distinct-values estimates are also useful for network monitoring devices, in order to estimate the number of distinct destination Internet Protocol (IP) addresses, source-destination pairs, requested Uniform Resource Locators (URLs), etc.
Estimating the number of distinct values in a data set is a special case of the more general problem of approximate query answering of distinct value queries, i.e., “count distinct” queries. Approximate query answering is becoming an indispensable means for providing fast response times to decision support queries over large data warehouses. Fast, approximate answers are often provided from small synopses of the data, such as samples, histograms, wavelet decompositions, etc. Commercial data warehouses are approaching 100 terabytes, and new decision support arenas, such as click stream analysis and IP traffic analysis, only increase the demand for high-speed query processing over terabytes of data. Thus, it is crucial to provide highly accurate approximate answers to an increasingly rich set of queries.
Distinct value queries are an important class of decision support queries, and good-quality estimates for such queries may be returned to users as part of an online aggregation system or an approximate query answering system. Because the answers are returned to the users, the estimates must be highly accurate (such as being within 10% or better with 95% confidence) and supported by error guarantees. Unfortunately, none of the previous work in approximate query processing provides fast, provably good estimates for common distinct value queries.
In addition, users increasingly require that systems providing estimates for distinct value queries also accommodate distinct value queries that have predicates. Predicates allow users to filter or target the distinct value queries to obtain the specific estimates they need in order to operate more effectively. However, the estimates for distinct value queries having predicates must also be fast and highly accurate. This can be difficult given the terabytes of data to which the queries are applied.
Accordingly, what is needed in the art is a system for distinct sampling that can accommodate distinct value queries having predicates and that overcomes the deficiencies of the prior art.