The present invention relates generally to querying electronic data and, more particularly, to an approximate querying method for querying databases.
Approximate query processing is a term used to identify the techniques and methods for providing estimated answers to database queries. Such a database querying system is used to improve the query response performance by reducing the time required for the system to respond to queries.
Approximate querying systems provide fast responses by running queries on some form of summary statistics of the database, such as samples or histograms. Additionally, the approximate answers are often supplemented with a statistical error bound to indicate the quality of the approximation to the user (i.e., the end-user analyzing the data in the warehouse). Since the statistics are typically much smaller than the database itself, the query is processed very quickly. The statistics may be generated on the fly after the query is posed, or may be precomputed, as in the Approximate QUery Answering (AQUA) system disclosed in U.S. patent application Ser. No. 09/480,261, entitled Join Synopsis-Based Approximate Query Answering to Acharya et al., and U.S. patent application Ser. No. 09/081,660, entitled System and Techniques for Fast Approximate Query Answering to Acharya et al., both of which are herein incorporated by reference.
A common sampling technique for summarizing data involves taking uniform random samples of the original data. Uniform random samples, in which every item in the original data set has the same probability of being sampled, are used because they mirror the original data distribution. Due to the usefulness of uniform samples, commercial database management systems (DBMSs), such as Oracle 8i, already support operators to collect uniform random samples.
While uniform random samples provide highly-accurate answers for many classes of queries, there are important classes of queries for which they are less effective. These include queries where data is segmented into groups and aggregate information is derived for these groups. This is typically done in SQL using the group by operation, referred to herein as “group-by queries.” For example, a group-by query on the U.S. census database containing information about every individual in the nation could be used to determine the per capita income per state. Often, there can be a huge discrepancy in the sizes of different groups, e.g., California has nearly 70 times the population of Wyoming. As a result, a uniform random sample of the relation will contain many fewer tuples (i.e., rows of information in a database) from the smaller groups (states), which leads to poor accuracy for answers on those groups because accuracy is highly dependent on the number of sample tuples that belong to that group. This behavior often renders the answer essentially useless to the analyst, who is interested in reliable answers for all groups. For example, a marketing analyst using the Census database to identify all states with per capita incomes above some value will not find the answer useful if the aggregates for some of the states are highly erroneous.
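To make the accuracy argument concrete, the following Python sketch computes how many tuples a uniform sample is expected to draw from each group. The group sizes are illustrative assumptions that merely keep the 70-to-1 ratio mentioned above (they are not census figures), and the function names are hypothetical. The 1/sqrt(n) value is only a rough rule of thumb for how the standard error of a group average scales with the number of sample tuples in that group.

```python
import math


def expected_sample_sizes(group_sizes, sample_size):
    """Expected number of sample tuples per group under uniform sampling.

    Every tuple is equally likely to be chosen, so each group receives a
    share of the sample proportional to its share of the relation.
    """
    total = sum(group_sizes.values())
    return {g: n * sample_size / total for g, n in group_sizes.items()}


def relative_error_scale(n):
    """Rule-of-thumb error scale for an average estimated from n tuples."""
    return 1 / math.sqrt(n)


# Hypothetical group sizes with a 70:1 ratio (assumption for illustration).
groups = {"large_state": 70_000, "small_state": 1_000}
expected = expected_sample_sizes(groups, sample_size=710)

for name, n in expected.items():
    print(name, n, round(relative_error_scale(n), 3))
```

With these assumed sizes, a 1% uniform sample is expected to hold 700 tuples from the large group but only 10 from the small one, so the error scale for the small group is several times larger.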
One approach to the accuracy problem is online aggregation, described by J. M. Hellerstein et al., “Online Aggregation,” Proc. ACM SIGMOD International Conf. on Management of Data, pp. 171-182, May 1997. The Online Aggregation approach employs an index striding technique to sample smaller groups at a higher rate. This approach requires significant modifications to the program code of the DBMS and slows down the query response time.
Another approach to the accuracy problem is statistical database querying, described in U.S. Pat. No. 5,878,426 to Plasek et al., entitled Statistical Database Query Using Random Sampling of Records. In this approach, individual samples are selected from each group, where the sample size of each group varies in order to achieve a desired accuracy for each group. The resulting samples are targeted for a specific partitioning into groups and a specific database attribute, and using them for other partitions or attributes will typically lead to inaccurate answers. Moreover, a completely new sample is needed whenever new records are inserted into the database.
The inability of uniform random samples to provide accurate group-by results is a symptom of a more general problem with uniform random samples: they are most appropriate when the utility of the data mirrors the data distribution. Thus, when the utility of a subset of the data is significantly higher relative to its size, the accuracy of the answer may not meet the user's expectation. The group-by query is one such case where a smaller group is often as important to the user as the larger groups, even though it is under-represented in the data. A multi-table query is another example: a small subset of the data in a table may dominate the query result if it joins with many tuples in other tables. The flip side of this scenario is where different logical parts of the data have equal representation, but their utility to the user is skewed. This occurs, for example, in most data warehouses where the usefulness of data degrades with time. For example, consider a business warehouse application analyzing the transactional data in the warehouse to evaluate a market for a new line of products. In this case, data from the previous year is far more important than outdated data from a decade ago. Moreover, the user is likely to ask finer-grained queries over the more recent data. This, in turn, means that it is advantageous for the approximate answering system to collect more samples from the recent data, which is not achieved with a uniform random sample over the entire warehouse.
Accordingly, approximate querying methods which address the above limitations are desirable.
The present invention relates to providing fast, highly-accurate approximate answers for a broad range of queries, including group-by queries. In the present invention, non-uniform (i.e., biased) samples of the data to be queried are pre-computed. In particular, a database sample combining uniform and biased samples is created. The database sample is then queried using approximate querying techniques. Given a fixed amount of space, the database sample maximizes an objective function for the accuracy of all possible queries. In addition, by pre-computing the database sample, tremendous advantages in speed are achieved over non-sampling methods and methods which generate samples after the query is formulated.
The database sample is created by grouping tuples within a database according to grouping attributes, determining how many tuples are needed to represent each group, and selecting the tuples from a corresponding group to create the database sample. The database sample is then queried to obtain a statistically unbiased answer. The database sample may be created and maintained without a priori knowledge of the data distribution within the database or the queries to be performed.
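The three steps above (grouping, deciding how many tuples represent each group, and selecting them) can be sketched in Python as follows. This is a minimal illustration, not the allocation strategy of the invention: it assumes an equal number of tuples per group, and the names build_group_sample, estimate_group_sums and per_group are hypothetical. Each group keeps a scale factor (group size divided by sample size) so that a sum computed on the sample can be scaled back up to an unbiased estimate for the full group.

```python
import random
from collections import defaultdict


def build_group_sample(rows, group_key, per_group, seed=0):
    """Group rows, then draw up to per_group rows uniformly from each group.

    Equal allocation across groups is an assumption made for this sketch.
    Returns {group: (sampled_rows, scale_factor)}.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)

    sample = {}
    for g, members in groups.items():
        k = min(per_group, len(members))
        chosen = rng.sample(members, k)
        # Each sampled row stands for len(members) / k rows of its group.
        sample[g] = (chosen, len(members) / k)
    return sample


def estimate_group_sums(sample, value):
    """Scale each group's sample sum by its scale factor."""
    return {
        g: scale * sum(value(r) for r in chosen)
        for g, (chosen, scale) in sample.items()
    }


# Hypothetical relation: (state, income) tuples with very uneven group sizes.
rows = [("big", 2)] * 1000 + [("small", 5)] * 10
sample = build_group_sample(rows, group_key=lambda r: r[0], per_group=5)
print(estimate_group_sums(sample, value=lambda r: r[1]))
```

Because the small group receives as many sample tuples as the large one, its estimate is no longer starved of data; the scale factors compensate for the deliberately non-uniform sampling rates.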
The present invention takes group-sizes into consideration to generate the database sample. By considering group-sizes, a compact database sample can be created which yields fast and highly-accurate answers to queries containing group-by operations on underlying data with varying group-sizes.
A single-pass algorithm for constructing the database sample is disclosed. In addition, an algorithm is disclosed for incrementally maintaining the sample with up-to-date information without accessing the base relation.
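As a rough illustration of how a sample can be built in one pass and then kept current as tuples arrive, the Python sketch below keeps one fixed-size reservoir per group using standard reservoir sampling. This is a stand-in technique chosen for illustration, not the algorithm disclosed herein; the class name GroupReservoirs and the fixed per-group capacity are assumptions.

```python
import random


class GroupReservoirs:
    """One uniform reservoir sample per group, updated one tuple at a time."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity      # assumed fixed per-group sample size
        self.rng = random.Random(seed)
        self.seen = {}                # group -> number of tuples seen so far
        self.reservoirs = {}          # group -> list of sampled tuples

    def insert(self, group, row):
        """Process one tuple; the base relation is never re-read."""
        n = self.seen.get(group, 0) + 1
        self.seen[group] = n
        reservoir = self.reservoirs.setdefault(group, [])
        if len(reservoir) < self.capacity:
            reservoir.append(row)
        else:
            # Keep the new tuple with probability capacity / n by
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(n)
            if j < self.capacity:
                reservoir[j] = row

    def scale_factor(self, group):
        """Number of base tuples represented by each sampled tuple."""
        return self.seen[group] / len(self.reservoirs[group])


sampler = GroupReservoirs(capacity=5)
for i in range(1000):
    sampler.insert("big", i)
for i in range(3):
    sampler.insert("small", i)
print(len(sampler.reservoirs["big"]), sampler.scale_factor("big"))
```

The same insert call serves both the initial scan and later insertions, which is why no second pass over the base relation is needed in this sketch.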