1. Technical Field
The present invention relates to the field of database systems. More particularly, the present invention relates to a method for generating approximate answers in a large data warehousing environment in response to complex aggregate queries.
2. Description of the Related Art
Traditional query processing has focused solely on providing exact answers to queries in a manner seeking to minimize response time and maximize throughput. Providing an exact answer to a complex query in a large data recording and warehousing environment, however, can take minutes or even hours due to the amount of computation and disk I/O that may be required.
There are a number of scenarios in which a rapidly-obtained approximate answer is desired instead of an exact answer. For example, initial queries during a drill-down query sequence in ad-hoc data mining are used for determining interesting queries. See, for example, J. M. Hellerstein et al., Online aggregation, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 171-182, May 1997, which is incorporated by reference herein. An approximate answer can provide feedback regarding how well a query is posed. An approximate answer can also provide a tentative answer to a query when the base data of a database is unavailable. An approximate answer can be used in a query requesting a numerical answer for which the full precision of the exact answer is not needed, e.g., a total, an average or a percentage for which only the first few digits of precision are of interest. Fast approximate answers can be used in a more traditional role within a query optimizer for estimating plan costs because a fast response time is required without an exact answer.
An approximate query answering system has two key requirements: (1) an accurate estimate of the exact answer, and (2) tight bounds on the confidence of the estimate. To illustrate the obstacles in achieving the requirements of an approximate query answering system, consider a set of uniform random samples of each base relation (referred to herein as base samples) in a database. Intuitively, such a set of base samples is a natural set of synopses (samples) for an approximate query engine. Both the accuracy of the estimate and the spread of confidence bounds strongly depend on the sample size used for generating the set of uniform random samples. Unless certain statistical properties can be guaranteed for the sample, the corresponding bounds are usually particularly pessimistic.
A necessary, but insufficient condition for the join of the base samples to be a uniform random sample of a join on base relations is for the probability for any two joined tuples to be in the join of the base samples to equal the probability for the (same) two tuples to be in a sample of the join on the base relations. Generally, though, the join of two uniform random base samples is not a uniform random sample of the join on the base relations. Consequently, a straightforward sampling approach can produce a poor quality estimation when aggregates on the tuples in multi-way joins are approximated based on a statistical guarantee of the multi-way join and the join output size. In most cases, the non-uniformity introduced by the join significantly degrades the accuracy of the answer and the confidence bounds. Moreover, the join of two uniform random samples typically has very few tuples, even when the join of the base relations has many tuples, thereby leading to both inaccurate answers and poor confidence bounds because the quality of the estimate critically depends on the number of tuples on which the estimate is based.
To show that a multi-way join can produce a poor quality approximation, consider a simple 2-way (equality) join of base samples of two relations R and S on an attribute X shown in FIG. 1. The letters a and b in FIG. 1 denote tuples with values a and b for an attribute X. Each segment (denoted as an edge herein) between an a-tuple or a b-tuple on the left-half of FIG. 1 (R.X) and the a-tuple or b-tuple in the right-half (S.X) depict a tuple that is in the join of R and S. Assume that the probability for each tuple to be in a base sample is 1/r, for a given r greater than 1. As shown in FIG. 1, edges a1 and a2 are in the join if and only if both a-tuples are selected from R and the one a-tuple is selected from S. Such a sample selection occurs with probability 1/r3 because three tuples must be selected.
On the other hand, edges a1 and b1 are in the join if and only if the four tuples incident to these particular edges are selected. Such a sample selection occurs with a probability of 1/r4 because four tuples must be selected. This contrasts with the fact that for a uniform random sample of the join of R and S, the probability that both edges a1 and a2 are selected equals the probability that both edges a1 and b1 are selected. Thus, generally, for any pair of relations joining on an attribute X, any X value that occurs in each relation and occurs more than once in at least one of the relations introduces a bias such that the join of the base samples is not a uniform random sample of the output of the join.
Now consider a second problem that is related to small output sizes. Consider two relations A and B, and base samples comprising 1% of each relation. The size of a foreign key join between relations A and B is equal to the size of relation A. The expected size of the join of the base samples is, however, 0.01% of the size of relation A because for each tuple in A, there is only one tuple in relation B that joins with it, and the probability that the particular tuple in B is in the sample for B is only 1%. Generally, for a k-way foreign key join and k base samples each comprising 1/r of the tuples in their respective base relations, the expected size of the join of the base samples is 1/rk of the size of the actual join. The best known confidence interval bounds for approximate join aggregates based on base samples are quite pessimistic, as discussed by P. J. Haas, Large-sample and deterministic confidence intervals for online aggregation, Proc. 9th International Conf. on Scientific and Statistical Database Management, August 1997, and which is incorporated by reference herein. It is generally impossible to produce good quality approximate answers using samples on the base relations alone. Nevertheless, it is critical to overcome this problem because nearly all queries in a warehousing context involve complex queries have a large number of (foreign-key) joins.
Recently, there has been a flurry of work in approximate query answering. See, for example, S. V. Vrbsky et al., Approximatexe2x80x94a query processor that produces monotonically improving approximate answers, IEEE Trans. on Knowledge and Data Engineering, 5(6): 1056-1068, 1993; D. Barbarxc3xa1 et al., The New Jersey data reduction report, Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997; P. B. Gibbons et al., Aqua project white paper, Technical report, Bell Laboratories, Murray Hill, N.J., December 1997; P. B. Gibbons et al., Fast incremental maintenance of approximate histograms, Proc. 23rd International Conf. on Very Large Data Bases, pp. 466-475, August 1997; J. M. Hellerstein et al., supra; and P. B. Gibbons et al., New sampling-based summary statistics for improving approximate query answers, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 331-342, Jun. 1998, each of which is incorporated by reference herein. Only the work of Hellerstein et al., supra, has considered the problem of approximate join aggregates, which is an important problem because most non-trivial queries, especially for data warehousing schemas, involve joining two or more tables. For example, 13 of the 17 queries of the TPC-D benchmark involve queries on joins.
Statistical techniques for providing approximate query answering have been applied in databases for more than two decades, but primarily within the context of a query optimizer for selectivity estimation. See, for example, P. G. Selinger et al., Access path selection in a relational database management system, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 23-34, June 1979, which is incorporated by reference herein. Nevertheless, application of statistical techniques to approximate query answering has started receiving attention only recently.
Regarding approximate query answering, Hellerstein et al., supra, disclose a framework for approximate answers of aggregation queries, referred to as online aggregation, in which base data is scanned in a random order at query time and an approximate answer is continuously updated as the scan proceeds. An option is provided for gradually obtaining the fully exact answer that is not affected by database updates. This approach is relatively costly because the original data is accessed at query time. Additionally, the problems with join queries as discussed by Hellerstein et al. apply to online aggregation. That is, a large fraction of the data must be processed before errors become tolerable. Other known systems support limited online aggregation features, such as the Red Brick system, which supports running COUNT, AVG, and SUM. Nevertheless, the resulting accuracy for the Red Brick system can be quite poor because the scan order used for producing the aggregations is not random.
There have been several recent works on xe2x80x9cfast-firstxe2x80x9d query processing having a goal of quickly providing a few tuples of the exact query answer, but the focus has not been on obtaining statistically representative approximate answers. See, for example, R. J. Bayardo et al., Processing queries for first-few answers, Proc. 5th International Conf. on Information and Knowledge Management, pp. 45-52, November 1996; M. J. Carey et al., Reducing the braking distance of an SQL query engine, Proc. 24th International Conf. on Very Large Data Bases, pp. 158-169, August 1998; and G. Antoshenkov et al., Query processing and optimization in Oracle Rdb, VLDB Journal, 5(4):229-237, 1996, each of which is incorporated by reference herein.
In the APPROXIMATE query processor developed by S. V. Vrbsky et al., supra, an approximate answer to a set-valued query is any superset of the exact answer that is a subset of the Cartesian product. The query processor uses various class hierarchies for iteratively fetching blocks that are relevant to the answer, thereby producing tuples that are certain to be in the answer, while narrowing the possible classes that contain the answer. This work is quite different from a statistical approach, such as that taken by Hellerstein et al.
An approximate priority queue, for example, returns a fast, approximate min in response to an extract-min query. Approximate data structures for providing faster (approximate) answers to data structure queries, such as extract-min queries, are proposed and analyzed by Y. Matias et al., Dynamic generation of discrete random variables, Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, pp. 361-370, January 1993; Y. Matias et al., Approximate data structures with applications, Proc. 5th ACM-SIAM Symp. on Discrete Algorithms, pp. 187-194, January 1994; and Y. Matias et al., Performance evaluation of approximate priorities queues, presented at DIMACS Fifth Implementation Challenge: Priority Queues, Dictionaries, and Point Sets, organized by D. S. Johnson et al., October 1996, each of which is incorporated by reference herein.
There has been a considerable amount of work developing statistical techniques for solving selectivity estimation and, more recently, for data reduction in large data warehouse environments. The three major classes of statistical techniques are sampling, histograms and parametric modeling techniques.
Regarding statistical sampling techniques, see, for example, W.-C. Hou et al., Statistical estimators for relational algebra expressions, Proc. 7th ACM Symp. on Principles of Database Systems, pp. 276-287, March 1988; R. J. Lipton, Practical selectivity estimation through adaptive sampling, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 1-12, May 1990; P. J. Haas et al., On the relative cost of sampling for join selectivity estimation, Proc. 13th ACM Symp. on Principles of Database Systems, pp. 14-24, May 1994; R. J. Lipton et al., Query size estimation by adaptive sampling, J. Computer and System Sciences, 51(1):18-25, 1995; P. J. Haas et al., Sampling-based estimation of the number of distinct values of an attribute, Proc 21st International Conf. on Very Large Data Bases, pp. 311-322, September 1995; and S. Ganguly et al., Bifocal sampling for skew-resistant join size estimation, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 271-281, June 1996, each of which is incorporated by reference herein.
For statistical techniques involving histograms, see, for example, P. R. Kooi, The Optimization of Queries in Relational Databases, Ph.D thesis, Case Western Reserve University, September 1980; V. Poosala et al., Improved histograms for selectivity estimation of range predicates, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 294-305, June 1996; and V. Poosala, Histogram-based estimation techniques in databases, Ph.D thesis, University of Wisconsin-Madison, 1997, each of which is incorporated by reference herein.
Regarding statistical parametric modeling techniques, see, for example, C. M. Chen, Adaptive selectivity estimation using query feedback, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 161-172, May. 1994, which is incorporated by reference herein.
A survey of various statistical techniques is provided by Barbaxc3xa1 et al., supra. A framework for studying synopsis data structures for massive data sets is provided by Gibbons et al., Synopsis data structures for massive data sets, DIMAS: Series in Discrete Mathematics and Theoretical Computer Science, AMS., 1998, which is incorporated by reference herein. Two sampling-based synopses, concise samples and counting samples, which can be used for obtaining larger samples for the same space and for improving approximate query answers for hot list queries are disclosed by Gibbons et al., New sampling-based summary statistics for improving approximate query answers, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 331-342, June 1998, and which is incorporated by reference herein.
Maintenance algorithms also exist for samples and histograms. For maintenance algorithms for samples, see, for example, F. Olken et al., Maintenance of materialized view of sampling queries, Proc. 8th IEEE International Conf. on Data Engineering, pp. 632-641, February 1992; and P. B. Gibbons et al., New sampling-based summary statistics for improved approximate query answers, Proc. ACM SIGMOD International Conf. on Management of Data, pp. 331-341, June 1998, each of which is incorporated by reference herein. For maintenance algorithms for samples and histograms, see, for example, P. B. Gibbons et al., Fast incremental maintenance of approximate histograms, Proc. 23rd International Conf on Very Large Data Bases, pp. 466-475, August 1997, and which is incorporated by reference herein. These maintenance techniques, however, are applicable only to statistics on base relations.
In view of the foregoing, what is needed is a technique that rapidly generates accurate approximate answers in a data warehouse environment in response to complex aggregate queries based on statistical summaries of the full data of a database. What is also needed is a technique for generating high-confidence error bounds for the approximate answers, and an efficient maintenance technique for maintaining the statistical summaries in the presence of updates to the database.
The present invention provides a technique that rapidly generates accurate approximate answers in a data warehouse environment in response to complex aggregate queries based on statistical summaries of the full data of a database. The present invention also provides a technique for generating high-confidence error bounds for the approximate answers, and an efficient maintenance technique for maintaining the statistical summaries in the presence of updates to the database.
The advantages of the present invention are provided by a method for generating an approximate answer to a query in a database environment in which the database has a plurality of base relations. According to the invention, a query relating to a database, such as a complex and/or aggregate query, is received, and an approximate answer to the query is generated such that the approximate answer is based on at least one join synopsis formed from the database. The method of the present invention further includes steps of forming a sample-tuple set for at least one selected base relation of a plurality of base relations of a database such that each sample-tuple set contains at least one sample tuple from a corresponding base relation, and forming a join synopsis set for each selected base relation such that each join synopsis set contains at least one join synopsis for each sample tuple in a sample-tuple set. A join synopsis of a sample tuple is based on a join of the sample tuple and all descendent relations of the sample tuple. A relation R is a descendent of a relation S if there is a sequence of relations starting with R and ending with S such that there is a foreign key join possible between each consecutive pair in the sequence. All join synopsis sets form a statistical summary of the database and are stored.
The characteristics of a query workload can be assumed so that storage space is allocated among each respective synopsis by one of dividing an allotted storage space equally between join synopsis sets, dividing an allocated storage space between the join synopsis sets in proportion to a cube root of a join synopsis tuple size of each respective join synopsis set, and dividing an allocated storage space between the join synopsis sets in proportion to a join synopsis tuple size of each respective join synopsis set.
A way for reducing the amount of storage space needed is by eliminating redundant columns of each join synopsis, and storing only selected columns of each join synopsis. Further, the tuples of each join synopsis set can be renormalized into constituent relations of the join synopsis sets, and duplicative tuples with each constituent relation can be removed.
The method of the present invention also provides the steps of determining a fraction of queries in a query set for which each relation of the plurality of base relations of the database is one of a source relation in a join or a sole relation in a query without joins, and dividing an allotted storage space among join synopsis sets to minimize an average relative error over the queries based on a high-level characterization of the query set. A relation is a source relation in a join if it is not a descendant relation of any other relation in the query.
A confidence bound can be generated for the approximate answer that is based on one of a Hoeffding bound, a Chebychev (conservative) bound, a Chebychev (estimated "sgr") bound and a Central Limit Theorem bound. Alternatively, the confidence bound can be generated by partitioning the join synopsis sets into a predetermined number of subsets, and generating an estimator for each subset. Preferably, the subsets are the same size. In one aspect of the present invention, an average of the estimators for the subsets is determined, and the confidence bound is generated based on the average of the estimators. In another aspect of the present invention, a median of the estimators for the subsets is determined, and the confidence bound is generated based on the median of the estimators.
The present invention also includes steps of adding a new tuple to the sample-tuple set for a selected base relation with a probability p when the new tuple is inserted into the selected base relation, such that probability p is related to a ratio of a number of tuples in the sample-tuple set to a number of tuples in the selected base relation, and forming a join synopsis corresponding to the new tuple when the new tuple is added to the sample-tuple set for the selected base relation, such that the join synopsis for the new tuple being based on a foreign key join of the new tuple and all descendent relations of the new tuple. When a target size for the sample-tuple set is exceeded, a tuple of the sample-tuple set is uniformly selected at random for the selected base relation. The selected tuple is removed from the sample-tuple set for a selected base relation and the join synopsis corresponding to the removed tuple is removed. A tuple is removed from the sample-tuple set for a sample base relation when the tuple is removed from the selected base relation and is contained in the sample-tuple set for the selected base relation. Additionally, the join synopsis corresponding to the removed tuple is removed. When a target minimum size for a sample tuple is equaled, the sample-tuple set is repopulated by rescanning the selected base relation, and a join synopsis is formed for each tuple selected by rescanning the selected base relation.