1. Field of Invention
The present invention relates generally to the field of query optimization in database access. More specifically, the present invention is related to methods for query optimization using query feedback.
2. Discussion of Prior Art
Query optimization relies heavily on accurate cardinality estimates for predicates involving multiple attributes. In the presence of only unidimensional statistics, selectivity estimates are made by assuming independence (see “Access Path Selection in a Relational Database Management system”, Selinger, P. et al., Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 23-24, June 1979). However, this may lead to significant estimation errors, since database columns are often correlated (see “Selectivity Estimation Without the Attribute Value Independence Assumption”, Poosala, V. & Ioannidis, Y., Proceedings of the International Conference on Very Large Data Bases, August 1997). Although multi-dimensional histograms have been proposed to address this problem, the traditional method of building these histograms through a data scan (henceforth called the proactive method) suffers from several disadvantages. Firstly, since it requires data access, contention for the database increases. Also, such a histogram needs to be periodically rebuilt in order to incorporate database updates, thus exacerbating the overhead of this method. Further, the proactive method does not scale well to large tables. With an increasing number of dimensions, exponentially more histogram buckets are needed to capture the data distribution with the same precision. When only a limited number of histogram buckets can be stored, it becomes important to allocate buckets by taking into account workload information, i.e. to allocate more buckets for the portion of data that is more frequently queried. However, the proactive method is purely data driven, and hence cannot exploit any workload information.
To overcome these problems, an alternative method of building histograms is known using query feedback (henceforth called the reactive method). See for example, “Self-Tuning Histograms: Building Histograms Without Looking at Data”, Aboulnaga, A. & Chaudhuri, S., Proceedings of the ACM SIGMOD International Conference on Management of Data, pp181-192, 1999; “STHoles: A Multidimensional Workload-Aware Histogram”, Bruno, N., Chaudhuri, S. and Gravano, L., Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 211-212, 2001; “Adaptive Selectivity Estimation Using Query Feedback”, Chen, C. & Roussopoulos, N., Proceedings of the ACM SIGMOD International Conference on Management of Data, pp161-172, 1994; “A Self-Adaptive Histogram Set for Dynamically Changing Workloads”, Lim, L., Wang, M. & Vitter, J., Proceedings of the International Conference on Very Large Data Bases, pp 369-380, September 2003.
Consider a query having a predicate make=‘Honda’, and suppose that the execution engine finds at runtime that 80 tuples from the Car table satisfy this predicate. Such a piece of information about the observed cardinality of a predicate is called a query feedback record (QF). As the DBMS executes queries, QFs can be collected with relatively little overhead and used to build and progressively refine a histogram over time. For example, the above QF may be used to refine a histogram on make by creating a bucket for ‘Honda’, and setting its count to 80.
The reactive approach remedies some problems with the proactive approach. Firstly, it reduces contention for the database, since it requires no database access. Moreover, a reactively built histogram requires no periodic rebuilding since updates are automatically incorporated by new QFs. A reactively built histogram can also be workload-aware, i.e., since more QFs are available for the portion of the data that is repeatedly being queried, the histogram can be made more accurate for this portion.
One possible disadvantage of a reactive approach is that it can only gather information for the portion of the data that has already been queried, as opposed to the proactive approach that gathers information about all data. However, this is not likely to be a serious problem in practice, due to the locality in query patterns (so that data not heavily queried in the past is not likely to be heavily queried in the future either). The pros and cons of a reactive approach versus a proactive approach are summarized in Table 1.
TABLE 1Proactive (by data scan)Reactive (by query feedback)Contends for databaseDoes not contend for databaseNeeds periodic rebuildingNo periodic rebuilding necessaryNot workload-awareWorkload-awareCovers all dataOnly covers queried data
Previous proposals for histogram construction using query feedback have lacked either accuracy or efficiency. Some proposals, e.g., STGrid, use heuristics to refine the histogram based on a new QF, thereby leading to inaccuracies in the constructed histogram (see “Self-Tuning Histograms: Building Histograms Without Looking at Data”, Aboulnaga, A. & Chaudhuri, S., Proceedings of the ACM SIGMOD International Conference on Management of Data, pp181-192, 1999). Other proposals, e.g., STHoles (referred to previously), require extremely detailed feedback from the query execution engine that can be very expensive to gather at runtime.
The reactive approach to histogram maintenance entails a number of interesting challenges:                1. Maintaining consistency: To ensure histogram accuracy, the histogram distribution must, at every time point, be consistent with all currently valid QFs and not incorporate any ad hoc assumptions; Previous proposals for feedback-driven histogram construction have lacked this crucial consistency property, even when the data is static. See for example STGrid or SASH.        2. Dealing with inconsistent QFs: In the presence of updates, deletes, and inserts, some of the QFs collected in the past may no longer be valid. Such old, invalid QFs must be efficiently identified and discarded, and their effect on the histogram must be undone.        3. Meeting a limited space budget: Database systems usually limit the size of a histogram to a few disk pages in order to ensure efficiency when the histogram is read at the time of query optimization. Thus, it is assumed that there is a limited space budget for histogram storage. In general, adding more QFs to the histogram while maintaining consistency leads to an increase in the histogram size. To keep the histogram size within the space budget, the relatively “important” QFs (those that refine parts of the histogram that are not refined by other QFs) must be identified and retained, and the less important QFs discarded.        
Existing work on multidimensional statistics can be broadly classified as addressing either the problem of deciding which statistics to build or that of actually building them. This invention addresses only the latter problem. Many types of statistics have been proposed, e.g., histograms and wavelet-based synopses; of which only histograms are of interest here.
For building multidimensional histograms, proactive approaches that involve a data scan have been proposed, e.g., MHist (see “Selectivity Estimation Without the Attribute Value Independence Assumption”, Poosala, V. & Ioannidis, Y., Proceedings of the International Conference on Very Large Data Bases, 1997), GenHist (see “Approximating Multi-Dimensional Agregate Range queries” Gunopoulos, D., Kollios, G. & Tsotras, V., SIGMOD, 2000), and others. As mentioned before, data scans may not effectively focus system resources on the user's workload and do not scale well to large tables. In principle, histograms can be constructed faster using a page-level sample of the data (see “Random Sampling for Histogram Construction: How Much is Enough?”, Chaudhuri, S., Motwani, R. & Narasayya, V., Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998), but large sample sizes—and correspondingly high sampling costs—can be required to achieve sufficient accuracy when data values are clustered on pages and/or highly skewed.
The idea of using query feedback to collect the statistics needed for estimating cardinality is known in the art. Early approaches relied on fitting a combination of model functions to the data distribution; with an ad hoc choice of functions that can lead to poor estimates when the data distribution is irregular. Another approach used query feedback only to compute adjustment factors to cardinality estimates for specific predicates, and not to build histograms. STGrid (referred to previously) and SASH (see “A Self-Adaptive Histogram Set for Dynamically Changing Workloads”, Lim, L., Wang, M. & Vitter, J., Proceedings of the International Conference on Very Large Data Bases, pp 369-380, September 2003) both use query feedback to build histograms, but they often have low accuracy because their heuristic methods for adding new QFs to the histogram do not maintain consistency.
STHoles is another approach that uses query feedback to build histograms. The histogram structure of STHoles is considered to be superior to other bucketing schemes such as MHist, and for this reason is used in the present invention with some modifications. Unfortunately, the original STHoles maintenance algorithm requires, for each query and each histogram bucket, the computation of the number of rows in the intersection of the query and bucket regions. These detailed row counts, which are used to decide when and where to split and merge buckets, are usually not obtainable from the original query predicates alone. The query engine must therefore insert artificial predicates that specify the (possibly recursive) bucket boundaries. As the number of histograms and the number of buckets per histogram grows, the overhead of evaluating this “artificial feedback” becomes so high as to make the STHoles maintenance approach impractical. (The present invention, in contrast, needs only the actual feedback that naturally occurs during query execution—namely, the number of rows processed at each step during the query plan—which can be monitored with low overhead.) Finally, STHoles, unlike the present invention, does not provide principled methods for addressing issues of inconsistent feedback and limits on the available memory for storing the histogram.
The principle of maximum entropy is known, but has only been used for the significantly different problem of consistently estimating the selectivity of conjuncts of predicates such as sel(p1^p2^p3^p4), given partial selectivities such as sel(p1), sel(p2^p3), sel(p2^p3^p4), and so forth, i.e., the methods permitting the exploitation of existing multidimensional statistics (not necessarily from histograms), whereas the current invention is concerned with the collection of a specific type of statistic.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.