1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to the optimization of queries using incremental estimates of cardinality for derived relations when statistically correlated predicates are applied.
2. Description of Related Art
Computer systems incorporating Relational DataBase Management System (RDBMS) software using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
A query optimizer function in the RDBMS is responsible for translating SQL statements into an efficient query execution plan (QEP). The QEP dictates the methods and sequence used for accessing tables, the methods used to join these tables, the placement of sorts, where predicates are applied, and so on. The QEP is interpreted by the RDBMS when the query is subsequently executed.
There may be a large number of feasible QEPs, even for a simple query. The optimizer determines the best of these alternatives by modeling the execution characteristics of each one and choosing the QEP that best meets some optimization goal, such as minimizing response time or use of system resources. See, e.g., P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, “Access Path Selection in a Relational Database Management System”, Procs. 1979 ACM SIGMOD Conf. (May 1979), pp. 23-34, incorporated by reference herein, (hereinafter referred to as [Selinger 79]).
The optimizer may choose to minimize some estimated cost metric, such as resource consumption or elapsed time. Whatever the metric, the most important factor in accurately computing any cost used during optimization is a cardinality estimate. The pioneering work in estimating the cardinality of a plan in an incremental fashion was described in [Selinger 79]. However, this work assumed that the predicates were mutually independent and that values were distributed uniformly.
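The effect of the independence and uniformity assumptions can be illustrated with a small calculation. The following is a minimal sketch, not taken from [Selinger 79]; the table size, the column names, and the distinct-value counts are hypothetical and chosen only to show how multiplying selectivities can under-estimate a result when one predicate implies another.

```python
from typing import Iterable


def uniform_selectivity(distinct_values: int) -> float:
    """Selectivity of `column = constant` if every value is equally frequent."""
    return 1.0 / distinct_values


def independent_cardinality(table_rows: int, selectivities: Iterable[float]) -> float:
    """Row estimate when predicates are treated as independent:
    the table size multiplied by every predicate's selectivity."""
    estimate = float(table_rows)
    for selectivity in selectivities:
        estimate *= selectivity
    return estimate


# Hypothetical statistics (not from the source): 1,000,000 rows,
# 50 distinct makes, 500 distinct models, each model built by one make.
ROWS = 1_000_000
sel_make = uniform_selectivity(50)
sel_model = uniform_selectivity(500)

# Independence assumption: both selectivities are multiplied.
assumed = independent_cardinality(ROWS, [sel_make, sel_model])

# Because the model determines the make, the make predicate filters nothing
# further; only the model predicate's selectivity should apply.
correlated = independent_cardinality(ROWS, [sel_model])

print(f"independence estimate: {assumed:.0f} rows")
print(f"estimate honoring the correlation: {correlated:.0f} rows")
```

Under these assumed statistics the independence estimate is roughly 40 rows, whereas an estimate that honors the correlation is roughly 2,000 rows, a 50-fold difference that would propagate into every cost computed from it.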
U.S. Pat. No. 4,956,774, issued September 1990 to Akira Shibamiya and R. Zimowski, entitled “Data base optimizer using most frequency values statistics”, incorporated by reference herein, (hereinafter referred to as [Shibamiya 90]), kept frequency statistics to drop the assumption of uniformity, but did not deal with the independence assumption.
U.S. Pat. No. 5,469,568, issued Nov. 21, 1995, to K. Bernhard Schiefer and Arun Swami, entitled “Method for choosing largest selectivities among eligible predicates of join equivalence classes for query optimization”, incorporated by reference herein, (hereinafter referred to as [Schiefer 95]), derived a technique for computing cardinalities of joins only when the join (i.e., multi-table) predicates were completely redundant, i.e., implied by other predicates given by the user, but did not deal with local (i.e., single-table) predicates or with predicates whose correlation is somewhere between completely redundant and completely independent.
Rafiul Ahad, K. V. Bapa Rao, and Dennis McLeod, “On Estimating the Cardinality of the Projection of a Database Relation”, ACM Transactions on Database Systems, Vol. 14, No. 1 (March 1989), pp. 28-40, incorporated by reference herein, (hereinafter referred to as [ARM 89]), exploited multi-variate distributions of the values in the database and semantic constraints to estimate the size of a query when correlations can occur, but only for a single table having no duplicate rows (whereas SQL allows duplicate rows).
Allen Van Gelder, “Multiple Join Size Estimation by Virtual Domains” (extended abstract), Procs. of ACM PODS Conference, Washington, D.C. (May 1993), pp. 180-189, incorporated by reference herein, (hereinafter referred to as [VG 93]), adjusted the selectivity of individual predicates based upon correlation statistics, so that the state-of-the-art techniques could be used unchanged. However, such adjustments under-estimate the cardinality for the partial QEPs that apply only a proper subset of such correlated predicates.
Viswanath Poosala and Yannis E. Ioannidis, “Selectivity Estimation Without the Attribute Value Independence Assumption”, Proc. of the 23rd Conference on Very Large Data Bases, Athens, Greece (1997), pp. 486-495, incorporated by reference herein, (hereinafter referred to as [PI 97]), also exploited multi-variate distributions, but on two columns only, summarized as 2-dimensional histograms that are further compressed using singular-value decomposition. This work did not deal with equality predicates (the most common form of predicates, especially for joins) or with correlations among more than two predicates.
Other references of interest include: B. Muthuswamy and Larry Kerschberg, “A Detailed Statistical Model for Relational Query Optimization”, Procs. of the ACM Annual Conference, Denver (October 1985), pp. 439-448, incorporated by reference herein, (hereinafter referred to as [MK 85]); and David Simmen, Eugene Shekita, and Timothy Malkemus, “Fundamental Techniques for Order Optimization”, Procs. 1996 ACM SIGMOD Conf. (May 1996), pp. 57-67, incorporated by reference herein, (hereinafter referred to as [Simmen 96]).
Notwithstanding these various prior art methods, there exists a need in the art for improved techniques for optimizing queries, especially through the use of estimated cardinality.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for incrementally estimating the cardinality of a derived relation when statistically correlated predicates are applied. A plurality of query execution plans (QEPs) are generated for the query. During the generation of the QEPs, a cardinality is computed for any of the QEPs in which two or more predicates are correlated to each other. The cardinality comprises a number of rows expected to be returned by the QEP and is computed in an incremental fashion for each operator of the QEP. The computations include calculations that may be done prior to the generation of the QEPs and calculations that are necessarily done as each operator of a QEP is added to that QEP. Thereafter, one of the QEPs is chosen to satisfy the query in a manner that minimizes an estimated cost metric, wherein the cost metric is computed using the cardinality.
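The division of work described above, with some calculations done before QEP generation and others done as each operator is added, can be pictured with a short sketch. The following is an illustrative assumption only and is not the claimed method: the statistics table, the predicate names, and the helper functions are hypothetical. It precomputes joint selectivities for correlated predicates and, as each operator is added, scales the running cardinality by the ratio of the joint selectivity after the operator to the joint selectivity before it.

```python
# Hypothetical precomputed statistics (assumed for illustration): the joint
# selectivity of each group of predicates known to be correlated.
JOINT_SELECTIVITY = {
    frozenset({"make"}): 0.02,
    frozenset({"model"}): 0.002,
    frozenset({"make", "model"}): 0.002,  # model implies make
}


def joint_selectivity(predicates: frozenset) -> float:
    """Combined selectivity of a predicate set; falls back to the
    independence assumption when no joint statistic was collected."""
    if not predicates:
        return 1.0
    if predicates in JOINT_SELECTIVITY:
        return JOINT_SELECTIVITY[predicates]
    product = 1.0
    for predicate in predicates:
        product *= JOINT_SELECTIVITY[frozenset({predicate})]
    return product


def add_operator(cardinality: float, applied: frozenset, new: frozenset):
    """Extend a partial plan by one operator that applies `new` predicates.
    The incremental factor is the ratio of joint selectivities."""
    after = applied | new
    factor = joint_selectivity(after) / joint_selectivity(applied)
    return cardinality * factor, after


def estimate(table_rows: int, predicate_order: list) -> list:
    """Return the estimated cardinality after each operator of a plan."""
    cardinality, applied, trace = float(table_rows), frozenset(), []
    for predicate in predicate_order:
        cardinality, applied = add_operator(
            cardinality, applied, frozenset({predicate})
        )
        trace.append(cardinality)
    return trace


# Two alternative plans that apply the same predicates in different orders.
plan_a = estimate(1_000_000, ["make", "model"])
plan_b = estimate(1_000_000, ["model", "make"])
print(plan_a, plan_b)
```

In this sketch both plans arrive at approximately the same final estimate regardless of the order in which the correlated predicates are applied, while each partial plan keeps an estimate based only on the predicates it has applied so far.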