Computer systems incorporating Relational Database Management System (RDBMS) software using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
A query optimizer function in the RDBMS is responsible for translating SQL statements into an efficient query execution plan (QEP). The QEP dictates the methods and sequences used for accessing tables, the methods used to join these tables, the placement of sorts, where predicates are applied, and so on. The QEP is interpreted by the RDBMS when the query is subsequently executed.
There may be a large number of feasible QEPs, even for a simple query. The optimizer determines the best of these alternatives by modeling the execution characteristics of each one and choosing the QEP that minimizes some optimization goal such as response time or use of system resources. See e.g., P. Griffiths Selinger, M. M. Astrahan, D. D Chamberlin, R. A. Lorie, and T. G. Price, “Access Path Selection in a Relational Database Management System”, Procs. 1979 ACM SIGMOD Conf. (May 1979), pp. 23-34 (hereinafter referred to as [Selinger 79]).
The optimizer may choose to minimize some estimated cost metric, such as resource consumption or elapsed time, wherein the most important factor in accurately computing any cost used during optimization is a cardinality estimate. The pioneering work in estimating the cardinality of a plan in an incremental fashion was described in [Selinger 79]. However, this work assumed that each predicate was independent and that values were distributed uniformly.
U.S. Pat. No. 4,956,774, issued September 1990 to Akira Shibamiya and R. Zimowski, entitled “Data base optimizer using most frequency values statistics” (hereinafter referred to as [Shibamiya 90], kept frequency statistics to drop the assumption of uniformity, but did not deal with the independence assumption.
U.S. Pat. No. 5,469,568, issued Nov. 21, 1995, to K. Bernhard Schiefer and Arun Swani, entitled “Method for choosing largest selectivities among eligible predicates of join equivalence classes for query optimization” (hereinafter referred to as [Schiefer 95]), derived a technique for computing cardinalities of joins only when the join (i.e., single-table) predicates and predicates whose correlation are somewhere between completely redundant and completely independent.
Rafiul Ahad, K. V. Bapa Rao, and Dennis McLeod, “On Estimating the Cardinality of the Projection of a Database Relation”, ACM Transactions on Databases, Vol. 14, No. 1 (March 1989), pp. 28-40 (hereinafter referred to as [ARM 89]), exploited multivariate distributions of the values in the database and semantic constraints to estimate the size of a query when correlations can occur, but only for a single table having no duplicate rows (which SQL allows).
Allen Van Gelder, “Multiple Join Size Estimation by Virtual Domains” (extended abstract), Proc. of ACM PODS Conference, Washington, D.C. (May 1993), pp. 180-189 (hereinafter referred to as [VG 93]), adjusted the selectivity of individual predicates based upon correlation statistics, so that the state-of-the-art techniques can be used unchanged. However, such adjustments underestimate the cardinality for the partial QEPs applying some proper subset of such correlated predicates.
Viswanath Poosala and Yannis E. Ioannidis, “Selectivity Estimation Without the Attribute Value Independence Assumption”, Proc. of the 23rd Conference on Very Large Data Bases, Athens, Greece (1997), pp. 486-495 (hereinafter referred to as [PI 97]), also exploited multi-variate distributions on two attributes only, summarized as 2-dimensional histograms that are further compressed using singular-value decomposition, but does not deal with equality predicates (the most common form of predicates, especially for joins) or correlations among more than two predicates.
Other references of interest include: B. Muthuswamy and L. Kerschberg, “A Detailed Statistical Model for Relational Query Optimization”, Proc. Of the ACM Annual Conference, Denver (October 1985), pp. 439-448 (hereinafter referred to as MK 85); and David Simmen, Eugene Shekita, and Timothy Malkemus, “Fundamental Techniques for Order Optimization”, Proc. 1996 ACM SIGMOD Conf. (May 1996), pp. 57-67 (hereinafter referred to as [Simmen 96]).
Notwithstanding these various prior art methods, there exists a need in the art for improved techniques for optimizing queries, especially through the use of estimated cardinality.
Query optimizers in relational database management systems rely on cost estimation to choose the best physical execution plan for a query. The accuracy of cost estimation is thus of critical importance because cost of plans can differ considerably and choosing a bad plan might result in extremely poor performance. One of the key elements in cost estimation for queries is the estimation of the number of distinct values in a set of attributes, and it has received considerable attention in the literature and commercial systems. For example, it may be interesting to estimate the number of groups that have the same “color” and “size” of a product. Wrong cardinality estimates often lead to poor choices for operation strategies involving the sets of attributes as well as poor access plan choices for subsequent portions of the plan.
Existing database query optimizers focus on the estimation of number of distinct values for a single attribute, while the problem of estimating the number of distinct value combinations in a set of attributes remains elusive. Conventional commercial database systems use pre-computed sets of attributes statistics that are computed by scanning the data. This could be done easily when a multi-attribute index is used to do the counting. The statistics could also be computed when the values in the attributes are “concatenated” in some way to compute the number of distinct values in a way similar to computing the distinct values in a single attribute. The main problem with pre-computed sets of attributes is that one has to know in advance which sets of attributes are of interest. This is not easily possible when the number of permutations and combinations of the sets of attributes is large. Various methodologies might be used in order to decide the interesting combinations including preprocessing the workload of queries or as some sort of feedback mechanism for future queries from looking at estimates versus the actual cardinalities of past queries.
The above methods do not lend themselves to arbitrary ad hoc queries. A method which uses the typical statistics that are collected on individual attributes used today is very primitive. In general, the product of the individual attribute distinct values produces the maximum number of possible combinations.
Accordingly, what is needed is a system and method for more accurately estimating the number of distinct values of combinations for a set of attributes in a database system. The present invention addresses such a need.