The present invention relates to the field of database optimizing and, more particularly, to automatically recommending statistical views (statviews) using cost/benefit metrics for improved intermediate result estimations during query optimization operations.
A query optimizer is the component of a database management system that attempts to determine the most efficient way to execute a query. The optimizer considers the possible query plans for a given input query, and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated cost to each possible query plan, and choose the plan with the smallest cost. The set of query plans can be formed by examining the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loops).
The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining three tables A, B, C of size 10 rows, 1,000,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders-of-magnitude more time to execute than one that joins A and C first. Joining A and C first could have the advantage of a smaller size result that can in turn be joined to B more efficiently than if we were to join the two big tables first. Most query optimizers determine join order via a dynamic programming algorithm, although other optimization algorithms exist and can be utilized.
Cost-based query optimizers rely on a cost model to choose the best possible query plan for a given query. The accuracy of cost estimates can directly affect the quality of the query plan. That is, when costs estimates are inaccurate, non-efficient plans can be selected over more efficient ones, which can significantly affect query performance. Cost estimates depend mainly on cardinality estimations of various sub-plans (intermediate results) generated during optimization.
Statistics that capture the logical characteristics and physical layout of the data are important to cost based optimizers for accurately estimating the execution cost of the access plans. Logical characteristics can include, but are not limited to, the sizes of the tables, the number of distinct values in a column of a table, the frequency and histogram statistics that capture the distribution of the data values in the columns, and the like. Physical characteristics can include, but are not limited to, the clustering and the number of pages that a table or index occupies.
Using basic statistics on base tables requires adopting unrealistic assumptions to estimate the cardinalities of intermediate results, which usually cause large estimation errors that can be off by several orders of magnitude. For example, the basic statistics might give the optimizer the information about the individual columns but may not capture correlation of values in one column with the values in another column. Modern commercial database systems support statistical or sample views, which give more accurate statistics on intermediate results and query sub-expressions. There is currently no mechanism to automatically recommend the most beneficial statistical views to construct for estimating intermediate results during query optimization operations.