1. Field of the Invention
The present invention relates to the field of query optimization and, more specifically, to estimating cardinality based on query execution feedback.
2. Description of the Prior Art
A query optimizer is used to determine the best plan for executing the query. To select such an execution plan, the optimizer enumerates a collection of candidate plans, and picks the one with the least anticipated execution cost. An execution plan is composed of a number of steps of data processing, whose cost is derived based on the particular execution algorithm considered, and the estimated number of rows to process. The quality of plans generated by the optimizer is tied to the accuracy of its cost estimation. Incorrect estimation may lead the optimizer to regard some plans as efficient, when in reality they are very expensive to execute. As effective optimization and good physical design can introduce dramatic performance improvements, so selecting the wrong execution plan can lead to dramatic slowdowns.
In conventional query optimization methods, the number of rows to be used in a query is generally estimated using statistics information about the data stored in the database. Such statistics are typically gathered by executing special purpose queries or processes, which read all or part of a database table, perform some analysis, and store the results for later use by the query optimizer. Statistics gathering may be triggered automatically based on the columns used to execute a query.
Gathering statistics for the purpose of query optimization has a number a drawbacks. For example, separate special purpose processes to collect statistics overlap considerably with regular query execution and the resulting overhead may be significant. For systems that trigger such processes automatically, the processes may introduce unexpected increases to the system load.
Furthermore, statistics that are gathered may be over-inclusive or under-inclusive with respect to queries. Statistics-gathering processes are generic and they collect information that is not particularly tied to the current system workload. Thus, statistics may be unnecessarily gathered for tables or portions of tables that are not requested by the queries. For example, statistics may be unnecessarily gathered for a portion of a table covering the year 2001 when queries request data from only from the year 2002. Additionally, if statistics exist for two columns, then statistics on the combination of the columns may be needed if the information is correlated. However, it is impossible to determine whether the information is correlated simply by looking at the query syntax and examining all pairs of columns in the schema is cost prohibitive.
Another drawback of statistics gathering is that, as the database changes, statistics become invalid and must be updated. In manual update systems, statistics are typically updated by database administrators, who generally do not have tools to measure the degree of staleness of statistics and the impact of such staleness on plan selection. In automatic systems, statistics are updated based on a number of changes made to base tables. Such automatic systems employ simple counters and heuristics to decide when it is necessary to re-compute statistics, thereby requiring extra processing. In both manual and automatic systems, statistics are sometimes computed when there is no need and are also sometimes not computed when there is a need.
Yet another drawback of statistics gathering is that, when such statistics are used in cardinality estimation, errors grow as estimation is done on top of estimation, so that after several filters and joins, the estimated cardinality may be way off the actual. In addition, there are constructs that simply cannot be estimated based on statistics of base table columns. The standard approach when such constructs are encountered is to use a “guess” or “magic number,” such as the well known ⅓ data reduction factor for inequality comparisons and 1/10 data reduction factor for equality.
The problem with inaccurate estimation is not only performance, but the introduction of erratic behavior. Queries are “unstable” when there are estimation errors in their optimization. The reason is switch-over points. For example, it is known that index lookup is efficient when there are few rows to lookup, and table scan is preferable if access to all the rows is expected; for cases in between, a choice must be made and there is a switch-over point between the two alternatives. Similar switch-over points occur in join order, and in selection of various execution algorithms. When a query is close to a switch-over point, minor changes can make it choose one or the other option, which is acceptable as long as the estimation is correct. However, if a query is incorrectly estimated close to a switch-over point, the plan generated will appear to be randomly chosen, with vastly different performance.
Adding or removing a simple, non-selective condition in a query may result in dramatic execution slowdown; or a query that used to run well may suddenly become very slow, after some rows are added to one of the base tables. Two queries that are very close in form and semantics may perform very differently, if one of them happens to use a construct that is not supported by the cardinality estimation model. Software upgrades, and even re-computation of statistics can introduce unpredictable changes in query plans and performance degradation. Such behavior confuses and frustrates developers and database administrators. Conventional systems fail to deliver on the goal of high level, declarative database languages.
Thus, there is a need in the art for systems and methods for cardinality estimation in which statistics are created and updated automatically based on continuous observation of query execution. It is desired that such systems and methods reduce the overhead involved with statistics gathering, improve the relevance of statistics to a workload, reduce sensitivity of statistics to data changes, and improve the accuracy of cardinality estimation.