A query optimizer is a component within a database system that receives an incoming query and produces a plan for executing that query efficiently. The incoming query is generally written in a declarative high-level language such as, for example, structured query language (SQL). The resulting execution plan is forwarded to the query processor for execution. The optimizer may produce the execution plan by examining the physical access structures (e.g., existing table structures and indices of the database schema) as well as the distribution of data within the database. When the data or the access structures change, a pre-existing query can be re-optimized to produce a new, efficient execution plan that takes these changes into account without requiring application developers to rewrite the query.
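The cost-based selection described above can be illustrated with a minimal sketch. All names and the cost model here are hypothetical simplifications: a toy optimizer enumerates candidate access paths and chooses the one with the lowest estimated cost, based on table statistics.

```python
# Toy sketch of cost-based plan selection (illustrative names only):
# the "cost" of each access path is the estimated number of rows it touches.

def estimate_cost(plan, stats):
    """Return an estimated row count for a candidate access path."""
    if plan == "full_scan":
        # a full table scan reads every row
        return stats["table_rows"]
    if plan == "index_scan":
        # an index scan touches only the fraction of rows matching the predicate
        return stats["table_rows"] * stats["predicate_selectivity"]
    raise ValueError(f"unknown plan: {plan}")

def choose_plan(candidates, stats):
    """Pick the candidate plan with the lowest estimated cost."""
    return min(candidates, key=lambda p: estimate_cost(p, stats))

stats = {"table_rows": 1_000_000, "predicate_selectivity": 0.001}
print(choose_plan(["full_scan", "index_scan"], stats))  # index_scan
```

With a highly selective predicate the index scan is estimated to touch far fewer rows, so it is chosen; for an unselective predicate the same logic would favor the full scan.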
An optimizer may use statistical models to estimate the volume of data that will be processed in each step of an execution plan. These models are accurate most of the time, but they have limitations that lead to estimation errors, particularly for complex queries. For example, the models may sometimes be generated from statistics that are inaccurate and/or incomplete. Additionally, a query may contain one or more expressions for which the optimizer is incapable of estimating the selectivity, which is the comparative quantity of data (e.g., percentage of rows) that will be returned for a given condition. The difficulty of estimating selectivity generally depends, at least in part, on the complexity of a condition within the query. These errors in turn lead to the selection of sub-optimal execution plans, which may take much longer to execute than other valid execution plans.
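One common way to estimate the selectivity mentioned above is from a histogram of column values. The sketch below assumes a simple, hypothetical equi-width histogram structure; real systems use more elaborate statistics, but the principle is the same: selectivity is the matching row count divided by the total row count.

```python
# Hypothetical histogram-based selectivity estimate for a condition "col <= v".
# Each bucket is (upper_bound, row_count); buckets are sorted by upper_bound.

def estimate_selectivity(histogram, value):
    """Estimate the fraction of rows satisfying col <= value."""
    total = sum(count for _, count in histogram)
    matched = sum(count for upper, count in histogram if upper <= value)
    return matched / total

hist = [(10, 400), (20, 300), (30, 200), (40, 100)]
print(estimate_selectivity(hist, 20))  # 0.7
```

When the underlying statistics are stale or the condition is too complex to map onto the histogram (e.g., an opaque user-defined function), this estimate can be badly wrong, which is precisely the source of the sub-optimal plans described above.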
In addition to statistical limitations, another issue related to optimization is that, with respect to query execution, users of database systems are often risk averse, meaning that they place a priority on stability. In particular, execution plans do not always perform as predicted, and users may be willing to sacrifice a slight improvement in the efficiency of an execution plan for an assurance that the plan will work adequately and will not cause major problems. For example, consider a scenario in which a query is first executed according to a first execution plan that works adequately, providing a full set of results in a timely manner. Now suppose that the same query is later re-submitted to the database and the optimization process is performed a second time. During this second optimization, a second execution plan is generated that is estimated to be slightly more efficient than the first. Although this second plan may, in fact, be more efficient, the user may actually prefer for the query to be re-executed according to the first execution plan. This is because the first execution plan has already been demonstrated to work effectively, and the user is willing to trade off the expected efficiency improvement of the second plan for the proven results of the first.
To reduce the effects of the statistical errors and improve the stability of the optimization process, some conventional database systems employ an optimization methodology that is commonly referred to as “plan persistence.” This methodology enables an execution plan to be persisted or stored upon execution of a corresponding query. If the execution plan is successful, then it can be retrieved from memory and re-used at a later time whenever the same or a similar query is submitted to the database.
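The plan-persistence methodology described above can be sketched as a cache keyed by the query text. This is an illustrative simplification with hypothetical names: the query is normalized and hashed, full optimization runs only on a cache miss, and a repeated submission of the same (or a trivially respelled) query reuses the persisted plan.

```python
# Illustrative sketch of "plan persistence": cache a plan keyed by a hash of
# the normalized query text, so a repeated submission skips re-optimization.

import hashlib

plan_cache = {}

def plan_key(sql):
    # Normalize case and whitespace so trivially different spellings
    # of the same query map to the same cache key.
    normalized = " ".join(sql.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def get_or_optimize(sql, optimize):
    key = plan_key(sql)
    if key not in plan_cache:
        plan_cache[key] = optimize(sql)  # full optimization only on a miss
    return plan_cache[key]

optimizer_calls = []
def optimize(sql):
    optimizer_calls.append(sql)
    return ("PLAN_FOR", sql)

get_or_optimize("SELECT * FROM t", optimize)
get_or_optimize("select *  from t", optimize)
print(len(optimizer_calls))  # 1 -- the second submission reused the cached plan
```

Note that this sketch reuses whatever was cached unconditionally, which is exactly the weakness discussed next: nothing checks whether the persisted plan is still valid.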
While plan persistence provides some advantages over repeating the optimization process from scratch, there are also a number of drawbacks associated with it. In particular, plan persistence does not enable a persisted plan to be validated prior to a repeat usage. This is problematic because there are a number of scenarios in which a persisted plan may become invalid between the time that it is first used and the time that it is later re-used. For example, database schemas are constantly being updated to accommodate changing data. Accordingly, it is quite possible that, while a query plan is stored or persisted, the corresponding database schema may change. For example, a data index and/or data table column may be deleted from the schema. Such changes may result in the invalidation of a plan during the time period that it is persisted. Additionally, while a plan is persisted, it may be tampered with by hackers or other unauthorized users, corrupted by viruses or other security threats, or even modified in error by authorized users. This is particularly problematic when a plan is transferred over the Internet via electronic mail or is otherwise shared or distributed among various users. Furthermore, even if a persisted plan remains valid until it is retrieved for repeat usage, there are a number of scenarios in which the user may wish to edit an otherwise valid plan prior to its repeated use. For example, through experimentation or through analyzing the results of other similar queries, a user may determine that a persisted plan can be rendered more efficient through one or more local modifications or "tweaks." Additionally, a user and/or administrator may have specialized knowledge of an event such as, for example, a database schema change or security violation that would require a persisted plan to be edited.
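The kind of validation the passage says conventional plan persistence lacks could, in principle, take the form of metadata stored alongside each plan. The following is a hedged sketch under assumed, hypothetical structures: a checksum detects tampering with the plan body, and a recorded schema version detects that the schema changed while the plan was persisted.

```python
# Illustrative sketch of pre-reuse validation of a persisted plan:
# a checksum guards against tampering, and a recorded schema version
# guards against schema changes made while the plan was persisted.

import hashlib
import json

def persist(plan, schema_version):
    """Serialize a plan together with validation metadata (hypothetical format)."""
    body = json.dumps(plan, sort_keys=True)
    return {
        "body": body,
        "schema_version": schema_version,
        "checksum": hashlib.sha256(body.encode()).hexdigest(),
    }

def is_valid(record, current_schema_version):
    """Check a persisted plan before reuse: untampered and schema unchanged."""
    untampered = (
        hashlib.sha256(record["body"].encode()).hexdigest() == record["checksum"]
    )
    schema_unchanged = record["schema_version"] == current_schema_version
    return untampered and schema_unchanged

rec = persist({"steps": ["index_scan t_idx", "filter"]}, schema_version=7)
print(is_valid(rec, 7))   # True

rec["body"] = rec["body"].replace("index_scan", "full_scan")  # simulate tampering
print(is_valid(rec, 7))   # False -- checksum mismatch detected
```

A checksum alone would not help a legitimate edit (a "tweak" by the user), which would also need the metadata to be regenerated; that is part of the flexibility the improved approach below is meant to provide.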
Accordingly, there is a need in the art for an improved approach to the optimization process that reduces its statistical limitations and improves stability, while also avoiding the repeat use of invalid execution plans and providing flexibility and extensibility to edit execution plans prior to their repeat use. An improved approach to the optimization process with these and other advantages is set forth below.