One of the most difficult aspects of database query optimization is balancing the quality of the execution strategy that is found against the time spent finding it. A query optimizer is a component of a database management system that attempts to determine the most efficient way to execute a query. The output of an optimizer is typically referred to as a query plan or access plan, which is a form of executable code that can be processed by a database engine to execute the query. Many optimizers operate by selecting or generating multiple potential query plans for a given query, and then selecting an optimal query plan from among them.
Cost-based query optimizers typically operate by assigning an estimated “cost” to each possible query plan, and then choosing the plan with the least cost. The cost is an estimate of the runtime expense of evaluating the query, expressed in terms of the number of I/O operations required, the CPU requirements, and other factors.
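The selection step described above can be illustrated with a minimal sketch. The `PlanEstimate` type, the two weights, and the candidate figures are hypothetical and chosen only for illustration; a real optimizer would derive its estimates from statistics and would consider more factors than I/O and CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    io_ops: int       # estimated number of I/O operations
    cpu_units: float  # estimated CPU work, in arbitrary units


# Hypothetical weights that fold I/O and CPU estimates into one number.
IO_WEIGHT = 1.0
CPU_WEIGHT = 0.01


def estimated_cost(plan: PlanEstimate) -> float:
    """Combine the individual estimates into a single comparable cost."""
    return IO_WEIGHT * plan.io_ops + CPU_WEIGHT * plan.cpu_units


def choose_plan(plans):
    """Return the candidate plan with the least estimated cost."""
    return min(plans, key=estimated_cost)


# Made-up candidates for a single query.
candidates = [
    PlanEstimate("sequential scan", io_ops=10_000, cpu_units=50_000),
    PlanEstimate("index scan", io_ops=300, cpu_units=2_000),
    PlanEstimate("hash join then filter", io_ops=4_000, cpu_units=90_000),
]

best = choose_plan(candidates)
print(best.name)
```

With these made-up figures the index scan has the least combined cost, so it is the plan selected.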
A query optimizer internally has a number of strategies that it uses to generate the set of query plans examined. The strategies, such as join strategy, union strategy, index strategy, grouping strategy, ordering strategy, etc., may be called recursively, so a join strategy may call an ordering strategy for a leg which in turn may call an indexing strategy which may call another strategy. Each strategy examines the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loops). Thus, the search space can become quite large depending on the complexity of the query.
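To give a rough sense of how quickly this search space grows, the following sketch counts candidate plans under a deliberately simplified model. The model is an assumption, not a description of any particular optimizer: it considers only left-deep join orders, lets each table independently use one of `access_paths` access paths, and lets each join independently use one of `join_algorithms` algorithms.

```python
from math import factorial


def left_deep_plan_count(tables: int, access_paths: int = 2,
                         join_algorithms: int = 3) -> int:
    """Count plans in a simplified model: every ordering of the tables,
    times an access path choice per table, times a join algorithm
    choice per join (a query over n tables has n - 1 joins)."""
    if tables < 1:
        raise ValueError("at least one table is required")
    orderings = factorial(tables)
    access_choices = access_paths ** tables
    join_choices = join_algorithms ** (tables - 1)
    return orderings * access_choices * join_choices


for n in (1, 3, 5, 10):
    print(n, left_deep_plan_count(n))
```

Under these assumptions a three-table query already has 432 candidate plans, and the count grows faster than exponentially as tables are added.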
Given enough time, a query optimizer should be able to find the best plan by evaluating all possible query plans. However, in many cases it is impossible, or at least inadvisable, to try all possibilities. Depending on the complexity of the query, the search space of query plans could be so large that the time required to optimize the query exceeds the amount of time necessary for an unoptimized query to complete. Such an increase in optimization time would be unacceptable for many customers. Therefore, it is often only possible to attempt a subset of the query plans available for a given query, which may result in the selection of a suboptimal plan. The goal is to accomplish as much optimization as possible in the shortest amount of time, which may require deferring some of the optimization tasks.
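The trade-off can be sketched as a search that stops when a time budget is exhausted and returns the best plan seen so far. The function name, its parameters, and the injectable `clock` are assumptions made for illustration; plans are represented here simply by their costs.

```python
import time


def optimize_with_budget(plan_stream, cost_fn, budget_seconds,
                         clock=time.monotonic):
    """Examine plans until the budget runs out; return (best_plan, examined).

    At least one plan is always examined, so a plan is returned whenever
    plan_stream is non-empty.
    """
    deadline = clock() + budget_seconds
    best, best_cost = None, float("inf")
    examined = 0
    for plan in plan_stream:
        # Check the clock only after the first plan has been costed.
        if examined > 0 and clock() >= deadline:
            break
        cost = cost_fn(plan)
        examined += 1
        if cost < best_cost:
            best, best_cost = plan, cost
    return best, examined


# A generous budget lets the search reach the cheapest plan.
plans = [50, 40, 30, 10, 5]
print(optimize_with_budget(plans, lambda p: p, budget_seconds=60.0))
```

When the budget expires early, the search returns whichever plan was cheapest among those examined, which is exactly how a suboptimal plan comes to be selected.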
Optimization consists of multiple steps, including generating query plans, collecting statistics on the data related to the query plans, using the statistics to estimate the resource costs of each plan, and selecting the plan with the lowest estimated resource cost. One of the most resource-intensive of these tasks is collecting the statistics. Statistical information for the data in the underlying database may relate to tables and their indices, objects in a schema, objects in a database, etc. While some statistics may be collected with relatively low overhead, other, more detailed statistics require “deep” statistics collections that consume extensive amounts of resources. In many instances, the resource requirements for some deep statistics collections preclude their performance on production systems due to time and resource constraints. When collecting statistics for a query in a production environment, there is a need to look at the current CPU utilization and the resources available, and then to perform the statistics collection without unduly loading the system. If a full statistics collection would overload the system, a partial statistics collection may be executed, with the remaining collection deferred until later.
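The decision between a full and a partial statistics collection can be sketched as follows. The task names, the per-task load estimates, and the 80 percent CPU ceiling are all hypothetical; loads are given as whole percentage points of CPU so that the arithmetic stays exact.

```python
def split_collection(tasks, cpu_utilization_pct, cpu_ceiling_pct=80):
    """Split statistics tasks into those run now and those deferred.

    tasks is a list of (name, estimated_load_pct) pairs in priority
    order. A task runs now only if it fits in the remaining headroom.
    """
    headroom = max(0, cpu_ceiling_pct - cpu_utilization_pct)
    run_now, deferred = [], []
    for name, load in tasks:
        if load <= headroom:
            run_now.append(name)
            headroom -= load
        else:
            deferred.append(name)
    return run_now, deferred


# Hypothetical tasks: cheap counts first, an expensive deep collection third.
tasks = [
    ("row counts", 2),
    ("column cardinality", 10),
    ("deep histogram", 35),
    ("index key distribution", 8),
]

run_now, deferred = split_collection(tasks, cpu_utilization_pct=60)
print(run_now)
print(deferred)
```

With the system at 60 percent utilization, the three inexpensive collections fit within the headroom and the deep histogram is deferred.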
An example of an approach that often yields suboptimal plans as a result of this shortage of time and resources is a plan based on an indexing strategy. Indexing is an optimization technique that uses indices built over columns to provide quicker access to the necessary data. The problems that the rush to get through optimization causes for the indexing strategy are twofold. First, in order to use an index, the optimizer must be able to represent the query's predicates in a form that can be matched efficiently and correctly against the available indices. To accomplish this, predicates are represented in their disjunctive normal form (“DNF”). Unfortunately, building the DNF for a set of predicates can be very time consuming, as the time needed for DNF creation rises dramatically as the number of predicates and the complexity of the predicates involved increase. Therefore, depending on the set of predicates involved, the DNF conversion is often judged far too time-consuming and is not done, and the indexing strategy, in turn, cannot be attempted.
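The growth in DNF size can be seen in a small sketch that distributes AND over OR. The input representation is an assumption made for illustration: a list of OR-clauses that are ANDed together, with predicates as plain strings. The sketch handles no negations and performs no simplification, both of which a real implementation would need.

```python
from itertools import product


def cnf_to_dnf(clauses):
    """Convert an AND of OR-clauses into an OR of AND-terms.

    clauses is a list of lists of predicate names. Each resulting term
    takes one predicate from every clause.
    """
    return {frozenset(choice) for choice in product(*clauses)}


# (a OR b) AND (c OR d) expands to four AND-terms.
small = cnf_to_dnf([["a", "b"], ["c", "d"]])
print(len(small))

# Ten two-way OR clauses over distinct predicates.
large = cnf_to_dnf([[f"p{i}", f"q{i}"] for i in range(10)])
print(len(large))
```

For this particular shape of input, k two-way OR clauses over distinct predicates produce 2^k terms, so ten clauses already yield 1024 terms.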
Second, even if the DNF is created, the optimizer is often unable to try all possible index combinations. In most cases, the columns that are covered by each index do not match up perfectly with the columns referenced by the query. It is especially common to have an index that covers only a subset of the columns referenced by the query. In such a case, it can be beneficial to use multiple indices together to provide access to all necessary columns. One of the most important jobs of an optimizer can be to find a plan with the best possible combination of indices that includes all necessary columns. However, as the number of columns in a table and the number of indices over the table increase, the time taken to run this indexing strategy increases exponentially, so the optimizer must employ some strategy to minimize the time taken. Often, this strategy is to search through the plans containing the index combinations until one is found that is “good enough”, which frequently means only that all columns are covered, not that the combination is optimal.
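The difference between a “good enough” combination and the best one can be sketched by comparing a search that stops at the first covering combination with one that examines every combination. The index names, covered columns, and costs below are hypothetical, and the exhaustive search is shown only to make the contrast concrete.

```python
from itertools import combinations


def covers(combo, indexes, needed):
    """True if the chosen indexes together cover every needed column."""
    covered = set()
    for name in combo:
        covered |= indexes[name][0]
    return needed <= covered


def combo_cost(combo, indexes):
    return sum(indexes[name][1] for name in combo)


def first_covering(indexes, needed):
    """Stop at the first combination that covers all needed columns."""
    names = list(indexes)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            if covers(combo, indexes, needed):
                return combo
    return None


def cheapest_covering(indexes, needed):
    """Examine every combination (2^n - 1 of them) and keep the cheapest."""
    names = list(indexes)
    best = None
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            if covers(combo, indexes, needed) and (
                best is None
                or combo_cost(combo, indexes) < combo_cost(best, indexes)
            ):
                best = combo
    return best


# Hypothetical indexes: name -> (columns covered, estimated cost).
indexes = {
    "ix_a_b": ({"a", "b"}, 40),
    "ix_c": ({"c"}, 30),
    "ix_b_c": ({"b", "c"}, 15),
    "ix_a": ({"a"}, 10),
}
needed = {"a", "b", "c"}

quick = first_covering(indexes, needed)
best = cheapest_covering(indexes, needed)
print(quick, combo_cost(quick, indexes))
print(best, combo_cost(best, indexes))
```

Both results cover all three columns, but with these made-up costs the first combination found costs 70 while the cheapest costs 25.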
Several strategies exist for dealing with optimization techniques that take too long. The primary strategy for handling situations in which optimization may become overly time consuming, as with the indexing strategy above, is simply to find a plan that is “good enough”, that is, one sufficient to provide reasonably good performance. In other words, the optimizer will work until it finds a plan that works, but will not necessarily continue searching for better performing plans. Unfortunately, with very large databases or complex queries, even a plan that is considered “good enough” may be far inferior to other plans that are never attempted.
Another strategy that is often employed is simply not to try a technique at all if it is deemed to be too time-consuming for the given query. In the DNF case above, the predicates are given a score based on the complexity of the predicates, the number of predicates, and how the predicates are combined. If the complexity score is too high, the optimizer does not even attempt to create a DNF, thus saving time but possibly overlooking a number of strong plans. The obvious drawback is that entire optimization techniques are dismissed, and with them potentially the best performing plans.
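One possible form of such a score, offered only as an assumption, is an estimate of how many terms the DNF would contain: a single predicate counts as one term, an OR adds the counts of its operands, and an AND multiplies them. The tuple representation of predicates and the threshold of 64 are hypothetical.

```python
from math import prod


def estimated_dnf_terms(node):
    """Estimate the number of DNF terms without building the DNF.

    A node is either a predicate name (str) or a pair (op, children)
    with op equal to "and" or "or".
    """
    if isinstance(node, str):
        return 1
    op, children = node
    counts = [estimated_dnf_terms(child) for child in children]
    if op == "or":
        return sum(counts)
    if op == "and":
        return prod(counts)
    raise ValueError(f"unknown operator: {op!r}")


def should_build_dnf(node, max_terms=64):
    """Gate the DNF conversion on the estimated size."""
    return estimated_dnf_terms(node) <= max_terms


simple = ("and", [("or", ["a", "b"]), "c"])
complex_query = ("and", [("or", [f"p{i}", f"q{i}"]) for i in range(8)])

print(estimated_dnf_terms(simple), should_build_dnf(simple))
print(estimated_dnf_terms(complex_query), should_build_dnf(complex_query))
```

The second query scores 256 and is rejected outright, which saves the conversion time but also means that no index-based plan for it is ever considered.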
Still another strategy is to save a plan for a given query and, if possible, reuse or expand on this plan in future runs of the same query. The problem with this strategy is that it is only helpful for future runs of the exact same query. Even with minimal changes to the query, the optimizer will be forced to completely restart each optimization strategy. Another problem is that the entire query plan is saved, rather than just the portions that are unfinished or can be expanded on. This requires extra space on the system and may cause optimization problems when saving and loading the cached plans.
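The exact-match limitation can be illustrated with a minimal sketch of a plan cache keyed on the query text. The `PlanCache` class and the stand-in optimizer are hypothetical; the sketch stores whole plans, as the strategy described above does.

```python
class PlanCache:
    """Cache whole plans, keyed by the exact text of the query."""

    def __init__(self):
        self._plans = {}

    def get_or_optimize(self, sql, optimize):
        """Return (plan, was_cached); run the optimizer only on a miss."""
        if sql in self._plans:
            return self._plans[sql], True
        plan = optimize(sql)
        self._plans[sql] = plan
        return plan, False


optimizer_runs = []


def fake_optimize(sql):
    # Stand-in for a full optimization pass; records each invocation.
    optimizer_runs.append(sql)
    return f"plan for: {sql}"


cache = PlanCache()
first = cache.get_or_optimize("SELECT * FROM t WHERE a = 1", fake_optimize)
repeat = cache.get_or_optimize("SELECT * FROM t WHERE a = 1", fake_optimize)
changed = cache.get_or_optimize("SELECT * FROM t WHERE a = 2", fake_optimize)

print(first[1], repeat[1], changed[1])
print(len(optimizer_runs))
```

The repeated query is served from the cache, but changing a single literal produces a miss and a second full optimization.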