1. Field of the Invention
The present invention relates to optimization in a database system.
2. Background
In a database system, optimization is the process of choosing an efficient way to execute a database query or manipulation action. Examples of such query or manipulation actions include searching, retrieving, modifying, organizing, adding, and/or deleting information from the database. These database query/manipulation actions are normally initiated by submitting commands to a database server in a database query language. One popular database query language is known as the Structured Query Language (xe2x80x9cSQLxe2x80x9d). For the purposes of explanation only, and not by way of limitation, the following description is made with particular reference to database statements involving SQL.
To execute a database query language statement (e.g., a SQL statement), the database system may have to perform steps involving the retrieval or manipulation of data from various database structures, such as tables and indexes. Often, there exists many alternate ways to execute the SQL statement. For example, a single SQL statement can be executed in different ways by varying the order in which tables and indexes are accessed to execute the statement. The exact combination and order of steps taken to execute the SQL statement can drastically change the efficiency or speed of execution for the statement. The combination and order of steps that are used to execute a SQL statement is referred to as an xe2x80x9cexecution plan.xe2x80x9d
As an example, consider the following SQL statement, which queries for the name of all employees having a salary equal to 100 from a database table xe2x80x9cemp_tablexe2x80x9d:
SELECT employee_name
FROM emp_table
WHERE salary=100
A first execution plan could include the step of performing a full table scan of emp_table to execute the query. This first execution plan would retrieve every row from emp_table to identify particular rows that match the WHERE clause. Alternatively, if an index exists for the xe2x80x9csalaryxe2x80x9d column of emp_table, then a second execution plan could involve accessing the index to identify rows that match the WHERE clause, and thereafter retrieving only those identified rows from the table. The index is considered an alternate xe2x80x9caccess pathxe2x80x9d to the data sought by the SQL statement.
Each execution plan has a xe2x80x9ccostxe2x80x9d that is associated with its execution. The cost of an execution plan can be expressed in terms of the resources that are consumed to execute the SQL statement using that execution plan. For example, the cost of an execution plan can be expressed in units of I/O usage, CPU usage, network usage, memory usage, or a single numerical value that combines several of these units.
An xe2x80x9coptimizerxe2x80x9d is used by a database system to choose what is believed to be the most efficient execution plan for a SQL statement. A xe2x80x9ccost-basedxe2x80x9d optimizer bases its decision upon the costs of each execution plan. The cost-based optimizer typically generates a set of potential execution plans for the SQL statement based upon available access paths for the data sought to be operated upon by that statement. The cost is then estimated for each execution plan based upon, for example, data distribution and storage characteristics for database structures holding relevant data for the SQL statement. The optimizer then compares relative costs of the execution plans to choose the one with the smallest cost.
The cost-based optimizer may use statistics to estimate the cost of the execution plans. Statistics are used to quantify the data distribution and/or storage characteristics of data in database structures. For example, with reference to the SQL statement example set forth above, statistics may be kept for the distribution of values in the xe2x80x9csalaryxe2x80x9d column of the table xe2x80x9cemp_table.xe2x80x9d Selectivity estimates can be performed by taking into account the data skew of data values. Selectivity is normally calculated with reference to the statistics, and can be stated as the percentage of entries within a schema object that satisfies a given predicate.
The cost of an execution plan can be estimated based upon the statistics and selectivity associated with terms within the SQL statement predicate. As an example, consider if an index exists upon the xe2x80x9csalaryxe2x80x9d column for the above SQL statement example. If so, then the following is an example of a cost calculation that can be used to estimate the cost of an execution plan that uses an index to execute the above SQL statement:
COST=(cost of access for a single row)*(selectivity)*(number of rows in table)+(cost of index access)
An example of a cost calculation for an execution plan that performs a full table scan is expressed as follows:
COST=(cost of access for a single row of table)*(number of rows in table)
Based upon such cost calculations, an optimizer can make a determination as to which of these execution plans is relatively less costly.
Optimization of a database statement normally occurs at compilation time. If a predicate in the database statement contains one or more arguments, then the values of the arguments must be known at compile-time for optimizers to make an effective estimation of selectivity for the predicate, and thus the cost for the database statement. Unfortunately, if the values of one or more arguments are not known at compile-time, then these values cannot be passed to the optimizer. Without this information, conventional optimizers have great difficulty generating accurate predicate selectivity or cost for the database statement.
Consider the following SQL statement:
SELECT *
FROM emp_table
WHERE emp_number less than :x
This database statement queries for all entries from table emp_table in which the emp_number column is less than the value of a bind variable xe2x80x9c:xxe2x80x9d. The value of a bind variable is not necessarily known at compile-time, but instead becomes known at run-time. Since optimization occurs at compile-time, the optimizer thus does not always know the value of the bind variable :x. Therefore, it becomes difficult to estimate a selectivity value for emp-number less than :x, which greatly affects the optimizer""s ability to select an optimal execution plan (e.g., one that uses either an index scan or table scan).
One approach that can be taken to estimate a selectivity value is to employ a default selectivity value if a bind variable is encountered in a database statement. Typically, the default selectivity value is used regardless of exact contents or operator of the database statement. The optimizer would use the default selectivity value to compute a cost estimate for an execution plan involving that database statement. However, using a default value to calculate the cost of an execution plan may result in what is at best a gross approximation of the true cost of the execution plan. Furthermore, such default values do not take into account any parameters that may be passed to a database statement predicate. This may result in the selection of an execution plan that has a significantly higher true cost than other execution plans that could have been chosen.
An alternate approach is to heuristically guess a selectivity value. For example, a small value for the selectivity of a database statement can be assumed if the involved column is indexed. This assumption is typically employed whenever a bind variable is used as a boundary value in a condition with one of the following operators: less than , greater than , less than =, or  greater than =. The optimizer would heuristically guess a small selectivity value for indexed columns to favor the use of indexes. The drawback to this approach is that the assumptions used in the heuristic guess do not always match up to real world conditions. For example, the assumption that an execution plan involving an index access is always more efficient than alternate execution plans may be grossly incorrect for a given query of an indexed column, depending upon the statistics for that column and the exact values of the arguments involved.
Therefore, there is a need for a method and mechanism that can optimize the execution of the database statements involving that have values which are not known at compile-time.
A method and system for optimizing the execution of database statements are described. An aspect of the present invention is directed to the generation of an execution plan for a database statement, wherein the database statement contains a predicate having one or more arguments whose values are not known at the time the execution plan is generated (e.g., at compile time). A feature of this aspect of the invention involves passing a description of the argument(s) to the optimizer. An example of such a description includes the argument type of the predicate argument.
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.