1. Field of the Invention
The present invention relates to optimization in a database system.
2. Background
In a database system, optimization is the process of choosing an efficient way to execute a database query or manipulation action. Examples of such query or manipulation actions include searching, retrieving, modifying, organizing, adding, and/or deleting information from the database. These database query/manipulation actions are normally initiated by submitting commands to a database server in a database query language. One popular database query language is known as the Structured Query Language (xe2x80x9cSQLxe2x80x9d). For the purposes of explanation only, and not by way of limitation, the following description is made with particular reference to database statements involving SQL.
To execute a database query language statement (e.g., a SQL statement), the database system may have to perform steps involving the retrieval or manipulation of data from various database structures, such as tables and indexes. Often, there exists many alternate ways to execute the SQL statement. For example, a single SQL statement can be executed in different ways by varying the order in which tables and indexes are accessed to execute the statement. The exact combination and order of steps taken to execute the SQL statement can drastically change the efficiency or speed of execution for the statement. The exact combination and order of steps that are used to execute a SQL statement is referred to as an xe2x80x9cexecution plan.xe2x80x9d
As an example, consider the following SQL statement, which queries for the name of all employees having a salary equal to 100 from a database table xe2x80x9cemp_tablexe2x80x9d:
SELECT employee_name
FROM emp_table
WHERE salary=100
A first execution plan could include the step of performing a full table scan of emp_table to execute the query. This first execution plan would retrieve every row from emp_table to identify particular rows that match the WHERE clause. Alternatively, if an index exists for the xe2x80x9csalaryxe2x80x9d column of emp_table, then a second execution plan could involve accessing the index to identify rows that match the WHERE clause, and thereafter retrieving only those identified rows from the table. The index is considered an alternate xe2x80x9caccess pathxe2x80x9d to the data sought by the SQL statement.
Each execution plan has a xe2x80x9ccostxe2x80x9d that is associated with its execution. The cost of an execution plan can be expressed in terms of the resources that are consumed to execute the SQL statement using that execution plan. For example, the cost of an execution plan can be expressed in units of I/O usage, CPU usage, network usage, memory usage, or a single numerical value that combines several of these units.
An xe2x80x9coptimizerxe2x80x9d is used by a database system to choose what is believed to be the most efficient execution plan for a SQL statement. A xe2x80x9ccost-basedxe2x80x9d optimizer bases its decision upon the costs of each execution plan. The cost-based optimizer typically generates a set of potential execution plans for the SQL statement based upon available access paths for the data sought to be operated upon by that statement. The cost is then estimated for each execution plan based upon, for example, data distribution and storage characteristics for database structures holding relevant data for the SQL statement. The optimizer then compares relative costs of the execution plans to choose the one with the smallest cost.
The cost-based optimizer may use statistics to estimate the cost of the execution plans. Statistics are used to quantify the data distribution and/or storage characteristics of data in database structures. For example, with reference to the SQL statement example set forth above, statistics may be kept for the distribution of values in the xe2x80x9csalaryxe2x80x9d column of the table xe2x80x9cemp_table.xe2x80x9d Selectivity estimates can be performed by taking into account the data skew of data values. Selectivity is normally calculated with reference to the statistics, and can be stated as the percentage of entries within a schema object that satisfies a given predicate.
The cost of an execution plan can be estimated based upon the statistics and selectivity associated with terms within the SQL statement predicate. As an example, consider if an index exists upon the xe2x80x9csalaryxe2x80x9d column for the above SQL statement example. If so, then the following is an example of a cost calculation that can be used to estimate the cost of an execution plan that uses an index to execute the above SQL statement:   COST  =                    (                  cost          ⁢                      xe2x80x83                    ⁢          of          ⁢                      xe2x80x83                    ⁢          access          ⁢                      xe2x80x83                    ⁢          for          ⁢                      xe2x80x83                    ⁢          a          ⁢                      xe2x80x83                    ⁢          single          ⁢                      xe2x80x83                    ⁢          row                )            *              (        selectivity        )            *              (                  number          ⁢                      xe2x80x83                    ⁢          of          ⁢                      xe2x80x83                    ⁢          rows          ⁢                      xe2x80x83                    ⁢          in          ⁢                      xe2x80x83                    ⁢          table                )              +          (              cost        ⁢                  xe2x80x83                ⁢        of        ⁢                  xe2x80x83                ⁢        index        ⁢                  xe2x80x83                ⁢        access            )      
An example of a cost calculation for an execution plan that performs a full table scan is expressed as follows:
COST=(cost of access for a single row of table)*(number of rows in table)
Based upon such cost calculations, an optimizer can make a determination as to which of these execution plans is relatively less costly.
Typical database systems have built-in support for certain xe2x80x9csystem-suppliedxe2x80x9d objects, such as built-in data types, functions, and access methods. For example, xe2x80x9cCHARxe2x80x9d is a common built-in data type that is used to store character strings. B+Trees and Hash Indexes are two examples of built-in access methods that can be found in conventional database systems.
In recent years, databases are being used to store different types of data, such as spatial, image, video, and audio data. Often, these data types are not native to the database system; it is unrealistic to attempt to provide native support for all possible data types since it is impossible to foresee all possible types of complex data that may be created for storage in the database. Therefore, some database systems can be configured to allow non-native or xe2x80x9cuser-definedxe2x80x9d data types to be defined for the database. For many of these non-native data types, system-supplied access methods and functions cannot be applied to operate upon them, since the structure and characteristics of these data types are not known or supported by the system-supplied access methods and functions. To provide efficient data operations upon these data types, non-native or user-defined access methods and functions can also be defined to extend the operation of the database system.
Because the structure, operation, and characteristics of built-in objects are known to the database system, and to the designers of the system-supplied database optimizer, traditional optimizers can generate cost estimates for execution plans involving such native objects or access methods. These cost estimates can be used to choose an optimal execution plan for a SQL statement involving built-in objects.
Unfortunately, traditional optimizers encounter significant problems attempting to generate an optimal execution plan if the SQL statement involves non-native objects, such as user-defined data types, functions, and access methods. This results because the system-supplied cost, selectivity, and statistics functions are not specifically configured to recognize or work with non-system-supplied entities, operations, and access methods. If the optimizer cannot accurately estimate the cost of an execution plan involving non-native objects, then relatively inefficient or slow execution plans may be mistakenly chosen by the optimizer.
One approach that can be used to address this problem is to utilize default cost, selectivity, or statistics values to determine the relative costs between execution plans. If a non-native objects is encountered by the optimizer, default values would be used to generate the cost of an execution plan. However, using default values to calculate the cost of an execution plan results in what is at best a gross approximation of the true cost of the execution plan. Furthermore, such default values do not take into account any parameters that may be passed to a predicate involving a user-defined object. This may result in the selection of an execution plan that has a significantly higher true cost than other execution plans that could have been chosen.
Therefore, there is a need for a method and mechanism that can optimize the execution of database statements that involves non-system supplied entities, objects, or operations.
A method and system for optimizing the execution of database statement is described. An aspect of the invention is related to providing a framework for the creation, association, and use of optimizer-related functions to calculate the cost of execution plans involving non-native database entities. Examples of these optimizer-related functions include selectivity, statistics, and cost functions. Another aspect of the invention is related to the optimization of database statements having non-boolean predicates. Yet another aspect of the invention is related to the determination and use of different granularity levels of information for submitting database statement predicates to optimizer-related functions. The invention can also be applied to define optimizer-related functions for system-supplied objects and entities.
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.