1. Field
The present invention relates generally to databases and, more particularly, to query optimization in a database.
2. Background
Databases commonly organize data in the form of tables, where each table has a number of rows and columns. Each row in the table generally has a data value associated with each of the columns, where this intersection of a row and a column is commonly referred to as a cell. A system needing access to data in the database typically issues a request in the form of a query. The query usually involves a request for the data contained in one or more cells of any rows which meet a specific set of conditions. Such a condition often involves the comparison of the values of cells in a column to some other value to determine whether the row associated with the compared cell meets that condition. Conditions can be divided into two categories: local conditions and non-local conditions. Local conditions are those that are associated with a single table because any column referenced within the condition is from one table (e.g., “TABLE1.COLUMN1=100” and “TABLE1.COLUMN1=TABLE1.COLUMN2+1”). Non-local conditions are referred to as one of the following: join conditions (i.e., columns from multiple tables); having conditions (i.e., aggregate functions); or, prefilter conditions (i.e., no columns).
Traditionally, in performing a query, modern database management systems (“DBMSs”) translate declarative Structured Query Language (“SQL”) statements into an executable plan, also known as a Query Execution Plan (“QEP”), prior to the actual execution of the query. In a SQL DBMS, a number of methods are available to execute a given query, where each method varies in performance and resource consumption. A query optimizer evaluates these methods and attempts to determine the most efficient way to perform the query.
In a query involving two or more tables, the performance of a QEP is often determined largely by the order in which the tables are joined. For example, when joining tables A, B, and C of size 10 rows, 100 rows, and 1,000 rows, respectively, a QEP that joins tables A and C first may take more time to execute than one that executes B and C first. More specifically, a query optimizer computes all plausible methods to access tables A, B, and C. For each table, the optimizer estimates the least expensive way to access that table, evaluates local condition associated with that table, and keeps track of any associated sort order among the records. The optimizer then considers the join algorithms available to the SQL DBMS to generate a two-table QEP for the combination of each pair of tables (i.e., combination of tables A and B, combination of tables B and C, and combination of tables A and C). The optimizer estimates the least expensive way to join each pair of tables, as well as the least expensive way to join each pair of tables to generate an output according to a particular sort order. Based on this information, a three-table QEP is computed by joining each two-table QEP with the remaining table in the query (e.g., a two-table QEP containing tables A and C may be joined with table B to form a three-table QEP). As a result, three QEPs may be generated for tables A, B, and C, where each QEP includes a specific join order and the join algorithm for each join. The query optimizer then chooses one of the join algorithms to perform the actual query.
In choosing a join algorithm, the query optimizer bases its determination on a cost model. The cost model is a mathematical model that determines an estimate of the execution cost of the algorithm (e.g., I/O costs, CPU costs, and communication costs). Accurate cost estimation requires an accurate estimation of cardinalities of each processing step in the query, where a cardinality refers to a count of a number of rows that participate in the executed query step. In conventional methods, cardinalities are estimated based on statistical methods applied to data in a database. The statistical information is typically gathered by executing special purpose queries, which read all or part of a database table, performing an analysis on the special purpose queries, and storing the results of the analysis for later use by a query optimizer. Oftentimes, the statistical information is under- or over-inclusive with respect to a particular query, yielding inaccurate results. Alternatively, the estimates may come from information stored in auxiliary data structures referred to as indexes. For complex queries involving several tables and conditions, cardinality estimation using these methods may be highly inaccurate such that the query optimizer chooses an inefficient join algorithm or join order, resulting in a long query execution time and an inefficient use of computing resources. Once a sub-optimal join algorithm is chosen for a query, it is very difficult to modify the algorithm during query execution.
Accordingly, what is desired is a more accurate means of estimating cardinality such that a query optimizer chooses an more efficient QEP.