The present invention generally relates to database processing systems, and more specifically to optimizing SQL database queries to optimize processing of multi-column database indexes.
Relational database optimizers have the task of converting the xe2x80x9cwhat is desiredxe2x80x9d that a user expresses in the database interface language SQL into the xe2x80x9chow to accomplishxe2x80x9d specifics entailed in a database access plan.
In order to more efficiently access (especially relational) databases, indexes are typically utilized. A given database may have multiple indexes. Each index contains essentially a pair consisting of a key and a link to a database row or record. The pairs are then effectively organized in sorted key order. One well known method of implementing indexes is the use of B-Trees. A database row or record can then be rapidly accessed utilizing the row or record xe2x80x9ckeyxe2x80x9d to identify the corresponding row or record. In the case of a relational database, the index xe2x80x9ckeysxe2x80x9d are typically the column values for one or more columns for each row or record in the database. In the case of a multiple column index, the index xe2x80x9ckeyxe2x80x9d can be viewed as the corresponding column values for each row concatenated together.
Indexes are also often utilized when processing database queries. A database optimizer is typically utilized to transform a SQL database query into an optimal set of database operations. One typical optimization is to utilize indexes to a database in order to minimize the necessity of reading entire database rows. That is because the requisite column values for the rows have already been extracted from the rows during index creation or update before the query is processed. Thus, it is often very efficient to eliminate all rows or records from a query that don""t match search key values in an index or indexes before any actual rows are retrieved for processing. Thus, database optimizers often group query components that reference indexed columns together. The following disclosure is primarily concerned with this situation.
A typical SQL query against a database has the form shown in the following example:
SELECT FIRST_NAME, LAST_NAME, PHONE_NUMBER FROM TELEPHONE_TABLE WHERE LAST_NAME=xe2x80x9cGRAYxe2x80x9d
This query returns the names and phone numbers of people listed in the TELEPHONE TABLE that have a last name of xe2x80x9cGRAYxe2x80x9d.
A good access plan for this simple example would be accessing the table via an index on the LAST_NAME column. An index search using the value xe2x80x9cGRAYxe2x80x9d is a fast method of returning the correct answers. Refer to xe2x80x9cAn Introduction to Database Systemsxe2x80x9d by C. J. Date for further background on relational databases and optimizers.
In current commercial relational database optimizers, there are deficiencies in producing the best access plan in all cases. In particular, when a multi-column index is defined, and a range of values is needed to satisfy the query, then the index may not be used in the most efficient manner.
This deficiency is the result of an xe2x80x9cimpedance mismatchxe2x80x9d between the need for SQL to specify search conditions on column values and the methods needed to specify range conditions on multi-column indexes. This is because SQL by its design goals has to be independent of the current index definitions on the database tables (i.e. not making use of index names). For further information, refer to explanations of the ANSI SQL standard such as xe2x80x9cSQL Instant Referencexe2x80x9d by M. Gruber or xe2x80x9cUnderstanding the New SQL: A Complete Guidexe2x80x9d by J. Melton and A. Simon. This use of the term xe2x80x9cimpedance mismatchxe2x80x9d is not to be confused with that of using SQL embedded in programming languages such as COBOL. When used in that context, the term refers to SQL""s set oriented nature as opposed to the row at a time capability of the language/database interface and its use of cursors.
It would be advantageous to be able to efficiently translate SQL queries into a proper access plan when there is a set of conditions on the columns of a multi-column index that represent a range of values on that index. This would provide significantly faster processing of multi-column indexes by selecting only qualifying key values from the index where there is a set of conditions on the columns of a multi-column index that represent a range of values on that index.
For example, assume that a table DATE_TABLE has an index defined on the following three columns whose combined value represents a date:
YEAR, MONTH, DAY
Suppose the result of a query is to be those rows from the table that have dates in the following range:
March 15, 1994 to June 23, 1996
Assume that the table""s MONTH values are encoded as 1 to 12 for January through December. The SQL required to express this range of index values can be given in positive (inclusive) form as follows:
FIG. 1 is a diagram illustrating the logical operation of that SQL statement in positive (inclusive) form. The FIG. consists of single and double headed time line segments. The double-headed line segments represent time intervals bounded at both ends, whereas the single-headed line segments represent time intervals bounded at one end, but not the other. Table T-1 contains the correspondence between the above SQL and the reference numbers in the FIG. The final result or selection (March 15, 1994 to June 23, 1996) is shown as 66.
This same multi-column index range condition can also be expressed in a negative (exclusive) form. This is often as recommended by vendors and/or database specialists (reference DB2 Design and Development Guide by G. Wiorkowski and D. Kull, section 9.3 Cursors and Repositioning, especially page 253).
The following is an example of the recommended negative form of the same conditions:
This is equivalent to specifying the total range of 1994 to 1996 and then removing the pieces that don""t fit as specified by the remaining NOT conditions. FIG. 2 is a diagram that illustrates this negative (exclusive) form as marked by the line numbers from the example above on which the conditions occur. The correspondence between the reference numbers in the FIG. and the line numbers in the above negative form SQL is shown in Table T-2. The final result or selection (March 15, 1994 to June 23, 1996) is shown as 86.
One problem with most current relational database systems is that when either form of the index range condition is supplied in the SQL, only the first column of the index is used in the access plan. For the negated form, this means that only the following conditions are used in accessing the index:
YEAR greater than =1994 AND YEAR less than =1996
Problems with performance arise when the data in the table is distributed so that there are a large number of rows that satisfy this condition, but only a few rows satisfy the entire set of search conditions specified in the WHERE clause. In this case, the overall performance of the query is considerably reduced from what the best possible access plan could produce, which is to use all column conditions in the index search.
It would thus be advantageous to be able to efficiently translate SQL queries into a proper access plan when there is a set of conditions on the columns of a multi-column index that represent a range of values on that index. For example, when accessing a range of dates utilizing an index containing YEAR, MONTH, and DAY columns, it would be advantageous to be able to start and stop extraction of rows from the corresponding database based on the corresponding dates. This translation would provide significantly faster processing of multi-column indexes where there is a relationship between the multi-column conditions.