1. Field of the Invention
This invention relates to a database management system for optimizing queries, and more specifically for determining an estimation for a number of qualified rows of a query (i.e., a selectivity value), e.g., for use in determining optimized access paths.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. A DBMS is structured to accept commands to store, retrieve, and delete data. One widely used and well known set of commands is called the Structured Query Language (SQL).
Relational databases are organized into tables which consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple columns. The tables are typically stored on random access storage devices (DASD) such as magnetic or optical disk drives for semi-permanent storage.
Tables are at the center of a relational database engine; and a major goal of the underlying query compiler is to provide a suite of mechanisms so that user data can be easily stored and efficiently manipulated.
An important function of query optimization is estimating the number of qualified rows accurately. This estimation is used to determine an optimized access path by the DBMS to the data. An inherent problem occurs during this estimation process when the query has predicates (local or join) on multiple columns of a table. Most database optimizers make an assumption that there is no relationship between those predicates; and they evaluate the predicates independently when estimating the number of qualified rows. This independence assumption, however, may, in many cases, be incorrect and may cause a very inaccurate qualifying row estimate. The following example illustrates this problem.
SELECT*FROM INVENTORY_TABLE PA1 Number of rows in INVENTORY_TABLE=500,000 PA1 Number of distinct values in CURR_LOCATION=50 PA1 Number of distinct values in ORIG_LOCATION=50 PA1 X1, X2 are the high and low bounds for the X-coordinate of the query PA1 Y1, Y2 are the high and low bounds for the Y-coordinate of the query PA1 Z1, Z2 are the high and low bounds for the Z-coordinate of the query PA1 XA, XB are the high and low bounds for the X-coordinate of the quantile PA1 YA, YB are the high and low bounds for the Y-coordinate of the quantile PA1 ZA, ZB are the high and low bounds for the Z-coordinate of the quantile
WHERE CURR_LOCATION=`LOC1` PA2 . . . etc., for each coordinate (dimension) of the query PA2 . . . etc., for each coordinate (dimension) of the quantile.
AND ORIG_LOCATION=`LOC1`
Selectivity is the percentage of rows that would qualify. Selectivity of "CURR_LOCATION=`LOC1`" is evaluated to be="0.02" (i.e., one out of every fifty rows would qualify). Selectivity of "ORIG_LOCATION=`LOC1`" is evaluated to be="0.02". The number of qualified rows from INVENTORY_TABLE is evaluated to be: EQU (0.02*0.02)*500000=200
In the previous example the optimizer assumed that there was no relationship between the two predicates in the query and evaluated them as if they were independent. In this case, however, assume that most items in the INVENTORY_TABLE have the same value for both the CURR_LOCATION and ORIG_LOCATION columns. In this case the actual number of rows that qualified would be much larger than the estimated number.
One type of statistic that takes into consideration multiple columns is called "FULLKEYCARD". A "FULLKEYCARD" concatenates all of the columns in an index for indicating the number of distinct key values in an index. It has been previously used by DB2/MVS and other RDBMS vendors. While FULLKEYCARD does help with column correlation in limited circumstances, it has its limitations. For example, FULLKEYCARD can only be used for an index that has the same set of predicates that are in the query. Since it cannot be applied to non-indexed columns, it is not very well suited to solve the problem of column correlation discussed above.