In database systems query execution plans are used to provide a preferred method for executing a particular query on the database system. To provide the query execution plan, the cardinality of the key, or group of columns, in the query is determined. The cardinality of a group of columns is the number of distinct values present in that group of columns at the time of query execution. The cardinality can thus be used to determine the number of rows that will be used in executing the query. Consequently, the cost of each query execution plan can be tallied. Based on the cost, a query execution plan can be selected. Thus, use of the cardinality allows for better selection of query execution plans.
FIG. 1 depicts a conventionally method 10 for estimating the cardinality for a group of columns, such as a key for a query. Statistics for single columns are obtained, via step 12. Typically, step 12 includes obtaining the cardinality for each individual column. The statistics for single columns are used in estimating the cardinality for the group of columns assuming that the columns are independent, via step 14. Typically, step 14 is accomplished by multiplying the individual cardinalities of the columns. The cardinality obtained in step 14 is an upper bound for the cardinality of the group of columns. Once the cardinality for the column group has been obtained, the cardinality can be used, for example in obtaining a query execution plan.
Although the conventional method 10 functions, one of ordinary skill in the art will readily recognize that the cardinality obtained in step 14 may result in a sub-optimal query execution plan. For example, the cardinality obtained in step 14 may be particularly poor after operations, such as GROUP BY or DISTINCT operations, are performed. When table(s) are first provided, columns in the tables may be statistically correlated. Operations such as grouping or duplicate removal operations may be performed on certain columns in the tables. A grouping operation takes rows having the same value for the columns in the grouping key and provides a single row from these rows. A duplicate removal operation may be considered a special case of the grouping operation and keeps only one row of a group of identical rows in the key. After such operations are performed, the conventional method 10 provides a cardinality that may differ significantly from the actual cardinality of the columns in the key because method 10 does not account for statistical correlation between columns. As a result, operations that use the cardinality may not perform as desired. For example, a query execution plan determined using the cardinality estimated in a conventional manner may have costs that are significantly greater than an optimal query execution plan.
Accordingly, what is needed is a method and system for improving the estimation of the cardinality of a group of columns, particularly after operations such as a grouping and/or duplicate removal operation has been performed. The present invention addresses such a need.