Database systems often maintain statistics on the objects accessed by database queries. Such statistics may be used by a query optimizer to determine cost estimates for alternative query plans before executing a query. Generally, optimizer statistics fall into one of two categories: table statistics or column statistics. Table statistics are used by the optimizer to estimate the cost of table accesses. Example table statistics may include information on the number of rows within a table, the number of data blocks used for the table, and the average row length of a table. Column statistics are used by the optimizer to determine the distribution of values within a column. Example column statistics may include information on the number of distinct values in a column, the minimum and maximum values in the column, and histograms of column values.
Table and column statistics allow the optimizer to accurately estimate the selectivity and cardinality of simple predicates. If there is a uniform data distribution, then the cardinality for an individual predicate referencing a single column may be computed by dividing a first value representing the number of rows in the table to which the column belongs by a second value representing the number of distinct values for the column. This technique, while accurate in the case of simple predicates that involve a single column, is less effective for estimating the cardinality of complex predicates, which may comprise a conjunction of two or more predicates and involve multiple columns.
Estimating the cardinality and selectivity of complex predicates is complicated due to relationships and correlations that often exist between data stored in different columns. As an example, a particular value in a “State” column may only be associated with a single value in a “Country” column. If both values are used in a where clause to filter database records, then the optimizer may incorrectly assume that the number of distinct values in each column will independently affect the cardinality (i.e., the number of rows returned) of the predicate. Because of the relationship between the data, however, both columns will not reduce the number of rows returned for the predicate.
In order to improve the accuracy of cost estimates for complex predicates, one approach involves maintaining multi-column statistics for groups of correlated columns. Once a column group has been defined, the statistics on that column group are collected along with the table statistics and single-column statistics. The statistics may include the number of distinct values and histograms for the column group as a whole. This approach allows for more accurate cost estimates when multiple columns are used together in a query and the columns are correlated. However, maintaining extended statistics on column groups generally requires a database administrator or other user to define the column groups before the optimizer begins collecting multi-column statistics. The relationships between columns may not always be readily apparent to an administrator. Further, it may be difficult to anticipate the types of predicates that will be received by the database server. If multi-column groups are defined on column groups that are rarely used together in queries, then the optimizer may consume valuable resources tracking multi-column statistics that are rarely used.
The cost estimates produced by the optimizer based on the statistics that are maintained by conventional database systems result in cost estimates that are not perfectly accurate. Thus, it would be useful to maintain additional statistics that would further improve the accuracy of cost estimates made by the query optimizer. It is further desirable if such statistics may be used in additional ways to improve query performance.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.