A relational database management system provides the interface between a relational database and the users of the database system. Requests for access to the database, including adding or removing data and performing operations on the data, are handled by the relational database management system. The relational database management system includes an optimizer component whose purpose is to efficiently implement user requests for manipulation of data. The optimizer evaluates a user request and attempts to transform it into a more efficient expression of operations based upon an assessment of the cost of performing the operations. The optimizer may develop a set of alternative query plans each implementing the user request with a slightly different set of operational steps or a different ordering of the same set of operational steps. The optimizer must then evaluate the cost of each query plan and select the most efficient plan for fulfilling the user request. One of the principal assessments made by an optimizer in determining cost is the number of rows that will result from an operation. The number of rows at an intermediate stage of a query plan affects the amount of work involved in performing the remainder of the plan. If, for instance, after the first of two operational steps the number of rows of data is reduced from 100 to 3, the second step need only involve the 3 remaining rows. On the other hand, if the first step results in 98 rows of data, the second step involves 98 rows and, therefore, results in a more costly query operation. Accordingly, the estimation of the number of rows that will be involved in, and will result from, each operational step in the query plan is important in accurately estimating the cost of a query plan. The estimation of the number of distinct rows in a column is called “cardinality estimation”.
When a complex query involves multi-column predictates, it is difficult to estimate cardinality without having some knowledge about the relationship between the columns. For example, if a table contained a column for start date and a column for end date, a query may be made to find the number of projects active on a given day. The predicate that could be used is “WHERE START_DATE<=‘1995-08-30’ AND END_DATE>=‘1995-08-30’”. One technique used is to treat the columns independently and multiply the selectivities of the individual predicates. If it were known that 1995-08-30 was approximately in the middle of the column of values that spanned 10 years of data, each predicate would have a selectivity of about 50%. In other words, it would be true for half of the rows that the START_DATE is less than or equal to 1995-08-30 and it would be true for half of the rows that the END_DATE is greater than 1995-08-30. This would lead to a combined estimate of 25% of all projects started over the ten year period having been active on the date 1995-08-30. If projects rarely lasted more than a month, then this estimate would be far too high. If projects typically lasted five or more years, then perhaps the estimate would be too low. Without knowing more information about the relationship between the START_DATE and the END_DATE, we cannot accurately estimate the cardinality of multiple predicates on different columns.
Similarly, if a single predicate involves multiple columns, such as the predicate WHERE (END_DATE-START_DATE)<=5, the cardinality of this predicate is difficult to assess without statistical information about the relationship between these two columns.
Another situation in which cardinality estimation is difficult is where the predicate involves an expression. For instance, a predicate may wrap a single column in a function, such as “f1(C1)=10”. Even if we know a great deal about the column itself, it would be impossible to estimate the cardinality of this predicate without knowing more about the effect that the function would have upon the column.
Accordingly, it would be advantageous if the query optimizer was provided with statistical information regarding the relationship between columns and statistical information regarding the effect of a function or expression upon a column for use in estimating the cardinality of predicates.
One existing source of relational information regarding the data is integrity constraints. A database may be subject to any number of integrity constraints of arbitrary complexity. An integrity constraint (also called a business rule) describes a condition about the database that must be true. It is a declarative statement that the relational database management system must ensure remains true. When an update operation is performed, the relational database management system checks the integrity constraints to ensure the operation will not violate an integrity constraint. An example of an integrity constraint, in the case of a database containing educational test scores, could be that the test score cannot be less than zero or greater than one hundred.
Imposing integrity constraints upon a database can be costly in itself because the relational database management system must check the integrity constraint with each update of the database. In addition, many potentially useful integrity constraints for a database may not be known. Finally, even if there is a useful statistical expression that characterizes a database, there may be no justification for imposing a constraint that forces that characterization to remain true.
Accordingly, it would be advantageous to have a method of creating and maintaining statistical expressions that contain statistical information on relationships between columns in a database and assist in query optimization without requiring that the expressions remain true or be evaluated with each update to the database. Similarly, it would also be advantageous to have a method of creating and maintaining statistical expressions that contain statistical information regarding the effect of functions or expressions upon columns in a database for the same purpose.