Relational database systems allow a database user to submit queries to the database and receive the data that meets the conditions of the query. The data within the database system is stored in one or more tables, or relations. Each relation consists of a number of records, or tuples, containing specific information, possibly grouped in some ordered sequence. Each tuple consists of one or more fields called attributes. Any single attribute of a tuple can hold only a single value. However, different tuples can have different values for the same attribute.
Some characteristics of a database relation are typically maintained. For example, the database system may maintain the cardinality of each relation, along with the density and the number of distinct values of its attributes. Cardinality is the number of tuples or records in a relation. The number of distinct values is the count of different values that occur in a given attribute or set of attributes. The density is the average number of tuples per distinct value.
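As a minimal sketch of these statistics, assume a relation held in memory as a list of Python dictionaries. The relation, the attribute names and the helper functions below are hypothetical and serve only to illustrate the definitions above.

```python
def cardinality(relation):
    """Number of tuples in the relation."""
    return len(relation)

def distinct_values(relation, attrs):
    """Number of distinct value combinations over the given attribute(s)."""
    return len({tuple(row[a] for a in attrs) for row in relation})

def density(relation, attrs):
    """Average number of tuples per distinct value (assumes a non-empty relation)."""
    return cardinality(relation) / distinct_values(relation, attrs)

# Hypothetical example relation with two attributes.
orders = [
    {"customer": "ann", "item": "pen"},
    {"customer": "ann", "item": "ink"},
    {"customer": "bob", "item": "pen"},
    {"customer": "bob", "item": "pen"},
]

print(cardinality(orders))                    # 4
print(distinct_values(orders, ["customer"]))  # 2
print(density(orders, ["customer"]))          # 2.0
```

Passing more than one attribute name, for example `["customer", "item"]`, gives the multi-column counterpart of the same statistics.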
One operation performed by a database system is known as a join operation. A join operation is used to combine related tuples from two relations into single tuples. Typically, the join operation matches tuples from the two relations on the values of a common attribute and creates a joined table or relation. If another relation needs to be joined with this result, the joined relation may be referred to as an intermediate relation because it is created in the process of generating the final result relation.
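The join described above can be sketched as follows. This is an illustration only: the function `equi_join`, the example relations and the `r.`/`s.` attribute prefixes are hypothetical, and the hash-based lookup is just one of several ways a join can be carried out.

```python
def equi_join(r, s, a, b):
    """Combine each tuple of r with each tuple of s where r[a] == s[b]."""
    # Index the tuples of s by the value of attribute b.
    index = {}
    for s_row in s:
        index.setdefault(s_row[b], []).append(s_row)

    joined = []
    for r_row in r:
        for s_row in index.get(r_row[a], []):
            # Prefix attribute names so same-named attributes do not collide.
            merged = {f"r.{k}": v for k, v in r_row.items()}
            merged.update({f"s.{k}": v for k, v in s_row.items()})
            joined.append(merged)
    return joined

# Hypothetical relations sharing a department number.
employees = [
    {"name": "ann", "dept": 1},
    {"name": "bob", "dept": 2},
    {"name": "cy", "dept": 1},
]
departments = [
    {"id": 1, "title": "sales"},
    {"id": 3, "title": "ops"},
]

result = equi_join(employees, departments, "dept", "id")
print(len(result))  # 2
```

Only tuples with a matching value on the join attribute appear in the result, so `bob` (department 2) and the `ops` department contribute nothing.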
Another operation performed by a database system is the select operation. The select operation is used to select a subset of tuples from a relation that satisfy a selection condition. One can consider the select operation to be a filter that keeps only those tuples that satisfy a qualifying condition.
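A select operation can be sketched as a simple filter. The function name `select`, the example relation and the condition are hypothetical and chosen only for illustration.

```python
def select(relation, condition):
    """Keep only the tuples for which the selection condition holds."""
    return [row for row in relation if condition(row)]

# Hypothetical relation of parts with a price attribute.
parts = [
    {"part": "bolt", "price": 2},
    {"part": "gear", "price": 15},
    {"part": "axle", "price": 40},
]

# Selection condition: price greater than 10.
expensive = select(parts, lambda row: row["price"] > 10)
print([row["part"] for row in expensive])  # ['gear', 'axle']
```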
A query entered into a relational database system may result in multiple operations being performed. For example, selection operations on one or more relations can be used together with multiple join operations in the same query. In many cases, the operations of the query can be performed in several different orders without changing the result of the query. Each possible order of operations is referred to as a query execution plan. There may be several alternative query execution plans, each specifying a set of operations to be executed by the database system. Each different query execution plan will have a different “cost.” The costs may include the cost of accessing secondary storage, computation cost, memory usage cost, and communication cost.
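The following sketch illustrates how two orders of the same operations can return the same result while producing intermediate relations of very different sizes. The relations, attribute names and helper functions are hypothetical, and the size of the intermediate relation is used here only as a rough stand-in for cost.

```python
def select(relation, condition):
    return [row for row in relation if condition(row)]

def join(r, s, a, b):
    # Nested-loop equi-join; r and s are assumed to have distinct attribute names.
    return [{**x, **y} for x in r for y in s if x[a] == y[b]]

# Hypothetical data: 100 customers, 10 of them in the "east" region,
# and 1000 orders spread evenly over the customers.
customers = [{"cust_id": i, "region": "east" if i % 10 == 0 else "west"}
             for i in range(100)]
orders = [{"order_id": j, "cust": j % 100} for j in range(1000)]

def is_east(row):
    return row["region"] == "east"

# Plan A: join first, then select.
intermediate_a = join(orders, customers, "cust", "cust_id")
result_a = select(intermediate_a, is_east)

# Plan B: select first, then join.
intermediate_b = select(customers, is_east)
result_b = join(orders, intermediate_b, "cust", "cust_id")

print(len(intermediate_a), len(intermediate_b))  # 1000 10
print(result_a == result_b)                      # True
```

Both plans return the same 100 tuples, but Plan A materializes 1000 intermediate tuples while Plan B materializes only 10.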
Relational database systems typically include a component called a query optimizer. The query optimizer may identify several query execution plans, estimate the cost of each different query execution plan, and select the plan with the lowest estimated cost for execution. Because the query plans generated for a query differ in their total cost of obtaining the desired data, the query optimizer compares the cost estimates of the plans to determine which plan is likely to have the lowest execution cost.
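A minimal sketch of this selection step is shown below. The `Plan` class, the plan names and all cost figures are hypothetical, and summing the cost components with equal weight is an assumption made for illustration, not a description of any particular optimizer.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    io_cost: float             # cost of accessing secondary storage
    cpu_cost: float            # computation cost
    memory_cost: float         # memory usage cost
    communication_cost: float  # communication cost

    def estimated_cost(self):
        # Assumption: the total is a plain sum of the components.
        return (self.io_cost + self.cpu_cost
                + self.memory_cost + self.communication_cost)

def choose_plan(plans):
    """Return the plan with the lowest estimated total cost."""
    return min(plans, key=lambda p: p.estimated_cost())

# Hypothetical candidate plans with made-up cost estimates.
candidates = [
    Plan("join-then-select", 120.0, 30.0, 10.0, 0.0),
    Plan("select-then-join", 40.0, 50.0, 20.0, 5.0),
    Plan("full-scan", 200.0, 10.0, 5.0, 0.0),
]

best = choose_plan(candidates)
print(best.name, best.estimated_cost())  # select-then-join 115.0
```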
The join operation can be quite expensive, since joining together two or more entire relations can result in a very large relation. When multiple joins are present in a query, the cost of a bad execution plan may increase dramatically. It is important for the query optimizer to identify a query execution plan that minimizes cost. The join ordering chosen by the query optimizer is often a key factor in the ultimate cost of the query execution plan.
In view of these considerations, query optimizers often use estimates of the cardinality of joins in attempting to select the most efficient query plan. It may be desirable to estimate the cardinality of the join of a table R with a table S where attribute a of table R is equal to attribute b of table S, denoted R ⋈_{R.a=S.b} S. One prior art method of estimating the cardinality of such a join relied on what is referred to as the “containment assumption.” According to the containment assumption, each group of distinct valued tuples belonging to the relation with the minimal number of distinct values joins with some group of distinct valued tuples in the other table. The containment assumption has been extended to estimate the cardinality of a multi-predicate join. One problem with an existing method of estimating the cardinality of a multi-predicate join using the containment assumption is that cardinality is often overestimated.
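For a single join predicate, the containment assumption leads to a simple estimate, sketched below. The function name, parameters and example figures are hypothetical. The sketch also assumes that tuples are spread evenly over the distinct values of the join attribute, so that each distinct value is represented by the density of its relation.

```python
def estimate_join_cardinality(card_r, distinct_r, card_s, distinct_s):
    """Estimate the size of an equi-join of R and S on one attribute pair.

    card_*     : cardinality of the relation
    distinct_* : number of distinct values of the join attribute
    """
    density_r = card_r / distinct_r  # average tuples per distinct value in R
    density_s = card_s / distinct_s  # average tuples per distinct value in S
    # Containment assumption: every distinct value of the relation with
    # fewer distinct values finds a match in the other relation.
    matching_values = min(distinct_r, distinct_s)
    # Each matching value contributes density_r * density_s joined tuples.
    return matching_values * density_r * density_s

# Hypothetical statistics: R has 1000 tuples over 100 distinct values,
# S has 500 tuples over 50 distinct values.
print(estimate_join_cardinality(1000, 100, 500, 50))  # 5000.0
```

The result equals the product of the two cardinalities divided by the larger of the two distinct-value counts, here 1000 × 500 / 100.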
There is a need for a method of estimating cardinality using multi-column density values and additionally using coarser density values of a subset of the multi-column density attributes to reduce the overestimation problem.