The present invention relates generally to the field of database systems. More specifically, the present invention is related to a method and system for estimating the size of a joined table.
Relational databases systems allow a database user to enter queries into the database and return the data that meets the conditions of the query. The data present within the database system is in one or more tables or relations. Each relation consists of a number of records or tuples containing specific information possibly grouped in some ordered sequence. Each tuple consists of one or more fields called attributes. In any single attribute of a tuple there can be only a single value. However, each tuple can have a different value for the same attribute.
Some characteristics of a database relation are typically maintained. For example, the database system may maintain the cardinality, the density and the number of distinct values of each relation. Cardinality is the number of tuples or records in a table, especially the number of rows of a table. The number of distinct values is the number of distinct values of a given attribute or set of attributes. The density is the average number of tuples per distinct value.
One operation performed by a database system is known as a join operation. A join operation is used to combine related tuples from two relations into single tuples. Typically, the join operation matches tuples from two relations on the values taken from each relation on a common attribute and creates a joined table or relation. If another relation needs to be joined with the result relation, the result relation may be referred to as an intermediate relation because it is created in the process of generating a result relation.
A query entered into a relational database system may result in multiple operations being performed. In many cases, the operations of the query can be performed in several different orders without changing the result of the query. Each possible order of operations is referred to as a query execution plan. There may be several alternative query execution plans, each specifying a set of operations to be executed by the database system. Each different query execution plan will have a different resource usage, i.e. processing time, memory usage etc.
Relational database systems typically include a component called a query optimizer. The query optimizer may identify several query execution plans, estimate the cost of each different query execution plan, and select the plan with the lowest estimated resource usage for execution. Query plans generated for a query will differ in their resource usage of obtaining the desired data. The query optimizer evaluates these resource usage estimates for each query plan in order to determine which plan is likely to have the lowest resource usage.
The join operation can be quite expensive, since joining together two or more entire relations can result in a very large relation. When multiple joins are present in a query, the resource usage of a bad execution plan may increase dramatically. It is important for the query optimizer to identify a query execution plan that minimizes resource usage. The join ordering chosen by the query optimizer is often a key factor in the ultimate resource usage of the query execution plan.
Therefore, it would be advantageous to have a mechanism for estimation the cardinality of the joined table in an accurate way to enable the query optimizer to select a query execution plan which is optimal in view of resource usage.