Computer systems employ a relational database management system or a RDBMS which is a computer program that manages data storage and retrieval. Many RDBMS are being widely used because they permit the database users to enter queries into the database in order to extract requested data.
The data is present within the database system in one or more tables or relations. Each relation consists of a number of records or tuples containing specific information grouped in some sequence. Each tuple consists of one or more fields which are called attributes. In any single attribute of a tuple or information there can be only a single value, however, different tuples can have different values for the same attribute. As an example, a relation called EMPLOYEE contains information about employees. The relation consists of a number of tuples where each tuple corresponds to one employee. For instance, the tuple can have the attributes (Emp.sub.-- Name), (Age), and (Dept.sub.-- No) representing the name of the employee, the age of the employee, and the department the employee works for.
Two kinds of statistics are typically important to proper management by the RDBMS of the database's stored relations. One is the number of tuples contained in a single relation. This value is known as the relation's cardinality and is denoted by .parallel.R.parallel., wherein R is the relation. The second is the number of distinct values taken by an attribute denoted by d.sub.a, wherein (a) is the attribute. The value of the relation's cardinality, .parallel.R.parallel., is important because it indicates the overall size of the relation. The number of distinct values, d.sub.a, is important because it is used to determine the size of the results from different operations on the relation data.
One database query language is known as SQL. With such a query language the users do not have to be aware of the physical locations of the data or the methods used to access the data. To the database user, all the information that is stored within the database is accessible by a query which is a request by the user for some database information. The database user enters in an SQL command and the RDBMS decides how the data should be retrieved and initiates the data retrieval.
In order for the RDBMS to be effective as a database information manager, the RDBMS uses several operations to enhance the obtaining of the data desired by the user in the SQL command. One of the most important operations relied upon by the database management system is the join operation. We use the symbol U to denote the join operation. A join operation is used to match tuples from two relations on the values taken from each relation on one common attribute. For example, let one relation be the entire DEPARTMENT relation having, for instance, attributes called (Dept.sub.-- Name) and (Dept.sub.-- No). Let another relation be the EMPLOYEE relation having attributes (Emp.sub.-- Name) and (Dept.sub.-- No). In order to obtain for each employee the name of the department that the employee works for, the RDBMS must join the relations EMPLOYEE and DEPARTMENT on the common attribute (Dept.sub.-- No). The join is specified by a join predicate which describes the attributes to be matched. In this example, the join predicate can be expressed as (EMPLOYEE.Dept.sub.-- No=DEPARTMENT.Dept.sub.-- No). The resulting effect of the join operation is that for each tuple in the EMPLOYEE relation the RDMBS locates the tuple in the DEPARTMENT relation with the same value in the attribute (Dept.sub.-- No). After this is complete, the RDBMS can return for each employee the name of the department, i.e., (Dept.sub.-- Name) that the employee works for.
Another important operation performed by the RDBMS in response to an SQL query command is the selection operation which is used to locate and select only those particular tuples matching a certain criterion or condition obtained from one relation. For instance, user may choose to retrieve all employees who are older than age 50 years, and this operation is described by a local predicate expressed as (EMPLOYEE.Age &gt;50).
A single query entered into an RDBMS may result in multiple operations being performed in order to conduct a more complex query in the shortest amount of time. Multiple local predicates on one or more relations can be used together with multiple join operations in the same query. For example, the query may combine the selection and join operations in order to retrieve the results for all employees who are older than 50 years together with all the names of the departments for which these employees work.
When the selection or join operation is actually performed, the local or join predicate is said to be evaluated by the RDBMS. For the relation involved in a join predicate, the specific attribute that is present in the join predicate is called the join predicate's attribute or join attribute of the relation.
If a local predicate is present, the local predicate may serve to reduce the cardinality of the relation. As an example, given the full EMPLOYEE relation and the local predicate (EMPLOYEE.Age &gt;50), performing the selection operation involving the local predicate serves to extract only those employees whose ages are greater than 50. The number of employees in this result is clearly smaller than the number of employees in the full EMPLOYEE relation. This reduced cardinality is called the effective cardinality of the relation. It can be used in other cardinality calculations, e.g., for estimating join result sizes. The local predicates can change the effective cardinality of the relation and the number of distinct values in the attributes of the relation thus changing the size of the final join result. What is also needed in the art is an algorithm for taking the effect of local predicates into account.
When a relation is joined with another relation, it creates a resulting relation. If some other relations need to be joined with the first result relation, the result relation is often referred to as an intermediate relation because it is created in the process of generating a final result relation. An intermediate relation I may be the result of joining several relations. These joined relations are said to be present in the intermediate relation I.
A relation and its attributes may participate in several join predicates. In a particular ordering of the join operations in a query when a relation R is joined with another (possibly intermediate) relation I, not all of the join predicates can be evaluated at any one time. Only those join predicates linking attributes in relation R with the corresponding attributes in a second relation S which is present in relation I, can be evaluated. These particular join predicates are termed eligible join predicates because only these join predicates can be evaluated at this point in time.
A database system user retrieves data from the database by entering requests or queries into the database. The RDBMS interprets the user's query and then determines how best to go about retrieving the requested data. In order to achieve this, the RDBMS has a component called the query optimizer. The RDBMS uses a query optimizer to analyze how to best conduct the user's query of the database; optimum speed in accessing being the underlying motivation. The query optimizer takes the query and generates what is called a query plan. The query plan is a manipulation or rearrangement of the user's SQL commands, in some instances eliminating redundancies, in order to simplify the query prior to processing. There may be several alternative query plans generated by the query optimizer, each specifying a set of operations to be executed by the RDBMS. The many query plans generated for a single query ultimately differ in their total cost of obtaining the desired data. The query optimizer then evaluates these cost estimates for each query plan in order to determine which plan has the lowest execution cost. In order to determine a query plan with the lowest execution cost, the query optimizer uses specific combinations of operations to collect and retrieve the desired data. When a query plan is finally selected and executed, the data requested by the user is retrieved according to that specific query plan however manipulated or rearranged.
One of the most expensive operations to conduct is the join operation which joins together two or more entire relations which can be fairly large. When multiple joins are present in a user generated query, as is typically the case the cost increases dramatically. The query optimizer has to determine the execution order in which the relations will eventually be joined so as to minimize cost. The join ordering chosen by the query optimizer is important for the determination of the lowest cost query plan.
In order to choose between different join orderings, the query optimizer needs to estimate the eventual result size or cardinality of the two joined relations. In other words, the result size of the join operation R.sub.1 .orgate.R.sub.2 .orgate.R.sub.3 depends on the .parallel.R.sub.1 .parallel., .parallel.R.sub.2 .parallel., .parallel.R.sub.3 .parallel., .parallel.R.sub.1 .orgate.R.sub.2 .parallel., .parallel.R.sub.2 .orgate.R.sub.3 .parallel., and .parallel.R.sub.1 .orgate.R.sub.3 .parallel.. Note that the join of R.sub.1 .orgate.R.sub.2 produces an intermediate result the size of which is going to depend on .parallel.R.sub.1 .parallel. and .parallel.R.sub.2 .parallel..
When intermediate relations are involved, the query optimizer estimates the final join result sizes incrementally, i.e., first determine the size of joining the first two relations, then determine the size of the join with a third relation, and so on. In such a manner, the query optimizer incrementally estimates the final join result size from the result sizes of the intermediate relations produced.
Another situation in the incremental calculation of join result sizes arises when two or more of the eligible join predicates involve join attributes of the new relation having the same equivalence class. Two attributes are considered to belong to the same equivalence class when they are linked by a join predicate. For example, consider a relation R.sub.1 joining with relation R.sub.2 with the following join predicates: (R.sub.1.a=R.sub.2.b), (R.sub.1.c=R.sub.2.d) and (R.sub.1.a=R.sub.1.c). When R.sub.1 is joined with R.sub.2, the first two join predicates are eligible. They involve attributes (a) and (c) of R.sub.1 that belong to the same equivalence class by virtue of the join predicate (R.sub.1.a=R.sub.1.c).
Current query optimizers do not treat this as a special case, and therefore by default, the join selectivities of all such join predicates are used for calculation of the join result sizes which produces an incorrect result because the join attributes are not independent. What is needed in the art is a join result size method which handles the case wherein two or more of the eligible join predicates involve join attributes of the new relation having the same equivalence class.
Next, consider a join predicate (J:(R.sub.1.a=R.sub.2.b)), where R.sub.1 is a relation having an attribute (a) and R.sub.2 is a relation having an attribute (b). For each join predicate, e.g., (J:(R.sub.1.a=R.sub.2.b)), another quantity called the selectivity needs to be estimated by the query optimizer. This selectivity is a value which is a fractional number between 0 and 1. In the example, the selectivity of join predicate J, which is denoted by S.sub.J, is given typically as S.sub.J =1/max(d.sub.a, d.sub.b). The size or the result of joining of R.sub.1 and R.sub.2 is given by .parallel.R.sub.1 .parallel..times..parallel.R.sub.2 .parallel..times.S.sub.J.
Initially, each join attribute is given an equivalence class and two join attributes are given the same equivalence class if they are linked by the same join predicate. Two linked join predicates can then be transitively closed if they share a common join attribute in order to obtain another join predicate. Thus, given join predicates (J1:(R.sub.1.x=R.sub.2.y) AND J2:(R.sub.2.y=R.sub.3.z)), one can obtain the join predicate (J3:(R.sub.1.x =R.sub.3.z)) using the principle of transitivity. Since join attributes (x) and (y) are given the same equivalence class and since join attributes (y) and (z) are given the same equivalence class, it follows that attributes (x), (y) and (z) must be given the same equivalence class. This is important because when attributes are in the same equivalence class, the effects of joining them are no longer independent. As an example, once join predicates J1 and J2 have been evaluated, J3 has been evaluated effectively and hence evaluating it can have no further effect on the size of the join result. This indicates the importance of carefully choosing the join selectivities to multiply.
Given a series of relations that are to be joined, as discussed the join result sizes are calculated incrementally in the art. Join estimations in the art use either a multiplicative approach or a single selectivity approach, both of which produce incorrect join result sizes in certain cases.
In the multiplicative prior art approach, the selectivities of all the eligible join predicates are multiplied together along with the product of the individual relation cardinalities. The following example illustrates that this approach produces incorrect results in certain cases because the selectivities are not independent.
For example, let the following be an SQL statement entered as a query into a database:
SELECT R.sub.1.u PA1 FROM R.sub.1, R.sub.2, R.sub.3 PA1 WHERE (R.sub.1.x=R.sub.2.y) AND (R.sub.2.y=R.sub.3.z) AND (R.sub.1.x=R.sub.3.z)
The join predicates here are (R.sub.1.x=R.sub.2.y), (R.sub.2.y=R.sub.3.z) and (R.sub.1.x=R.sub.3.z). Let the statistics for these relations in the database be as follows: EQU .parallel.R.sub.1 .parallel.=100, d.sub.x =10 EQU .parallel.R.sub.2 .parallel.=1000, d.sub.y =100 EQU .parallel.R.sub.3 .parallel.=1000, d.sub.z =1000 EQU Ja:(R.sub.1.x=R.sub.2.y), S.sub.Ja =1/max(d.sub.x, d.sub.y)=1/max(10,100)=0.01 EQU Jb:(R.sub.2.y=R.sub.3.z), S.sub.Jb =1/max(d.sub.y, d.sub.z)=1/max(100,1000)=0.001 EQU Jc:(R.sub.1.x=R.sub.3.z), S.sub.Jc =1/max(d.sub.z, d.sub.z)=1/max(10,1000)=0.001
Using the semantics of the join operation with the selectivities above denoted by S.sub.Ja, S.sub.Jb, and S.sub.Jc, the correct result size for .parallel.R.sub.1 .orgate.R.sub.2 .orgate.R.sub.3 .parallel. is known to be 1000. Now, let R.sub.2 be first joined with R.sub.3, i.e., let the intermediate result relation I be R.sub.2 .orgate.R.sub.3, then: EQU .parallel.R.sub.2 .orgate.R.sub.3 .parallel.=.parallel.R.sub.2 .parallel..times..parallel.R.sub.3 .parallel..times.S.sub.Jb EQU .parallel.R.sub.2 .orgate.R.sub.3 .parallel.=1000.times.1000.times.0.001 EQU .parallel.R.sub.2 .orgate.R.sub.3 .parallel.=1000
In the join of R.sub.1, the multiplicative approach produces the join result size as follows: EQU .parallel.R.sub.2 .orgate.R.sub.3 .orgate.R.sub.1 .parallel.=.parallel.I.parallel..times..parallel.R.sub.1 .parallel..times.S.sub.Ja .times.S.sub.Jc EQU .parallel.R.sub.2 .orgate.R.sub.3 .orgate.R.sub.1 .parallel.=1000.times.100.times.0.01.times.0.001 EQU .parallel.R.sub.2 .orgate.R.sub.3 .orgate.R.sub.1 .parallel.=1 (incorrect)
As shown above, the multiplicative approach produced an incorrect join result size of 1 in the simple example above because of the selectivities being factored in without change.
Another approach in the prior art in producing join result sizes is to make a single selection from the set of join selectivities available in order to calculate the join result sizes by dividing the eligible join predicates into groups. Then, choosing for each group a single join predicate and its particular join selectivity value to calculate the intermediate and final join result sizes. Such an approach chooses one join selectivity from each group, wherein the groups of join attributes are associated by equivalence classes by the RDBMS.
The single selectivity approach is typically used to pick the smallest join selectivity in each group among all the available join selectivities. This can produce incorrect results in certain cases. In the above example, once R.sub.2 and R.sub.3 have been joined, R.sub.1 can be joined to R.sub.2 .orgate.R.sub.3 using the join predicates Ja:(R.sub.1.x=R.sub.2.y) and Jc:(R.sub.1.x=R.sub.3.z). These predicates involve the same join attribute (x) of relation R.sub.1. In the single selectivity approach, only one of the join predicate selectivities is used, in particular, the smaller of the join selectivities. Since selectivity (S.sub.Jc &lt;S.sub.Ja), the single selectivity approach, for the identical problem, produces the join result size as follows: EQU .parallel.R.sub.2 .orgate.R.sub.3 .orgate.R.sub.1 .parallel.=.parallel.I.parallel..times..parallel.R.sub.1 .parallel..times.S.sub.Jc EQU .parallel.R.sub.2 .orgate.R.sub.3 .orgate.R.sub.1 .parallel.=1000.times.100.times.0.001 EQU .parallel.R.sub.2 .orgate.R.sub.3 .orgate.R.sub.1 .parallel.=100 (incorrect )
As shown above, the single selectivity approach produced an incorrect join result size of 100 in this example simply because the wrong join selectivity was chosen.
Therefore, what is also needed in the art is a method for correctly choosing the join selectivities for query optimization so as to produce correct join result sizes.