This disclosure relates generally to computer-based mechanisms for optimizing a database query, and more particularly to optimization of outer join query execution plans where all joins are evaluated by semijoins.
In relational databases, a relation represents a set of entities such as customers or products as an association between properties of the entities called attributes. An element of a relation is an ordered set of associated properties called a tuple. A relational database contains tables of data, and each table implements a relation. Each row in the table represents a tuple in the relation. Each column represents a value range of an attribute, such that each cell in the column contains a value, which may be a repeated value or a null value.
A central task in enterprise data processing is to extract from a relational database the set of tuples that meet some given logical condition. Often, the task is to extract tuples that each contain values from more than one table in the database. A table for relation A can include a column for the same attribute as another table for relation B. In such cases, the tables A and B can be joined logically by matching rows via the values of this attribute. To extract a specified set of tuples that include properties from several tables, the respective tables need to be joined.
Search and classification systems, such as the SAP NetWeaver TREX system, perform many tasks, including extraction of tuples that meet given logical conditions. In such cases, the task is called a search request or a database query, and the set of tuples that meet the logical condition specified in the query is called the result set. In order to perform such tasks with the required efficiency, a system like TREX uses highly optimized techniques for joining tables. One such technique is described in more detail below.
A join is a special case of a search condition of a database query characterized by having an expression of the form A.x=B.y, where A and B are two relations within a database, x is an attribute of A and y is an attribute of B.
In many cases, join operations are computationally expensive. For this reason, much research has been devoted to devising efficient algorithms for resolving them. For example, techniques for performing a hash join, a sort merge join, a join via semijoin, and nested loop join are well known to those of skill in the art.
For a join between two relations A and B on the condition A.x=B.y, the following standard notation is used:
Inner join: A-B. The inner join consists of all pairs (a,b) with aεA, bεB such that a.x=b.y. The inner join is commutative: A-B equals B-A up to position swap.
Left outer join: A→B. The left outer join consists of the inner join plus the left exception join. These are the pairs (a,-) for those aεA such that no bεB satisfies a.x=b.y. It is used if the content of A needs to be preserved by the join. Every aεA occurs as a first component either in the inner join or in the left exception join. Left outer joins in general are not commutative.
Right outer join: A←B. The right outer join consists of the inner join plus the right exception join. These are the pairs (-,b) for those bεB such that no aεA satisfies a.x=b.y. The content of B is preserved. The right outer join A←B equals the left outer join B→A up to position swap.
Full outer join: A←→B. The full outer join consists of the inner join plus the left exception join plus the right exception join. It is the set union of A→B and A←B. Full outer join is commutative: A←→B equals B←→A up to position swap.
One tool used for working with join queries is a query graph. The query graph is obtained from the query by generating a vertex for every relation in the join and forming an edge between any two vertices joined by a join condition. The query graph contains a distinguished set of nodes called a target. The target corresponds to those relations where the exact projection of the join needs to be determined. The target comprises those relations having attributes requested in the “select” clause of the query's select statement. Hence it is always non-empty. It is always connected.
Search and classification systems, such as the SAP NetWeaver Text Retrieval and Extraction (TREX) system, typically evaluate joins using only semijoin techniques, which in turn are usually used only for evaluating inner joins. A method of performing a semijoin, and its extension for outer joins, will now be described.
In the semijoin technique, for every relation A in the join and every join J in which A participates, a “tuple list” T(A,J) is created. This list contains “real” pairs (a,a.x), where a is a row identifier within A and x is the attribute of A referenced by J, and “virtual” pairs such as (a,x=−1) or (a=−3,x=−3). Virtual pairs are characterized by a negative second component. Virtual pairs of special importance are the connector pairs. For simplicity, it can be assumed that the two attribute names joined by a join condition are identical, such that the join condition can be written as a.x=b.x, and simply referred to as “the join x”. Correspondingly, T(A,J) can be written as T(A,x).
A semijoin reduction step, used for reducing T(A,x) (where x is the join a.x=b.x between A and B) includes the following steps. At B, the list L of all occurring values b.x is constructed. L is sent to A over a communication line from the host of B to the host of A. We distinguish:
Case 1: There is no arrowhead at B. That is, x is either the inner join
or the right outer join
In this case, all tuple lists at A are reduced. For every real pair (a,a.x) whose attribute value a.x is not present in L, the pairs having the first component a are removed from all tuple lists at A. In every tuple list there is at most one such pair.
Case 2: There is an arrowhead at B. That is, x is either the left outer join
or the full outer join
In this case, T(A,x) is switched. Every real pair (a,a.x) whose attribute value a.x is not present in L is switched to a virtual pair with the same row identifier, such as (a,x=−1).
For brevity, a semijoin reduction step can be referred to as a move.
A relation R influences a tuple list T if R can decrease the number of real pairs in T (by reducing or switching them). If R influences T(A,x), it can be said that R influences A. For any given tuple list T, it is a straightforward topological calculation, performed on the query graph, to determine which relations influence T. As a consequence, for any relation R in the target of the query graph, the set of relations needed to fully reduce R is then obtained as the simple union of those relations needed to fully reduce all of R's tuple lists. The relations needed to fully reduce one tuple list are obtained as a specific subset of the relations influencing it.
Within a sequence of semijoin reduction steps, a single semijoin reduction step reducing a tuple list T is admissible if this move transports at least one relation R to T as a reducer, such that: 1) R influences T; and 2) T has not been reduced by R before. A sequence of semijoin reduction steps is called admissible if each of its steps is admissible.
An important part of the query graph construction is also to determine the effects of all possible semijoin reduction steps in all admissible sequences. These effects in general depend on the types of all joins in the given query. The effects of a move are to reduce the cardinalities of some sets of reducers and of the tuple lists involved.
Outer Join Evaluation by Semijoins
With the above concepts in mind, the semijoin algorithm can be extended for the evaluation of outer joins. In accordance with one method, executing an outer join includes performing semijoin reduction steps, each semijoin reduction step switching one tuple list or reducing all tuple lists of one relation, until a predefined criterion is met. The predefined criterion can include every target relation being reduced by all the necessary relations. The method further includes adding virtual connector pairs to the tuple lists, and sending all the tuple lists to a common host for assembling the join result.
In some implementations, the predefined criterion for stopping the performing semijoin reduction steps may be changed. For example, the semijoin reduction steps may be performed so long as the reduction steps are “profitable,” i.e., where profitability is defined in terms of some appropriate cost measure that need not necessarily refer to relation size. Further, in the assembly step, the virtual pairs are concatenated using a domino principle and then spliced together to obtain the resulting tuples of the join.
In general, the technique of semijoin reduction is particularly well suited for a distributed environment in which relations are usually located on different hosts. In such a situation, an important goal is to minimize communication costs among hosts. Communication costs are high where large intermediate joins are executed among distributed hosts, and then transmitted over a communication line.
Strictly speaking, the above description of the semijoin reduction technique is not yet an algorithm, since it does not describe the sequence in which to execute the semijoin reduction steps. This leaves a considerable degree of freedom, since in most cases a given join can be evaluated using many different execution plans. That is, many sequences are admissible. Further, all of these sequences lead to the same final reduction state of the tuple lists involved, indicated by the stopping criterion described above. However, the communication cost of obtaining the final reduction state may vary greatly, up to several orders of magnitude. As a practical approximation, the cost of a reduction sequence may be understood to consist only of the communication cost, which is the dominant component, and smaller contributions to the cost can be neglected. Communication costs thus have a major influence on the overall execution time of the query.
Conventional search and classification systems and database management systems use only dynamic programming to optimize joins in terms of costs. FIG. 1 shows an exemplary algorithm 10 for optimizing a join J involving N tables using dynamic programming. The dynamic programming algorithm is essentially a brute force search over a very large search space. Such a brute force search is justified only in the case that no other options are available. Variants of the search algorithm attempt to gain better results by restricting the search space to special forms of the evaluation tree.
What is needed is a query optimizer for minimizing communication costs of join execution processes, in particular for the variant of the optimization problem which is restricted to involve only semijoin reductions.