This disclosure relates generally to computer-based mechanisms for performing join operations, and more particularly to techniques for using semijoin reduction operations to resolve outer join queries.
In relational databases, a relation represents a set of entities such as customers or products as an association between properties of the entities called attributes. An element of a relation is an ordered set of associated properties called a tuple. A relational database contains tables of data, and each table implements a relation. Each row in the table represents a tuple in the relation. Each column represents a value range of an attribute, such that each cell in the column contains a value, which may be a repeated value or a null value.
A central task in enterprise data processing is to extract from a relational database the set of tuples that meet some given logical condition. Often, the task is to extract tuples that each contain values from more than one table in the database. A table for relation A can include a column for the same attribute as another table for relation B. In such cases, the tables A and B can be joined logically by matching rows via the values of this attribute. To extract a specified set of tuples that include properties from several tables, the respective tables need to be joined.
Search and classification systems, such as the SAP NetWeaver Text Retrieval and Extraction (TREX) system, perform many tasks, including extraction of tuples that meet given logical conditions. In such cases, the task is called a search request or a database query, and the set of tuples that meet the logical condition specified in the query is called the result set. In order to perform such tasks with the required efficiency, a system like TREX uses highly optimized techniques for joining tables. One such technique is described in more detail below.
In many cases, join operations are computationally expensive. For this reason, much research has been devoted to devising efficient algorithms for resolving them. For example, techniques for performing a hash join, a sort merge join, a join via semijoin, and nested loop join are well known to those of skill in the art. Some systems such as TREX may evaluate joins via semijoins, particularly if every relation is located on a different host. An exemplary semijoin technique can be executed as follows.
For a single join J of the form A.x=B.y, the relation A is reduced by J as follows: At B, a list L of all occurring values b.y. is constructed, L is sent to A over a communication line from the host of B to the host of A, and all rows a whose value a.x is not in L are deleted from A. This is called a semijoin reduction step.
Semijoin reduction steps are performed until the relations are no longer further reduced (or until none of the possible semijoin reduction steps is still considered “profitable” by a cost measure that is different from relation size). After the reduction phase, for all attributes A.x occurring in the join the tuple lists T(A,x), which consist of the pairs (a,a.x) where a is the row identifier of a non-removed row of A and a.x, the value of the attribute A.x, is sent to a common host where the result of the join is assembled.
One tool for working with complex joins is called a query graph. The query graph is obtained from the query by drawing a vertex for every relation in the join and an edge between any two vertices joined by a join condition. In the assembly process, it may be assumed that the query graph is connected, since the solution of the whole query is simply the Cartesian product of the solutions of its connected components, which themselves are query graphs. The assembly process then works as follows.
The vertices of the query graph are ordered into a sequence A1, A2, . . . , An, such that for every i>1, relation Ai is joined to at least one relation Aj with j<i. Iteration is started on A1. For a1ε A1, a 1-tuple (a1) is formed. Assuming, in general, that the construction of 1-tuples, 2-tuples, . . . , i-tuples for some i with 1<=i<n has been defined, construction of (i+1)-tuples can proceed as follows.
Let B1, B2, . . . , Bk be those relations among A1, A2, . . . , Ai with which A(i+1) is joined, k>=1 holds by vertex ordering. If the query graph is a tree, k=1 will always holds. In any case (also for k>1) any given i-tuple (a1, a2, . . . , ai), let b1 ε B1, b2 ε B2, . . . , bk ε Bk be the corresponding elements in this i-tuple. Assume the join between Bj and A(i+1) to be Bj.xj=A(i+1).yj, j=1, . . . , k, the sets Mj of pairs in T(A(i+1),yj) satisfying a(i+1).yj=bj.xj can be found, and their intersection D can then be formed. If D is empty, consider the next tuple (a1′, a2′, . . . , ah′), for h h chosen to minimally comprise B1, B2, . . . , Bk (so 1<=k<=h<=i), extend it to an i-tuple (a1′, a2′, . . . , ai′) and repeat the process. If D is not empty, iterate on D. For a(i+1) ε D, form an (i+1)-tuple (a1, a2, . . . , ai, a(i+1)).
The general assembly process can be illustrated in a simple example. Let the query graph be the three-vertex chain:
A-B-C
with join conditions A.x=B.x and B.y=C.y (note the simplifying identity of attribute names in join conditions). The relation order A, B, C can be used. A common value of a.x and b.x can be represented in shorthand as x. Starting with any a ε A, a 1-tuple (a) exists. Next, b ε B is determined such that there are pairs (a,x) and (b,x). The 1-tuple (a) can be extended to a 2-tuple (a,b). Then, c ε C is determined such that there are pairs (b,y) and (c,y). The 2-tuple (a,b) can finally be extended to a 3-tuple (a,b,c).
In summary, 4 pairs (a,x), (b,x), (b,y), (c,y) of row identifiers and attribute values are used to construct the 3-tuple (a,b,c) of row identifiers. If the order in every second pair is reversed, i.e. (a,x), (x,b), (b,y), (y,c), then the domino principle becomes evident.
The technique of semijoin reduction is well suited for a distributed environment where the relations may be located on different hosts. In this case, the usual aim is to minimize communication costs, which aim is furthered in that no large intermediate joins are generated to be transmitted over a communication line, and the list L of occurring values may be (cumulatively) subject to different compression techniques. However, conventional semijoin techniques are used only for evaluating inner joins, and are not suitable for outer join resolution operations.