Relational databases are organized into tables which consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple tuples and multiple columns. The tables are typically stored on random access storage devices (DASD) such as magnetic or optical disk drives. Data may be retrieved from this type of database by a variety of methods. For example, a computer program can extract information from the database without human intervention or a user can interact with a query system program which serves as a front-end to the database system. "Accessing a table" is used in the art to mean reading information from a table. Since the tables are normally stored on DASD, accessing a table requires transferring all or part of the table from DASD into the RAM of the computer system. When information is needed from a plurality of tables, the tables may be joined by the database software or firmware. Joins allow additional information to be obtained across tables in a meaningful way. A simple example would be that a table of employee information lists an employee's department number as `76`, but the definition of department `76` requires reference to another table, the Department-table, which lists the full department title associated with each department number. In this second table the row for department `76` also contains a column with the department title "Information Systems Department." Thus, a user desiring to generate a report containing a list of all employees including their department titles may want to establish a join relationship between the department number column in the Employee-table and the department title column in the Department-table, so that the employee's department can be printed in the title form instead of the numerical form. Ways of specifying and performing joins are the subject of substantial effort. Database tables can be very large and processing tables can be expensive in terms of computer resources. Therefore, it is important that methods for joining tables be efficient.
The term cursor is used in the art for a mechanism for positioning a pointer to a tuple of a table. To `open a cursor on a table` means position the cursor to the first row of the table that satisfies the selection criteria if any. At any point of time, there might be many cursors open on the same table, each may be positioned on a different (or the same) tuple.
The outer join operation is being introduced in major relational database systems, and is already proposed as part of the emerging SQL2 standard. (See ISO-ANSI Working Draft: Database Language SQL2 and SQL3; X3H2/90/398; ISO/IEC JTC1/SC21/WG3, 1990). However, there has not been a significant study of efficient execution methods for this kind of operation. In this specification, we describe a general method, and a more efficient method, called specialized method, which works for most of the common cases. One important property of these methods is that they allow several existing efficient join methods to be extended to support the outer join operation.
Conceptually, the SQL SELECT operation forms a cartesian product of the tables specified in the FROM clause and then selects the tuples that satisfy the predicates specified in the WHERE clause. As a result of this selection, some tuples of the input tables may not appear in the output; i.e., the operation loses some of the input tuples. A variant of the SQL SELECT operation, called outer join, has been defined, which does not lose any of the input tuples. (see 1. ISO-ANSI Working Draft: Database Language SQL2 and SQL3, supra; 2. C. Date, RELATIONAL DATABASE; SELECTED WRITINGS; Addison-Wesley Publishing Co. 1986). Several subcategories of the outer join operation have also been defined, called full outer join, left outer join, right outer join, and full natural outer join. Except for the full natural outer join, the rest are well defined only if two tables are specified in the FROM clause. The two tables are called left and right tables. If tuples of both tables (respectively, the left table or the right table) are preserved the operation is called full (respectively, left, right) outer join. Suppose we have two tables, T.sub.1 (C.sub.11, . . . , C.sub.1n) and T.sub.2 (C.sub.21, . . . , C.sub.2m). Full natural outer join is the counterpart of natural join, where the join predicate has the form C.sub.11 =C.sub.21 =. . . .LAMBDA.C.sub.12 =C.sub.22 =. . . .LAMBDA.. . . . The columns C.sub.1j, C.sub.2j, . . . are called join columns. These columns appear in the output only in the form of VALUE(C.sub.1h, C.sub.2h, . . . ), where the VALUE function returns the first non-NULL argument. The full natural outer join allows any number of tables as its operands.
In the following examples, we use a syntax close to the one proposed in ISO-ANSI Working Draft: Database Language SQL2 and SQL3 (supra):
FROM T.sub.1 FULL JOIN T.sub.2 ON &lt;Predicate&gt;
&lt;Predicate&gt; has the same format as the WHERE clause (a mixture of ANDs, ORs, and NOTs of conditional expressions, which include subquery predicates) of SQL statements. For a given outer join operation containing T.sub.1 and T.sub.2, there may be other predicates applied to T.sub.1 and T.sub.2 before the operation, and predicates applied to the result of the operation. Application of these predicates is done using the SQL SELECT operation, and is not part of the outer join operation as far as execution methods are concerned. In this specification, we explain methods which are specific to the outer join operation.
For example, suppose we have two tables, cs and ls (current sales and last year sales). Each of these tables have the following columns:
______________________________________ pno /*product number (unique key)*/ ptype /*product type */ sales /*amount of sales*/ profit /*amount of profit*/ ______________________________________
We would like to formulate the following query (Query 1). Give all the current sales information, and last year sales information for products of the same type, if the current year sales are not profitable and the last year's sale amount is equal to or less than the current sale amount. Another way of saying the same query is as follows. Give all the current sales information and if they are not profitable then the last year sales information for products of the same type if their sale amount is equal to or less than the current sale amount. We refer to this example later in the description of the methods. For this reason, we chose to use the RIGHT outer join syntax.
______________________________________ Query 1: ______________________________________ SELECT * FROM ls RIGHT JOIN cs ON ( ls.ptype = cs.ptype AND ls.sales .ltoreq. cs.sales AND cs.profit .ltoreq. 0 ) ______________________________________
Note that all the tuples of cs table are preserved. One may wish to preserve all the tuples of ls table as well. For this, FULL outer join must be specified. In the above query, one or many of the predicates may be omitted to generate variations. All these variations have reasonable meanings. A variant of Query 1 is:
______________________________________ Query 2: ______________________________________ SELECT * FROM ls RIGHT JOIN cs ON ( ls.pno = cs.pno AND ls.sales .vertline..ltoreq. cs.sales AND cs.profit .ltoreq. 0 ) ______________________________________
In this example a unique key column, pno, is used as a join column.
Predicate Forms
As mentioned before, the &lt;Predicate&gt; of the ON clause of the outer join has the same format as the WHERE clause (a mixture of AND, ORs, and NOTs of conditional expressions, which include subquery predicates) of SQL statements. Further, while our General Method handles the ON clause with this generality, our special methods can handle a set of special cases, but with more efficiency and simplicity. For example, if the ON clause is only an equality predicate, then our Special Method 1 is a much better choice. For this purpose, we define, without loss of generality, a form for the predicates in the ON clause. Next we identify a set of special cases, called predicate categories, based on this form. Later, we identify which of these categories can be handled more efficiently by our special methods.
Suppose we have two tables, T.sub.1 (C.sub.11, . . . , C.sub.1n) and T.sub.2 (C.sub.21, . . . , C.sub.2m). Let us assume, without loss of generality, that T.sub.1 is the outer and T.sub.2 is the inner of the join. Let us also assume that the ON clause has the following form.
C.sub.11 .theta.F.sub.11 (C.sub.21, . . . , C.sub.2m, V.sub.1, . . . , V.sub.h).OMEGA.
C.sub.11 .theta.F.sub.11 (C.sub.21, . . . , C.sub.2m, V.sub.1, . . . , V.sub.h).OMEGA.. . .
C.sub.1k .theta.F.sub.k1 (C.sub.21, . . . , C.sub.2m, V.sub.1, . . . , V.sub.h).OMEGA.
C.sub.1k .theta.F.sub.k2 (C.sub.21, . . . , C.sub.2m, V.sub.1, . . . , V.sub.h).OMEGA.
P.sub.12 (C.sub.11, . . . , C.sub.1n, C.sub.21, . . . , C.sub.2m, V.sub.1, . . . , V.sub.h).OMEGA.
P.sub.1 (C.sub.11, . . . , C.sub.1m, V.sub.1, . . . , V.sub.h).OMEGA.
P.sub.2 (C.sub.21, . . . , C.sub.2m, V.sub.1, . . . , V.sub.h).OMEGA.
Note that all these predicates are part of the outer join operation, and they are not the predicates that have to be applied before or after the outer join operation. Outer join predicates must be applied to a pair of tuples from the outer and the inner tables, even if the form of such predicates look like local predicates. Except for P.sub.2, the rest are join or local predicates on T.sub.1. .theta. may be =, &gt;, .gtoreq., &lt;, .ltoreq.. The predicates involving .theta.s are called special predicates. We will use these predicates in a special way in the methods. .OMEGA. may be the Boolean operator 1 or v. Fs are scalar functions (more precisely, they must return at most a single value, as we will see below). The evaluation of these functions must not depend on the current cursor position of the outer table. For example, the scalar functions can be arbitrary expressions on the columns of inner tuples. The arguments V.sub.1, . . . , V.sub.h may be bindings passed to this outer join operation, which include, constants, host variables, and correlated column references. F functions may also involve subqueries. Again, evaluation of these subqueries must not depend on the current cursor position of the outer table. V.sub.1, . . . , V.sub.h may be the result of scalar subqueries. If a scalar subquery returns more than one value, then it is an error. If a scalar subquery returns no value then it is assumed that the subquery returns a NULL value. Later, we will discuss how we deal with these special cases. F function may be a subquery of the form: C.sub.1k .theta. ALL (&lt;subqueries&gt;), where .theta. may be =, &gt;, .gtoreq., &lt;, .ltoreq.. The above form of ALL subqueries can be converted such that the subquery returns at most one value, satisfying our constraint on the F functions. For example, if .theta. is &gt;, then the subquery can return the maximum value only. If the subquery returns no value, then the predicate is TRUE.
P.sub.12 can be any predicate involving columns of both tables. For example, EQU (C.sub.15).sub.2 +C.sub.23 =5 EQU OR EQU C.sub.11 +C.sub.21 IN (SELECT . . . FROM . . . WHERE . . .)
P.sub.1 and P.sub.2 are local predicates on T.sub.1 and T.sub.2. The predicate on profit in Query 1 and 2 is an example of a P.sub.2 predicate. The rest are a set of join or local predicates on T.sub.1. We define different categories of predicates:
Category 1: Conjuncts of special equality predicates:
all .OMEGA.s are .LAMBDA.s, .theta.s are equalities, and we allow P.sub.12 to exist if the following two conditions hold. First, C.sub.11, . . . , C.sub.1k form a unique key of T.sub.1 table. Second, none of the above columns have associated with them only an F.sub.ij which is a universal quantifier subquery (ALL subquery). Note that in this case, the special equality predicates on the columns not part of the unique key are also considered to be part of P.sub.12. Example Query 2 falls into this category since pno is a unique key of the outer relation. Also, Query 1 falls into this category if we eliminate the predicate ls.multidot.sales.ltoreq.cs.multidot.sales. This is the most common form of outer join predicates, and our methods are tuned for this case. Note that we allow multiple special equality predicates for a given column. PA1 all .OMEGA.s are .LAMBDA.s, each .theta. is one of =, &gt;, .gtoreq., &lt;, .ltoreq., and P.sub.12 does not exist. Example Query 1 falls into this category. PA1 This category is handled by the general method described below, and will not be discussed further.
Category 2: A predicate is not in category 1 and contains conjuncts of special equality or inequality predicates:
Category 3: None of the above.
First for background information we will discuss a simple method, called the general method, that handles all the categories. Throughout this discussion we assume the joins are done pair-wise. The sequential general method is as follows. Assume LOCP.sub.1 and LOCP.sub.2 are non-outer-join local predicates on T.sub.1 and T.sub.2 respectively. The inputs to the outer join operation are T.sub.1 and T.sub.2 after application of the local predicates. Here we assume the query spec is: apply LOCP.sub.1 predicate on table T.sub.1 and apply LOCP.sub.2 predicate on table T.sub.2 then do the outer join of the results using the predicate OUTJP. For the natural full outer join, we may have more than two tables.
As used in the following descriptions a TID is a tuple identifier. Typically, it consists of two parts: a page ID concatenated to a slot number (an index into an array). By accessing the particular slot in the array at the bottom of the particular page, the pointer to the actual location of the tuple on the page can be obtained. A TID-list is a list of TIDs obtained by accessing an index. The index entries are of the form &lt;key value, TID&gt;.
When we say that results of an operation are output, this simply means that the results are either displayed to a user or stored in some form of memory in the computer (RAM, DASD, TAPE, etc.) for subsequent use.
______________________________________ General Sequential Method ______________________________________ Execute the query; Output its results (may be piped to the next stage); During this execution, form one list (TIDS.sub.-- list) with distinct TID or eliminate duplicates later; For each preserved table, {Sort the list and eliminate duplicates (if TIDs are not distinct); Scan the associated original table (e.g., T.sub.1); {For each tuple {if its TID is in TID.sub.-- list and LOCP.sub.1 is true {output the tuple with NULL values for the columns of the other table; } } } } ______________________________________
Note that the second part of the method does the preservation. This part is very similar to a merge join of the table and the sorted TID list. This method can be further optimized if we scan a table during the first phase. Suppose T.sub.1 is scanned in the first phase. In the TID list of T.sub.1, we remember all the TIDs that satisfy predicate LOCP.sub.1, and we mark the ones that are part of the output (i.e., satisfied the JOINP join predicate). During the second phase, we just need to scan this list, get the TIDs that are not marked, and preserve their associated tuples.
As mentioned before, the advantage of this method is that it handles all categories of predicates explained above. However this method has several disadvantages:
1. One major disadvantage is the overhead of reaccessing the input tables in the second part of the method. Note that this requires the input table to be stored. Therefore, we cannot pipeline the input tables. Pipelining is much more efficient in many cases.
2. Output does not preserve ordering. Suppose the tables T.sub.1 and T.sub.2 were outer joined, and a merge join was done in the first step of the general method. Therefore, the order of T.sub.1 is preserved at the output. However, the second step of the method adds more tuples to the output. As a result, we lose this ordering.
Our specialized method described below does not have these disadvantages, handles most of the cases, and is a more efficient method than the general method. The left outer join can be handled easily by an extension to a class of existing join methods such as nested loop, merge scan, hash join, hybrid join, etc. All these methods scan the outer table (at some stage) once and find the matching inner tuples. At a given scan position, if no tuple of the inner table matches the outer tuple, the outer tuple is output with NULL values for the inner columns. Handling of the full outer join, which requires preserving the inner tuples also, is more involved. In the full outer join, preservation of the outer tuples can be done in the same way as the left outer join. The preservation of the inner tuples requires more work. The same is true for the right outer join. Note that although semantically we can convert the right outer join to the left outer join, we may not want to do this because it might be more expensive. Even a left outer join query may have to be executed as a right outer join for cost reduction. An example of the case that right outer join is preferable is when the table that has to be preserved has a good access path, and the other table is large and does not have a good access path, therefore, we want to scan it only once. If this table is the inner, then for each tuple of the outer we have to scan it, resulting in very poor performance.
As mentioned before, the full outer join is a combination of the left and the right outer joins. We have shown a simple and efficient way to preserve the tuples of the outer table of an outer join operation. The rest of this specification describes methods for preservation of the tuples of the inner table.
In published Japanese patent application titled "OUTER JOINING OPERATION SYSTEM OF RELATIONAL DATA BASE", JP 59-125461, an improvement of a basic method for outer join is described. The basic method will perform full outer join in 4 stages: 1. inner join in a temporary table; 2. left outer join in a temporary table; 3. right outer join in a temporary table; 4. retrieve from the temporary table. The described improvement will eliminate the use of the temporary table. The present invention is completely different from this as will be seen below.
C. Date (RELATIONAL DATABASE; SELECTED WRITINGS, supra), A. Pirotte (A Formal Definition of the Relational Model, ACM SIGMOD Record, 13(1), 1982) and the ISO-ANSI draft (ISO-ANSI Working Draft: Database Language SQL2 and SQL3, supra) define the syntax and semantics for outer join. A. Rosenthal and D. Reiner (Extending the Algebraic Framework of Query Processing to Handle Outerjoins, Proc. 10th International Conference on Very Large Data Bases, Singapore, August 1984) address the execution problem for outer join. Their paper covers the case where the left table needs to be preserved, which is the same as the simple extension explained earlier. However, the paper neither allows the preserved table to be the inner, nor provides a method for full outer join, which are features of the present invention. The paper mentions that these cases require extra processing with no further detail. In this application, we have provided methods to handle all these cases. Further, we have described two specialized methods, which can be used as an extension to a class of known efficient join methods (e.g., nested loop, merge join, hybrid join, and hash join) with minimal extra processing.
No reports of a method for parallel execution of outer join have been found in the literature.