1. Field of the Invention
This invention relates to query mechanisms for relational database management systems. More particularly, the present invention relates to a novel method of evaluating SQL "full outer joins", which methodology enables the use of arbitrary join conditions in specifying the query.
2. Description of the Related Art
A relational database management system, or RDBMS, is a computer implemented database management system that uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables, formally denominated as "relations", are typically stored for use on disk drives or similar mass data stores. A "table" includes a set of rows, formally denominated "tuples" or "records", spanning several columns. Each column in a table includes "restrictions" on the data contents thereof and may be designated as a primary or foreign key. Reference is made to C. J. Date, An Introduction to Database Systems, 6th edition, Addison-Wesley Publishing Co. Reading, Mass. (1994) for a comprehensive general treatment of the relational database art.
The tables are typically stored on direct access storage devices (DASD) such as magnetic or optical disk drives. Data may be retrieved from this type of database by a variety of methods. For example, a computer program can extract information from the database without human intervention. This is the case in batch or embedded processing. In the latter example, queries are embedded in the compiled program, and when executed, return an answer set to the compiler, which then integrates the answer set into the remainder of the compilation. Alternatively, a user can interact with a query system program which serves as a front-end to the database system. "Accessing a table" is a term used in the art to mean reading information from a table. Since the tables are normally stored on DASD, accessing a table requires transferring all of part of the table from DASD into the random access memory, or RAM, of the computer system. When information is needed from a plurality of tables, the tables may be joined by the database software of firmware.
A RDBMS is structured to accept commands to store, retrieve and delete data using high-level query languages such as the Structured Query Language, or SQL. The term "query" denominates a set of commands for retrieving data from a stored database. The SQL standard has been promulgated by the International Standards Association since 1986. Reference is made to the SQL-92 standard "Database Language SQL" published by the ANSI as ANSI X3.135-1992 and published by the ISO as ISO/IEC 9075: 1992 for the official specification of the 1992 version of the Structured Query Language. Reference is also made to James R. Groff et al. (LAN Times Guide to SQL, Osborne McGraw-Hill, Berkeley, Calif., 1994) for a lucid treatment of SQL-92, and to Don Chamberlin, Using the New DB2, Morgan Kaufman Publishers, Inc, 1996, for further discussion of that language.
As used herein, an RDBMS "query" refers to a set of user commands for retrieving data from the stored database. The query language requires the return of a particular data set, sometimes referred to as "the answer set", in response to a particular query, but the method of query execution, or "Query Execution Plan", employed by the RDBMS is not specified by the query. There are typically many different useful execution plans for any particular query, each of which returns the required data set. For large databases, the execution plan selected by the RDBMS to execute a query must provide the required data return at a reasonable cost in time and hardware resources.
Conceptually, the SQL SELECT operation forms a Cartesian product of the tables specified in the FROM clause and then selects the tuples that satisfy the predicates specified in the WHERE clause. As a result of this selection some tuples of the input tables may not appear in the output; i.e., the operation loses some of the input tuples. A variant of the SQL SELECT operation, called the OUTER JOIN, has been defined, which does not lose any of the input tuples. For a discussion of this variant, reference is made to IPSO-ANSI92, the Working Draft: Database Language SQL2 and SQL3; X3H2; ISO/IEC JTC1/SC21/WG3.
Joins allow additional information to be obtained across tables in a meaningful way. A simple example would be that a table of employee information lists an employee's department number as `76`, but the definition of department `76` requires reference to another table, the Department-table, which lists the full department title associated with each department number. In this second table the row for department `76` also contains a column with the department title `Information Systems Department`. Thus, a user desiring to generate a report containing a list of all employees including their department titles may want to establish a join relationship between the department number column in the Employee-table and department title column in the Department-table, so that the employees' department can be printed in the title form instead of the numerical form. Ways of specifying and performing joins are the subject of substantial effort. This is because database tables can be very large and processing tables can be expensive in terms of computer resources. Therefore, it is important that methods for joining tables be efficient.
SQL supports outer and inner joins. The semantics of the outer join are outlined as follows: An outer join involves a join of two tables, which are referred to as the left table and the right table. An outer join differs from an inner join in that it includes rows that have no "partners"--that is, rows from the left table that have no matching rows in the right table, or vice versa. Several subcategories of the outer join operation have been defined: specifically they are the FULL OUTER JOIN, LEFT OUTER JOIN and RIGHT OUTER JOIN. These operations have two operands, the previously discussed left and right tables. If tuples of both tables are preserved, then the operation is called full outer join. If the tuples of only one table are preserved, e.g., the left table or the right table, then the result is respectively a left or right outer join. In the output or answer set, for the non-matching tuples of a preserved table, NULL values are assigned to the columns of the other table.
The Left Outer Join
A left outer join includes rows from the left table that have no matching values in the right table. These rows are given null values for the missing data. Consider the following query to tables T1 and T2:
______________________________________ SELECT q1.c1, q1.c2, q2.c1, q2.c2 FROM T1 q1 LEFT OUTER JOIN T2 q2 ON (q1.c1&gt;=q2.c2) ______________________________________
This query returns all the matching pairs of T1 and T2 tuples based on the join condition "q1.c1&gt;=q2.c2", just as in a regular join. The query also returns rows of T1 which do not match any T2 rows based on the join condition. In this example, T1 is called the tuple-preserving operand, while T2 is called the null-producing operand, because T1 tuples will be preserved when there is no match in T2.
The Right Outer Join
The right outer join is the same as left outer join, except that the join operands, i.e., the participating tables, are commuted. Thus, the right outer join includes rows from the right table that have no matching values in the left table. These rows are again given null values for any missing data elements. Accordingly, the following two outer joins are equivalent:
______________________________________ SELECT q1.c1, q1.c2, q2.c1, q2.c2 FROM T1 q1 LEFT OUTER JOIN T2 q2 ON (q1.c1&gt;=q2.c2) SELECT q1.c1, q1.c2, q2.c1, q2.c2 FROM T2 q2 RIGHT OUTER JOIN T1 q1 ON (q1.c1&gt;=q2.c2) ______________________________________
The Full Outer Join
A full outer join includes both those rows from the left table that have no matching values in the right table, as well as those rows from the right table that have no matching values in the left table. In either case, rows lacking matching values in the opposite table are given null values for the missing data. Consider the following query:
______________________________________ SELECT q1.c1, q1.c2, q2.c1, q2.c2 FROM T1 q1 FULL OUTER JOIN T2 q2 ON (q1.c1&gt;=q2.c2) ______________________________________
This query returns all the matching pairs of T1 and T2 tuples based on the join condition "q1.c1&gt;=q2.c2". The query also returns rows of T1 which do not match any T2 rows based on the join condition as well as rows of T2 which do not match any T1 rows.
Note that a full outer join is not equivalent to a union of a left outer join and a right outer join. It will therefore be appreciated that the full outer join defined above is not identical to the following query:
______________________________________ SELECT q1.c1, q1.c2, q2.c1, q2.c2 FROM T1 q1 LEFT OUTER JOIN T2 q2 ON (q1.c1&gt;=q2.c2) UNION ALL SELECT q3.c1, q3.c2, q4.c1, q4.c2 FROM T1 q3 RIGHT OUTER JOIN T2 q4 ON (q3.c1&gt;=q4.c2) ______________________________________
This is due to the fact that the regular join portion of the full outer join is performed twice: once by the left outer join and again by the right outer join, thereby leading to an incorrect result.
Many database management systems support left and right outer joins. Some systems further support full outer joins, but they impose the strict limitation that only the conjunction of equality predicates is allowed as the join condition. This limitation is due to the fact that these systems implement the full outer join using a modified merge join algorithm which, prior to the making of the present invention, has inherently been restricted to the use of equality predicates.
Efforts at outer join optimization have been made by others. Some of the inventors of the present invention teach outer join optimization in U.S. Pat. No. 5,557,791. Further, P. Goel and B. Iyer address this matter in "SQL Query Optimization: Reordering for a General Class of Queries.", published in the Proceedings of the ACM SIGMOD '96 International Conference on Management of Data, Montreal, Canada, pp 47-56, (1996). Finally, G. Bhargava, P. Goel and B. Iyer present another methodology in "Hypergraph based reorderings of outer join queries with complex predicates.", Proceedings of the ACM SIGMOD '95 International Conference on Management of Data, San Jose, Calif., pages 304-315, (1995).
In general, each of these references regarding outer join implementation shares common theme: the strategies taught therein require special runtime operators, which somehow keep track of which tuples must be preserved. In the prior art this has typically been done by creating and maintaining some list of tuple identifiers, or TIDs, for those tuples which must be preserved. When the outer join condition involves only equality predicates in the conjunctive normal form, it is possible to implement the full outer join via a modified merge join algorithm without keeping lists of preserved tuple identifiers. However, a modified merge join, according the prior art, cannot be used for arbitrary join conditions, including but not necessarily limited to the inequality predicate and disjunction, as specified in IPSO-ANSI92. Moreover, the creation and maintenance of TID lists introduces an unwanted and now unnecessary element of computational overhead, thereby inducing system performance problems in any system which relies on such lists.
Join conditions are well defined in IPSO-ANSI92. Typically, join conditions defining the relationship between tuples are logically selectable from the following set of predicates:
is null; PA1 like; PA1 equals (=); PA1 does not equal (.noteq.) PA1 is greater than (&gt;); PA1 is less than (&lt;); PA1 is greater than or equal to (.gtoreq.); and PA1 is less than or equal to (.ltoreq.).
A plurality of these conjunction predicates can also be logically joined by the logical operators AND, OR, and NOT. Accordingly, as used herein, the term "arbitrary join conditions" defines any of these join conditions, or a plurality of join conditions joined by one of the defined logical operators. This precisely defined term defines over prior art full outer join methodologies, which support only the equality predicate.
Finally, the computation of large queries implementing a full outer join can be resource intensive. This problem is often exacerbated by increasingly complex query structures where the union of a plurality of outer joins repeatedly returns a number of instances of a matching pair. Any solution to the problem of using a modified merge join to implement a full outer join and which further enables the use of arbitrary join conditions should, to the greatest extent possible, minimize the impact of the query on system resources by only returning the first instance of a matching pair.
Accordingly, there is a clearly-felt need in the art for a methodology which enables the use of a modified merge join to implement a full outer join which enables the use of arbitrary join conditions other than the equality predicate. There is a further need that this query methodology return only a first instance of a matching pair in response to a join condition, thereby enabling the methodology to be resource efficient. These unresolved problems are clearly felt in the art and are solved by this invention in the novel manner described below.