1. Field of the Invention
This invention relates in general to systems, methods, and computer programs in the field of processing database queries in database management systems (DBMSs), including relational, hierarchical, and object-oriented DBMSs, and in particular to handling NULL values in SQL queries over object oriented data.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. Relational databases are organized into tables which consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple tuples and multiple columns. The tables are typically stored on random access storage devices (DASD) such as magnetic or optical disk drives for semi-permanent storage.
RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American Nationals Standard Organization (ANSI) and the International Standards Organization (ISO). The current SQL standard is known informally as SQL/92.
In RDBMS software, all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages, such as C and COBOL. SQL allows the user to manipulate the data.
The definitions for SQL provide that a RDBMS should respond to a particular query with a particular set of data given a specified database content, but the method that the DBMS uses to actually find the required information in the tables on the disk drives is left up to the RDBMS. Typically, there will be more than one method that can be used by the RDBMS to access the required data. The RDBMS will optimize the method used to find the data requested in a query in order to minimize the computer time used and, therefore, the cost of doing the query.
Today, a DBMS can instead be an object-oriented database (OODB), wherein the database is organized into objects having members that can be pointers to other objects. An object can have relationships to other objects. The objects contain references, and collections of references, to other objects in the database, thus leading to databases with complex nested structures.
The integration of object technology and database systems has been an active area of research for the past decade. One important aspect of the integration of these two technologies is the provision of efficient, declarative query interfaces for accessing and manipulating object data. Compared to other aspects of object-oriented database (OODB) technology, such as integrating persistence into object-oriented languages like C++ and Smalltalk, queries were given relatively little attention in the early days of OODB research. Mike Stonebraker, Third Generation Data Base System Manifesto, Computer Standards & Interfaces, 12, December 1991, which is incorporated by reference herein.
In Won Kim, Object-Oriented Database Systems: Promise, Reality, and Future, Proc. 19th International Conference on Very Large Data Bases, Dublin, August 1993, which is incorporated by reference herein, it is pointed out that even today, a number of commercial OODB systems are quite weak in this regard. As the OODB field has developed, however, a number of proposals for OODB query languages have appeared in the database literature including the following publications:
Michael J. Carey, David DeWitt, and Scott Vandenberg, A Data Model and Query Language for EXODUS, Proc. ACM-SIGMOD International Conference on Management of Data, Chicago, June 1988; PA1 Won Kim, A Model of Queries for Object-Oriented Databases, Proc. 15th International Conference on Very Large Data Basses, Amsterdam, August 1989; PA1 Francois Bancilhon, S. Cluet, and C. Delobel, A Query Language for the O.sub.2 Object-Oriented Database System, Proc. 2nd International Workshop on Database Programming Languages, Morgan-Kaufmann Publishers, Inc., Gleneden Beach, June 1989, hereinafter referred to as "BCD89"!; PA1 Jack Orenstein, Sam Haradhvala, and Benson Margulies, Don Sakahara, Query Processing in the ObjectStore Database System, Proc. ACM-SIGMOD International Conference on Management of Data, San Diego, June 1992, hereinafter referred to as "OHMS92"!; PA1 S. Dar, N. Gehani, and H. Jagadish, COL++: A SQL for a C++ Based Object-Oriented DBMS, Proc. International Conference on Extending Data Base Technology, Advances in Database Technology--EDBT '92, Lecture Notes in Computer Science, Springer-Verlag, Vienna, 1992; PA1 Michael Kifer, Won Kim, and Yehoshua Sagiv, Querying Object-Oriented Databases, Proc. ACM-SIGMOD International Conference on Management of Data, San Diego, June 1992; PA1 Tom Atwood, Joshua Duhl, Guy Ferran, Mary Loomis, and Drew Wade, Object Query Language, Object Database Standards: ODMG--93 Release 1.1, ed. R. G. G. Cattell, Morgan-Kaufmann Publishers, Inc., 1993, hereinafter referred to as "ADF.sup.+ 93"!; PA1 Jose Blakeley, William J. McKenna, and Goetz Graefe, Experiences Building the Open OODB Query Optimizer, Proc. ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 1993; PA1 t.A r k PA1 t.A r s.B PA1 t.A r k .fwdarw.(t.A IS NOT NULL and t.A r k) PA1 t.A r s.B.fwdarw.(t.A IS NOT NULL and t.B IS NOT NULL and t.A r s.B) PA1 not (t.A r k).fwdarw.not (t.A IS NULL or t.A r k) PA1 not (t.A r s.B).fwdarw.not (t.A IS NULL or s.B IS NULL or t.A r s.B) PA1 select * PA1 from Emp e PA1 where e.sal&gt;(select avg (e1.sal) from (e.dept.emps) e1) PA1 select * PA1 from Emp e PA1 where e.sal&gt;=all (select e1.sal from (e.dept.emps) e1)
all of which are incorporated by reference herein.
Some commercial object-oriented database systems that offer object query facilities are O2, discussed in BCD89, and ObjectStore, discussed in OHMS92. Each of these object-oriented database systems provides their own flavor of object query language. ObjectStore's query language is an extension to the expression syntax of C++. O2's query language is generally more SQL-like, and has been adapted into a proposed object-oriented database query language standard (ODMG-93) by a consortium of object-oriented database system vendors, ADF.sup.+ 93, which is incorporated by reference herein, but it differs from SQL in a number of respects, Won Kim, Observations on the ODMG-93 Proposal, ACM SIGMOD Record, 23(1), March 1994, hereinafter referred to as "Kim94"! , which is incorporated by reference herein.
Furthermore, it should be noted that SQL3 supports object relational queries, and Illustra Relational Database System includes object oriented SQL extensions. See, the publication by Jim Melton, ed., ISO-ANSI Working Draft: Database Language SQL (SQL3); ISO/IEC SC21 WGS DBL YOW-004 and ANSI XSH2-94-084, ISO.sub.-- ANSI, 1995, hereinafter referred to as "Mel95"!, which is incorporated by reference herein.
In RDBMS software using the SQL-92 query language, columns are of simple atomic types, and columns appear in queries as c or q.c where c is a column of some table and q is a correlation name or range variable. In query languages for Object-Oriented DBMSs, column expressions are replaced with path expressions that allow traversal through pointers, embedded classes and structs, multi-valued collections, and relationships to reach nested attributes.
Mel95 proposes traversal through embedded structures as Abstract Data Types (ADT). For example, if "e" is a correlation, "address" is an ADT of type Address, "zip" is an attribute of address, and the ".." characters are used to express traversal through embedded structures, then "e.address..zip" expresses the traversal through the Address structure to obtain the attribute "zip". It should be noted that the SQL3 draft has been further extended to include support for typed rows, references to rows, and a dereference operator (.fwdarw.).
In RDBMSs, columns of any type can assume NULL (i.e., unknown) values. In RDBMS software, NULL values are properly handled using tri-valued logic (i.e., TRUE, FALSE, or UNKNOWN) for predicates, and SQL-based RDBMSs employ such a logic. The OODB and C++ worlds are different; for example, neither C++ nor the ObjectStore OODBMS nor the ODMG standard has a general notion of NULL values, and their predicates are therefore based upon a two-valued logic (i.e., TRUE, FALSE). Most OODB systems support two-valued logic (TRUE or FALSE), and therefore cannot handle NULL values. See, OHMS92 and Charles Lamb, Gordon Landis, Jack Orenstein, and Dan Weinreb, The Objectstore Database System, Communications of the ACM, 1991, hereinafter referred to as "LLOW91"!; Bjarne Stroustrup, The C++ Programming Language, Addison-Wesley Publishing Co., 1987, hereinafter referred to as "Str87"!; Margaret A. Ellis and Bjarne Stroustrup, Annotated C++ Reference Manual, Addison-Wesley Publishing Co., 1990, hereinafter referred to as "Ell90"!; all of which are incorporated by reference herein.
In the early 1980's, a pre-OODB project the GEM project, discussed in C. Zaniolo, The Database Language GEM, In Proc., ACM-SIGMOD International Conference on Management of Data, 1983, hereinafter referred to as "Zan83"! and S. Tsur and C. Zaniolo, An Implementation of GEM: Supporting a Semantic Data Model on a Relational Backend, In Proc., ACM-SIGMOD International Conference on Management of Data, 1984, hereinafter referred to as "TZ84"!, which are incorporated by reference herein, extended the relational data model and the QUEL query language with entity-relationship features that were quite similar to those found in many of today's object-oriented query languages.
The GEM project was designed as a layer over a RDBMS that added entity-relationship modeling extensions to the relational model. The GEM project, which has a tri-valued logic, was implemented over a RDBMS that had a two-valued logic and did not support NULL values. The GEM project simulated NULL values by requiring an arbitrary value of an attribute to be selected to represent NULL values, as described in Zan83 and TZ84. For example, the value zero could be selected. Then, a test such as (e.sal IS NULL) would be implemented as (e.sal=0).
To translate GEM queries into underlying relational queries, GEM converted its tri-valued logic into two-valued logic, as described in TZ84. The GEM project developed tri-valued to two-valued logic transformations for the following forms of predicate expressions, as described in TZ84:
Here, s and t are correlations over tables, k is a constant, A and B are columns of t and s respectively that can have NULL values, and r is a relational operator among {=, .noteq., &lt;, .ltoreq., &gt;, .gtoreq.}. An example of the first type of predicate above is (e.sal&gt;100), and an example of the second type of predicate above is (e.did=d.no). The previous examples can be scoped by negation: (not (e.sal&gt;100)) and (not (e.did=d.no)) respectively. Negation is determined by counting the number of NOT's from the root of the predicate tree. If the number is odd, the expression is negative, otherwise it is positive. For example, in (not (not e.sal&gt;100)), the subexpression (e.sal&gt;100) is a positive expression, whereas in the expression (not (not (not e.sal&gt;100))), the subexpression (e.sal&gt;100) is a negative expression because it is scoped within an odd number of NOT'S. Positive expressions are replaced as follows:
The predicate (e.sal&gt;100) is transformed into the predicate (e.sal IS NOT NULL and e.sal&gt;100). Similarly, the predicate (e.did=d.no) is transformed into the predicate (e.did IS NOT NULL and d.no IS NOT NULL and e.did=d.no). Negative expressions are replaced as follows:
The predicate (not (e.sal&gt;100)) is transformed into the predicate (not (e.sal IS NULL or e.sal&gt;100)). Similarly, the predicate (not. (e.did=d.no)) is transformed into the predicate not ((e.did IS NULL or d.no IS NULL or e.did=d.no)). If the predicate is not within the scope of negation, the "IS NOT NULL" test guarantees that bindings for which the predicate evaluates to NULL return FALSE. Inversely, if the predicate is within the scope of negation, the "IS NULL" test guarantees that after the application of negation, bindings for which the predicate evaluates to NULL return FALSE.
The GEM project handles simple scalar comparisons, but it does not address predicates involving subqueries that can arise in SQL-92 queries. Subqueries were not an issue for the GEM project, because the GEM query language was based on the QUEL query language, not on SQL. SQL-92 includes predicates of the form: e r Q, and e r .alpha. Q in which e is an expression, r is a relational operator, .alpha. is a qualifier among {ANY, ALL} and Q is a subquery (note: the IN quantified predicate is not detailed here since e IN Q is equivalent to e=ANY Q and e NOT IN Q is equivalent to e.noteq.ALL Q).
The first form of predicate is a basic predicate with a subquery, while the second is a quantified subquery. The following query is an example of a basic predicate with a subquery, and retrieves employees who earn more than the average salary of employees in their departments.
The next query is an example of a query with a quantified subquery and selects employees who have the highest salary of employees in their departments.
Although the GEM query language introduced path expressions (functional joins, in GEM terminology) with the NULL-related semantics, their published transformations in TZ84 did not properly address the handling of NULL values when translating path expressions into QUEL queries. In particular, the transformations presented in TZ84 did not respect the outer-join-like NULL semantics of path expressions that was specified in Zan83.
As noted above, the other shortcoming of GEM was that it did not attempt to handle basic predicates with a subquery or quantified subqueries since GEM was based upon the QUEL query language which does not have subqueries. The SQL-92 query language does include predicates with a subquery including quantified subqueries. SQL-92 also provides an interpretation of the predicates with subqueries if a NULL value is encountered. However, the interpretation specifies when a predicate including a sub-query is NULL, and that is based on tri-valued logic.
The GEM project also did not address NULL pointers. With object-oriented DBMS software, pointer data types are available and attributes of these data types can be NULL. The notions of NULL value and NULL pointer differ in many respects. While NULL values are excluded from most object-oriented data models, they are reintroduced in object-oriented SQL queries where path expressions can include attributes that are NULL pointers. The object-oriented database world, as described in OHMS92 and LLOW91, and the C++ language environment, as described in Str87 and Ell90, do not support NULL values, and their predicates are therefore based upon a two-valued logic. However, both do have pointer types that can be NULL-valued, which leads to NULL-related problems for queries that involve navigation through potentially NULL pointers. Simply ignoring NULL pointers during path expression evaluation could cause the query engine for an object-oriented system (or any other two-valued logic environment having NULL pointers) to fail at run-time. To address this, both the ODMG-93 standard, as described in ADF.sup.+ 93, and ObjectStore, as described in LLOW91 and OHMS92, place the burden of avoiding this problem on the user. Users of these query facilities are expected to include NULL-testing predicates in their queries in order to prevent such cases. If a dereference of a NULL pointer does occur in a query, it is viewed as an error and a run-time exception is raised. This makes queries more complex to specify and introduces ordering dependencies among the query's (possibly many) NULL tests and path predicates.
Thus there is a need in the art for improved methods of handling NULL pointers in SQL queries over object-oriented data.