1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to the optimization of SQL queries in a relational database management system where a null value is to be matched with a null value.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
In RDBMS software all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host language, such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement. One of the most common SQL queries executed by RDBMS software is the SELECT statement. In the SQL standard, the SELECT statement generally has the format: “SELECT <clause> FROM <clause > WHERE <clause > GROUP BY <clause > HAVING <clause > ORDER BY <clause>.” The clauses generally must follow this sequence. Only the SELECT and FROM clauses are required and all other clauses are optional.
Generally, the result of a SELECT statement is a subset of data retrieved by the RDBMS software from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table. The WHERE clause determines which rows should be returned in the result table. Generally, the WHERE clause contains a search condition that must be satisfied by each row returned in the result table. The rows that meet the search condition form an intermediate set, which is then processed further according to specifications in the SELECT clause.
The search condition typically comprises one or more predicates, each of which specifies a comparison between two values from certain columns, constants or correlated values. Multiple predicates in the WHERE clause are typically connected by Boolean operators.
The SELECT statement may also include a grouping function indicated by the GROUP BY clause. The GROUP BY clause causes the rows in the intermediate result set to be grouped according to the values specified in the clause. A number of column or aggregate functions are also built into SQL, such as MAX (maximum value in column), MIN (minimum value in column), AVG (average of values in column), SUM (sum of values in column), and COUNT (number of rows).
Queries using aggregate expressions return as many result rows as there exist unique “result groups” in the source of the aggregation. A result group is defined as the set of values contained in the fields of a row corresponding to the list of columns or expressions specified in the GROUP BY clause of the query. The value of the aggregate function is the result of applying the function to the result of the contained expression for each row having the same result group.
Another operation permitted by SQL is the JOIN operation, which concatenates horizontally all or parts of two or more tables to create a new resulting table. The JOIN operation is implied by naming more than one table in the FROM clause of a SELECT statement.
An SQL query generally includes at least one predicate, which is an SQL expression that can assume a logical value of TRUE, FALSE, or UNKNOWN. A predicate typically either specifies a data range, tests for an existence condition, tests for equivalence, or performs a similar table comparison operation.
In RDBMSs, columns of any type can assume NULL (i.e., unknown) values. In RDBMS software, NULL values are properly handled using tri-value logic (i.e., TRUE, FALSE or UNKNOWN) for predicates, and SQL-based RDBMSs employ such logic.
Nested SQL statements may require tuple-by-tuple data manipulation in each subquery for evaluation of the complete statement. For example, each entry of a table column may need to be compared against each entry of a corresponding column in another table to determine if a SELECT operation should retrieve a table row. Such tuple-by-tuple operations are very inefficient and require simplification and optimization.
In an RDBMS system, users submit SQL queries. The optimizer in the RDBMS optimizes the SQL query and generates an efficient execution plan. Typically the optimizer adopts a cost-based approach wherein the optimizer generates many possible alternative execution plans, and selects the best/cheapest plan among those generated plans.
The present invention optimizes SQL queries where a NULL value is to be matched with another NULL value. Introduction of NULL values in SQL is an important concept and introduces many optimization challenges and difficulties. There has been substantial research that deals with the presence of null values, such as shown in the following publications.
W. Kim. “On Optimizing an SQL-like Nested Query.” ACM TODS, Sep. 7, 1982. In this paper the author attempted to decorrelate correlated subquery with aggregation by replacing quantified subqueries with joins. However, the algorithms did not account for null values known as “count” bug. The research paper highlights the difficulties of handling and optimizing queries in presence of nulls.                In U.S. Pat. No. 5,590,324 “Optimization of SQL queries using universal quantifiers, set intersection, and max/min aggregation in the presence of nullable columns” the authors presented methods of query optimization involving nullable columns.        
In one example, an INTERSECT-DISTINCT query can be converted into a join if at least one of the pair-wise columns (where corresponding columns from two tables are paired) is not nullable:                CREATE TABLE F1 (T1 INT NOT NULLABLE, T2 INT, T3 INT)        CREATE TABLE F2 (T1 INT, T2 INT NOT NULLABLE, T3 INT)        
The following query:
SELECT T1, T2 FROM F1                INTERSECT DISTINCT        SELECT T1, T2 FROM F2can be converted into the join query as follows:        SELECT DISTINCT F1.T1, F1.T2        FROM F1, F2        WHERE F1.T1=F2.T1 AND F1.T2=F2.T2        
In another example quantified subqueries (e.g., using universal quantifiers such as ALL) can be converted into a join with proper handling of null values and empty result from the subquery. Since nullability can be propagated from subqueries or derived tables, this nullability propagation can enable other query transformation and optimization.
Some other publications include the exploitation of situations when a predicate becomes FALSE in the presence of null values. One example is given by the following left outer join query:                SELECT F1.T1, F2.T2        FROM F1 LEFT JOIN F2        ON F1.T1=F2.T1        WHERE F2.T3>0        
By the definition of the left outer join, each row of table F1 will be matched against the table F2 using the join condition “F1.T1=F2.T1”. If there is a match, the matching pair will be returned. If there is no match, the F1 row will be returned together with null values in F2 columns. In the situation where the column field F2.T3 contains a null value, the predicate “F2.T3>0” is FALSE. In this case the padded null rows resulted from the application of the left outer join will be filtered out by the WHERE clause, and hence, the left outer join can be converted into an inner join:                SELECT F1.T1, F2.T2        FROM F1, F2        WHERE F2.T3>0 AND F1.T1=F2.T1        
According to the ANSI SQL standard definition of a NULL value, given ajoin predicate in Boolean factor:                F1.T1=F2.T1the predicate is evaluated as FALSE when either join column is NULL. However, apparently no query optimization and performance research has dealt with the situation where a NULL value is to be matched with another NULL value.        
The problem is best understood in the context of an example. The following tables and table content are created as:                CREATE TABLE T (T1 INT, T2 INT, T3 INT)        INSERT INTO T VALUES (NULL, 2, 3)        CREATE TABLE S (S1 INT, S2 INT, S3 INT)        INSERT INTO S VALUES (NULL, 4, 6)        
In this case both tables have a row where the first column has a NULL value. Using the above database schema, the following query is considered:
(Q1)SELECT*FROMT, SWHERET1=S1
According to the ANSI SQL semantics, the join query will not return any rows because the column T1 has a NULL value (and S1 column has a NULL value) and the comparison with a NULL value results in FALSE. However, there are times that users may want to match a null value with another null value. To achieve this semantics in SQL, users must write the following query:
(Q2)SELECT*FROMT, SWHERET1=S1 OR (T1 IS NULL AND S1 IS NULL)
This query will produce a row matching of a NULL value against a NULL value because the predicate “T1 IS NULL AND S1 IS NULL” is TRUE in the OR predicate. Thus, the row has the following values in its columns:
S1S2S3T1T2T3NULL23NULL46
However, the execution of the join predicate of this query is fairly complex, when it is given in the form:                T1=S1 OR (T1 IS NULL AND S1 IS NULL)        
Unlike an ordinary join predicate which has a simple form of “LHS=RHS”, the above-mentioned complex join predicate generally requires the use of the nested loop join method, because other join methods, such as hash join and sort-merge join, typically require the predicate in the form of a simple equality predicate (“T1=S1”) for the value matching, ordering and sorting. Therefore, such complex predicates restrict the DBMS optimizer because the number of alternative evaluation strategies is very limited. Typically, only the table scan and nested loop joins are the allowed operations.
Moreover, there are other examples which have a limited optimization success. One such example is a query that uses a complex correlation predicate instead of a join predicate:
(Q3)UPDATETSETT2 =(SELECTMAX(S2)FROMSWHERET1=S1 OR (T1 IS NULLAND S1 IS NULL))
In the query Q3, it is required to find the maximal value of column S2 for each value of T1, even if both T1 and S1 are NULL values, and to use this maximal value for updating the column T2. Using a conventional optimizer, the table T will be scanned, and for each row of the table T the column T1 value will be used to find a match in the table S, where the matching condition is “T1=S1 OR (T1 IS NULL AND S1 IS NULL)”. Like the join query Q2 shown above, the number of alternative access plans is often restricted due to the complexity of the correlation predicate. In this particular example, a table scan of table S has to be performed for every correlation value T1, resulting in a very poor performance.
There is further need in the art for optimization techniques of queries where a NULL value is to be matched with another NULL value.