1. Technical Field
The present invention relates generally to database management systems, and more particularly to query optimization in a database system. The present invention specifically relates to un-nesting nested database queries by distinguishing between joining and anti-joining tuples during an outer join operation.
2. Background of the Invention
A database management system is a computer system that provides for the storage and retrieval of information about a subject domain. Typical examples are airline reservation systems, payroll systems, and inventory systems. The database management system includes an organized collection of data about the subject domain, and a data-manipulation language for querying and altering the data. Typically, the data are organized as "tuples" of respective values for predefined attributes.
In a so-called "relational" database management system, the tuples of data are stored in a plurality of tables, each of which corresponds to a set of tuples for common predefined attributes. Each table corresponds to a single "relation". Each attribute of the relation corresponds to a column of the table, and each tuple of values corresponds to a particular row of the table. Each value in the table corresponds to a particular row or tuple and a particular column or attribute. A collection of related tables or relations in the database is known as a schema.
The data manipulation language for a relational database management system typically specifies operations upon one or more relations to create a new relation. A "restriction" operation forms a subset of the rows in a table by applying a condition or "predicate" to the values in each row. A "projection" operation removes columns from the rows by forming a stream of rows with only specified columns. A "join" operation combines data from a first table with data from a second table, typically by applying a predicate comparing the values in a column of the first table to the values in a related column of the second table. Usually such a join predicate is an "equi-join" predicate in which a first function of the attribute corresponding to the column of the first table must equal a second function of the attribute corresponding to the column of the second table.
Yet another operation upon a relation is known as an aggregation, in which a new column of values is generated by combining or aggregating values from all of the rows, or specified groups of rows. Aggregate functions include, for example, a count of the rows in each group, or the sum of the values of a specified attribute for all rows in each group. Typically, the rows are grouped for aggregation such that the rows in each group have equal values for a specified column or attribute. Therefore, a new aggregate value is generated corresponding to each distinct value of the specified attribute.
The query language for a relational database system typically defines a syntax for specifying a "query block" including a list of relations to be accessed, a predicate to govern restriction or join operations, a list of attributes to specify a projection operation, and a list of aggregate functions. If at least one attribute is specified and at least one aggregate function is also specified, then a meaningful result usually would require the aggregate to be grouped for each distinct value of the specified attribute. Although such a grouping could be presumed, the query language may permit or require grouping in this situation to be specified by a list of grouping attributes. In this case, the rows are grouped prior to aggregation such that each group corresponds to each distinct combination of values for all of the specified grouping attributes.
Some query languages permit query blocks to be nested such that the predicate of an "outer" query block includes reference to the result of an "inner" query block. In this case, the query language specifies a result that would be obtained by re-evaluating the inner query block each time the outer query block evaluates the predicate for a different row or combination of rows when performing its specified restriction or join operations. Evaluating the result in such a fashion, however, usually is very inefficient.
To select the most efficient of alternative ways of evaluating a query, the database management system typically includes a query optimizer. In general, a query can be evaluated a number of ways, because, in many cases, the query operations obey certain commutative, associative, or distributive laws. By applying these laws, the query optimizer may formulate alternative orders of performing the query operations, compute a cost of performing each such "query plan", and select the least costly query plan for execution.
Unfortunately, if a query is specified by nested query blocks, the nesting itself specifies an iterative order of performing query operations. Since the query operations of an inner block are specified to occur during the restriction or join operation of the outer query block, the query optimizer cannot apply the laws permitting alternative execution orders unless the nested query blocks are first "un-nested" into equivalent "pipelined" query blocks.
When query blocks are pipelined, the result of a first query block is specified as input to the predicate of a second query block, but that result is presumed to be evaluated only once before evaluation of the second query block. Therefore, there is a distinct order of performance specified between the first query block and the second query block. In many cases the un-nesting process generates un-nested query blocks in which this specified order is more efficient that the iterative method of executing the original nested query blocks. In other cases, the optimizer can find an even better order of execution for evaluating the un-nested query blocks.
The problem of un-nesting and optimizing database queries has been studied extensively. A general solution was proposed in Won Kim, "On Optimizing an SQL-like Nested Query", ACM Transactions on Database Systems, Vol. 9, No. 3, American Association for Computing Machinery, United States (1982), incorporated herein by reference. Later, it was discovered that the un-nesting technique of Kim does not always yield the correct results for nested queries that have non equi-join correlation predicates or that have a "COUNT" aggregate between the nested blocks. Un-nesting solutions for these anomalous cases were described in Richard A. Ganski and Harry K. T. Wong, "Optimization of Nested SQL Queries Revisited", Proceedings of the Sigmod Conference, American Association for Computing Machinery, United States (1987), pp. 23-33, incorporated herein by reference. Ganski's method of un-nesting was extended to multiple nested blocks as disclosed in Umeshwar Dayal, "0f Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates and Quantifiers", Proceedings of the 13 VLDB Conference, Brighton, 1987, pp. 197-208, incorporated herein by reference. Methods of un-nesting multiple blocks within the same block are disclosed in M. Muralikrishna, "Optimization and Dataflow Algorithms for Nested Tree Queries", Proc. VLDB Conf. (August 1989), pp. 77-85, incorporated herein by reference. The commuting of joins and outer joins under specific conditions is disclosed in Arnon Rosenthal and Cesar Galindo-Legaria, "Query Graphs, Implementing Trees, and Freely-Reorderable Outerjoins", Proc. SIGMOD Conf (May 1990), pp. 291- 299, incorporated herein by reference.