The present invention relates to computer database systems and more particularly to processing queries that define an outer join operation on a view, where the view is produced by joining two database objects.
Relational databases store information in collections of tables, in which each table is organized into rows and columns. FIG. 4(a) illustrates an exemplary database containing two tables, a department table xe2x80x9cDEPTxe2x80x9d 400 and a employee table xe2x80x9cEMPxe2x80x9d 410, useful for recording and organizing information about a company. The columns of the department table 400 hold attributes for the different departments of the company, including a department number xe2x80x9cDEPTNOxe2x80x9d 402, a department name xe2x80x9cDNAMExe2x80x9d 404, and a location xe2x80x9cLOCATIONxe2x80x9d 406. Each department is stored in a row. For example, row 408 is an entry for department 11, which is an accounting department and located in Washington, D.C. The employee table 410 holds information in columns for each employee in the company. Such information may include, for example, an employee""s name xe2x80x9cENAMExe2x80x9d 412, job title xe2x80x9cJOBNAMExe2x80x9d 414, and department xe2x80x9cDEPTNOxe2x80x9d 416.
A database user retrieves information from the tables of a relational database by entering input that is converted to queries by a database application, which submits the queries to a database server. In response to receiving a query, the database server accesses the tables specified in the query to determine which information within the tables satisfies the query. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately presented to the user. Database servers are also capable of combining or xe2x80x9caggregatingxe2x80x9d information contained in the tables in response to a query. For example, one query for the exemplary database is to list all the departments in Washington, D.C., and the departments"" employees, if any, with their job titles.
For any given database application, the queries must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify information to be retrieved. For example, in the Structured Query Language (SQL), the following query in the ANSI SQL syntax requests the retrieval all the departments in Washington, D.C., and the departments"" employees, if any, with their job titles:
[QUERY 1]
select DNAME, ENAME, JOBNAME
from DEPT left outer join EMP
on DEPT.DEPTNO=EMP.DEPTNO and DEPT.LOCATION=xe2x80x98DCxe2x80x99;
This query performs an xe2x80x9couter joinxe2x80x9d operation on the department table 400 and the employee table 402. An outer join is a species of a xe2x80x9cjoinxe2x80x9d operation that combines rows from two or more relational database objects, such as tables, views, or snapshots. A join is performed whenever multiple tables appear in the FROM clause of query. The SELECT list of the query can reference any of the columns from any of the base objects listed in the FROM clause.
Most join queries contain a WHERE or ON clause that contains a predicate that compares two columns, each from a different joined object. Such predicates are referred to join conditions. In addition to the join conditions, the WHERE or ON clause can also contain other conditions that refer to columns of only one of the joined objects for further restricting the rows returned by the join operation. To process a join, a database server combines pairs of rows that satisfy the join conditions and the other predicates. Rows that combine in a manner that satisfy the join conditions and other predicates are referred to herein as combining rows.
For an xe2x80x9cinner join,xe2x80x9d only rows formed by combining rows are reflected in the result. In contrast, an outer join returns all combining rows plus all rows from one of the joined objects for which no rows from the other joined object satisfy the join conditions (i.e. the non-combining rows). A xe2x80x9cleft outer joinxe2x80x9d is a outer join in which the noncombining rows from the left listed object in the FROM clause are brought into the result; a xe2x80x9cright outer joinxe2x80x9d is an outer join in which the non-combining rows from the right listed object in the FROM clause are brought into the result; and a xe2x80x9cfull outer joinxe2x80x9d is an outer join in which the non-combining rows from both the right and left listed objects in the FROM clause are brought into the result.
As illustrated in FIG. 4(b), the results 420 of processing QUERY 1 includes one or more rows for each of the departments in Washington, D.C. As in any join operation, the outer join results include the rows that satisfy the join condition of correlated department numbers (rows 424 and 426). For the left outer join operation, another row 422 of the results 420 is derived from the left object, department table 400, because there is no corresponding row in the employee table 410 that matches the join condition for row 408 of the department table 400.
Techniques exist for efficiently processing an outer join on two tables. For example, a database server may recognize that the employee table 410 has an index built on the xe2x80x9cDEPTNOxe2x80x9d column 416. A database index is conceptually similar to a normal index found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied with the location of the information. Specifically, a database index contains ordered values from one or more columns of a table and a list of which rows in the table contain those values. Indexes can provide performance benefits over full table scans if the predicates are selective, because the index renders it no longer necessary to scan the entire table to find the rows that contain particular column values.
The performance benefits of an index are particularly apparent when a predicate in the WHERE clause of QUERY 1 causes only a few rows to be selected from the employee table 410. In the example, a WHERE predicate specifying that the location is xe2x80x98DCxe2x80x99 is selective, because it causes only the department numbers of 11 and 31 to be looked up in the employee table 410. On the other hand, if the predicates are nonselective, full table scans are preferred over index scans because full table scan can exploit the efficiencies of block input/output operations to read data from the table.
Over time, it is not uncommon for database administrator to reorganize their tables as conditions warrant. For example, after the employee table 410 was created, which explicitly included the job title in column xe2x80x9cJOBxe2x80x9d 414, a database administrator may desire to reorganize the employee table 410. Referring to FIG. 6(c) , the database administrator may wish to create a new job table 440 to list the job title xe2x80x9cJOBNAMExe2x80x9d 444 and pay grade xe2x80x9cPAYGRADExe2x80x9d 446 of each job in the company. Jobs in the job table 440 are identified by a job number xe2x80x9cJOBNOxe2x80x9d 442. The employee table 410 is modified to produce a new employee table 430 that includes employee name xe2x80x9cENAMExe2x80x9d 432 and department number xe2x80x9cDEPTNOxe2x80x9d 436 columns as before, but with job number xe2x80x9cJOBNOxe2x80x9d column 434 as foreign key value to reference the appropriate entry in job table 440. Creation of the job table 440 is useful for pooling all the appropriate information about the company""s job into a database object, without denornalization of the other tables.
An upward compatibility issue with this proposed database reorganization is that there may be many queries for many database applications that still reference the xe2x80x9cJOBNAMExe2x80x9d column 414, that was moved into the new job table 440. For some database environments, updating all the queries may require many staff-years to complete. Accordingly, database administrators prefer to create a xe2x80x9cviewxe2x80x9d that looks to the database applications like the old employee table 420 but is really formed as a join of the new employee table 430 and the new job table 440.
A view is a logical table, and as logical tables, views can be queried just as if they were tables. The data that views actually present, however, is extracted or derived from other database objects, which may in fact be tables, other views, or snapshots. A view is defined by metadata referred to as a view definition, which is typically in the form of a database query. For example, a view definition to create a view xe2x80x9cEMPxe2x80x9d that looks like the old employee table xe2x80x9cEMPxe2x80x9d 420 and based on the new employee table xe2x80x9cEMPxe2x80x9d 430 and the new job table xe2x80x9cJOBxe2x80x9d 440 may be issued as follows:
[QUERY 2]
create view EMP as
select ENAME, JOBNAME, DEPTNO
from NEWEMP, JOB
where NEWEMP.JOBNO=JOB.JOBNO;
Since QUERY 2 defines a view xe2x80x9cEMPxe2x80x9d, QUERY 1 is said to contain an xe2x80x9couter joined viewxe2x80x9d because QUERY 1 contains a view xe2x80x9cEMPxe2x80x9d that is subjected to an outer join operation. In QUERY 2, the exemplary view is a join of two tables, although the view may conceivably join other database objects, such as other views or snapshots.
A disadvantage with outer joined views is that conventional database servers inefficiently process outer joined views. Since the outer join operation is not commutative, the database server cannot freely re-order the joins and. outer join during join planning. This restriction may limit the database server from finding an efficient join plan. For example, in QUERY 1 used with the view of QUERY 2, the database server cannot first outer join the NEWEMP table 430 to the DEPT table 400 and then inner join the result to the JOB table 440, because the result is not semantically equivalent. In the example, the inner join, which was performed second, causes row 422 of the correct result to be dropped, because a null row from the NEWEMP table 430 does not satisfy the inner join conditions.
Thus, conventional database servers evaluate the view first, i.e., inner join the NEWEMP table 430 and the JOB table 440. When the join conditions of the view are non-selective, as typical in this example, evaluating the view first causes a full table scan of both tables. In contrast with the original outer join on tables, the condition in the outer join (e.g. LOCATION=xe2x80x98DCxe2x80x99) does not reduce a portion of the NEWEMP table 430 that participates in the overall query, and any index on the DEPTNO column 436 of the NEWEMP table 430 cannot be used during or after the view evaluation. This access path is clearly less efficient than the original outer join query. When the view is very large, this access path is significantly less efficient.
Accordingly, there is a need for efficiently processing queries with outer joined views. A need also exists for a processing outer join queries on a view that allows index scans to be performed when more efficient than performing full table scans. These and other needs are addressed by a query transformation that pushes an outer join predicate into the view. When evaluating the view, the pushed outer join predicate can be considered during join planning and, if sufficiently selective, be used to reduce the portion of the tables joined in the view for scanning and avoid a full table scan.
One aspect of the invention is a computer-implemented method and computer-readable medium bearing instructions for processing a query in a database system. The methodology includes receiving an original query that contains an outer join operation of a view formed of a database object; transforming the original query into a transformed query by transferring a predicate for the outer join operation into the view; and then processing the transformed query.
Preferably, costs of the query before and after the query transformation are estimated to determine whether the transformation was appropriate for a particular query. Accordingly, another aspect of the invention is a computer-implemented method and computer-readable medium bearing instructions for processing a query in a database system. The methodology includes receiving an original query that contains an outer join operation of a view, where the view is formed of a join of a database object. A first cost for processing the original query is estimated. The original query is transformed into a transformed query by transferring a predicate for the outer join operation into the join for the view. A second cost estimate is made for processing the transformed query. The first and second costs are compared. If the first cost is less than the second cost, then the original query is processed; but if the second cost is less than the first cost, then the transformed query is processed.
Still other needs addressed and advantages attained by the present invention will become readily apparent from the following detailed description, simply by way of illustration of the best mode contemplated of carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.