A database management system (DBMS), as used herein, is the combination of an appropriate computer, including direct access storage devices (DASD) or disk drives, and database management software. A relational DBMS (RDBMS) is a DBMS which uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables is typically stored on disk drives or DASD for semi-permanent storage. A tattle is composed of rows which are formally called tuples. Each tuple has one or more columns. A tuple expresses a mathematical relation. A very of a table can be a set of its columns or the tuple identifier (TID) of its tuples.
The DBMS is structured to accept commands to store, retrieve, and delete data. One widely used and well known set of commands is called the Structured Query Language (SQL). The term query is used to refer-to a set of commands to retrieve data from the DBMS. The definitions for SQL provide that a DBMS should respond to a particular query with a particular set of data given a specified database content, but the method that the DBMS uses to actually find the required information in the tables on the disk drives is left up to the DBMS. Typically there will be more than one method that can be used by the DBMS to access the required data. Because the tables can be very large, as in the case of a bank's accounts database for example, it is important to optimize the method used to find the data requested in a query in order to minimize the computer time used and, therefore, the cost of doing the query.
A simple example of a query is:
______________________________________ SELECT COLUMNX FROM TABLE1 WHERE COLUMNY = `London` ______________________________________
This query requests that the DBMS retrieve all COLUMNX fields from tuples in TABLE 1 which have COLUMNY equal to `London`. In practical applications the query can become very complex. Multiple tables and multiple columns can be referenced. In order to distinguish which column of which table is being referenced COLUMNX of TABLE 1 may be written as TABLE 1.COLUMNX.
One of the most powerful features of SQL is the capability of nesting SQL query expressions within the predicate in the WHERE clause. Nested SQL queries are called subqueries. With subqueries, one can compare the column expression of a query to the column expression of another query. One can also compare column expressions with subqueries whose result is a table, either by testing set membership, testing if ANY row if the table expression has a property, or testing if ALL do. It is often quite natural to formulate a query using subquery in the predicate For example, to find all the employees who earn less than their manager, one may write
______________________________________ EXAMPLE 1: Relation: EMP (NAME, EMPNO, MGRNO, SALARY) SELECT * FROM EMP EMPLOYEE WHERE SALARY &gt; (SELECT SALARY FROM EMP MANAGER WHERE MANAGER.EMPNO= EMPLOYEE.MGRNO) ______________________________________
Subqueries are often used in SQL update, delete and insert statements. For example, to delete all the accounts from the MASTER table which have `D` entries in the REVISION table, one may, write:
______________________________________ EXAMPLE 2: Relations: MASTER (ACCNO, AMOUNT, . . . ) REVISION (ACCNO, TYPE, . . . ) DELETE FROM MASTER WHERE ACCNO IS IN (SELECT ACCNO FROM REVISION WHERE TYPE=`D`) ______________________________________
Views, as defined in Database Language SQL2 (ISO/ANSI Working Draft: Database Language SQL2 and SQL3:X3 H2/90/398 ISO/IEC JTC1/SC21/WG3, 1990, are SQL structures which provide a means for naming queries. Views can be used to make applications more independent from the names and format of the underlying objects (columns and tables). Authorization is associated with views so that views cart be used as a protection mechanism to restrict access to a table horizontally or vertically via projection or restriction. Views may also make use of joins and other SQL capabilities. In view transformations it is usually more efficient to merge a view with the query using it. This is sometimes called composition, as opposed to materialization. Subqueries are analogous to views which also can be used as subparts of other queries.
Table expressions, as defined in Database Language SQL2, specify tables. That is, table expressions are expressions whose values are tables. For example, views are table expression. Table expressions cart be specified using the SQL "FROM" clause wherein the derived table and columns can be optionally named. It is desirable to use a table expression instead of materializing a temporary table/view, which one has to create, populate with the right tuples, and then drop.
There is a need when processing queries for selecting a method of processing the subquery which will give an optimized performance. A database user needs to be able to retrieve data if a reasonable period of time. When database are large, the time taken to retrieve data cart be critical to the operability of the database. There are many ways in which a query can be processed so that it can be carried out in a minimal amount of time. The order in which the query is processed and data retrieved effects the overall time for retrieving the data. Some areas of query processing optimization such as the joins which are used to merge tables for retrieving the requested data have been separately researched for optimal implementation.
In Example 2, it is necessary to know which tuples in the MASTER table have been marked for deletion in the REVISION table. A commonly used method of processing queries containing a subquery is to evaluate the subquery for each row in the outer query. In Example 2 this approach would result in evaluating the subquery for every row in the MASTER table which could be quite large.
A better approach known to those in skilled in the art is implemented in the article "System R" (M. Astrahah et al, "System R: Relational Approach to Database Management" ACM Transactions on Database Systems, June 1986, 97-137. The System R article provides that where a subquery is non-correlated, i.e., not defined from one column to another column, the subquery is evaluated once and the result is saved for comparison to the outer query. (see P. Selinger, et al., in "Access Path Selection in a Relational Database Management System", June 1979, Procs. ACM SIGMOD). However, the System R processing scheme is inefficient when either:
1. Many rows in the outer table have the same value for the columns which are compared to the columns of the inner table of the subquery. In Example 1, a manager may have many employees, so many rows in EMP have the same MGRNO. In this case, the manager's salary is retrieved for every employee row. Many redundant evaluations of the subquery are performed due to inefficient duplicate processing. PA1 OR PA1 2. The outer table is very large, the inner table of the subquery after restriction is very small, and only a few rows in the outer table satisfy the subquery. In Example 2, the MASTER table is very large but only a few rows are to be deleted. Scanning every row in a large table in order to delete a few rows that satisfy the subquery is an inefficient way of subquery processing. PA1 A Different join sequence can be considered (solving the Example 2 problem), PA1 Different join algorithms can be employed (solving the Example 1 problem), PA1 For the inner table, synchronous I/O is avoided and set processing is permitted. PA1 A new join column (which may be sargable and indexable) is added. PA1 Transitive closure can be applied to the new added join column. PA1 The subquery table can be accessed directly in the join. PA1 Correlation is eliminated.
The query processing method of evaluating subquery for each row of a query is very inefficient since it entails row-at-a-time; operation. Efficient sequential access of the inner table of the subquery cannot be exploited. Options for parallel processing in the subquery are extremely limited. A more efficient way of processing the subquery is to merge the subquery with the outer query so that it is handled as a join. By transforming the subquery into a join:
As can be appreciated from the foregoing, it is desirable to transform an occurence of a subquery in a query into a join operation. Currently available commercial relational database products apply subquery to join transformation in limited situations; such as, when a unique index exists on the selected columns in the subquery. Current optimizers do not perform subquery transformation for UPDATE/DELETE statements and the cursor SELECT statements that would be updated or deleted via WHERE CURRENT OF statements. Furthermore the known query optimizers do not perform any optimization for scalar subqueries with simple SELECTs.
R. Ganski, and E. Wong, "Optimization of Nested SQL Queries Revisited", 1987, Procs. of ACM SIGMOD Conference; W. Kim, "On Optimizing an SQL-like Nested Query", September 1982, vol 7:3, ACM Transactions on Database Systems; and O. Anfindsen, "A Study of Access Path Selection in DB2" Norwegian Telecommunications Administration and University or Oslo, Norway, all disclose optimizers which are limited in terms of the amount of merging and the subquery to join transformations that are to be performed. Therefore these optimizers do not minimize as thoroughly as possible the time taken to retrieve data from a relational database.
There are significant occurences of queries, where optimization could be performed for the queries, subqueries, views and table expressions, but, the current computer based query optimizers are not optimizing these queries.
Databases can be quite large and it is desirable that the time taken to retrieve data from the database be as short as possible. Therefore, there is a need to be able to identify when merges of subqueries can be done without compromising the integrity of the result in terms of the duplicates that are generated or eliminated. There is likewise a need to perform merges as much as possible, so that the query can be processed more efficiently. An optimization process which performs merges as much as possible has the result of improving the speed by which data is retrieved for the database when requested by a database management system user by means of a query.