1. Field of the Invention
This invention relates generally to systems for automatic query optimization and execution in relational database management systems (RDBMS) and, particularly to a producer-driven, shared-nothing, parallel execution RDBMS for restricting execution plans during query merger and optimization to eliminate all plans containing mutating table violations of database integrity.
2. Description of the Related Art
A relational database management system (RDBMS) is a computer database management system that uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables (formally denominated "relations") are typically stored for use on disk drives or similar mass data stores. A "table" includes a set of rows (formally denominated "tuples" or "records") spanning several columns (formally denominated "attributes"). Each column in a table includes "restrictions" on the data contents thereof and may be designated as a primary or foreign key. Reference is made to C. J. Date, An Introduction to Database Systems, 6th edition, Addison-Wesley Publishing Co. Reading, Mass. (1994) for an comprehensive general treatment of the relational database art.
A shared-nothing relational database system (SN-RDBMS) is a database system in which a table's data set is typically partitioned over a group of logical processing units (LPUs) each comprising a central processing unit and some logical unit of disk storage space. Each LPU in a LPU group is managed by a separate executing instance of the SN-RDBMS. A SN-RDBMS environment thus extends across plural LPU groups each comprising plural LPUs.
An RDBMS is structured to accept commands to store, retrieve and delete data using high-level query languages such as the Structured Query Language (SQL). The term "query" denominates a set of commands for retrieving data from a stored database. The SQL standard has been promulgated by the International Standards Association since 1986. Reference is made to the SQL-92 standard "Database Language SQL" published by the ANSI as ANSI X3.135-1992 and published by the ISO as ISO/IEC 9075:1992 for the official specification of the 1992 version of the Structured Query Language. Reference is also made to James R. Groff et al. (LAN Times Guide to SQL, Osborne McGraw-Hill, Berkeley, Calif., 1994) for a lucid treatment of SQL-92.
With a SN-RDBMS, a scan on a table T requires that the scan logic be distributed to all LPUs that hold a partition on T. When an application issues a "SELECT * from T" SQL statement, the statement is processed in the context of the LPU to which the application is connected, known as the coordinator. The scan logic is distributed to all the relevant LPUs, and each returns its LPU data to the coordinator. The coordinator in turn collects and returns the data to the application.
A producer-driven system is symbolized by a data-flow machine in which information flows from producers, which provide results when produced so that the results are available for any consumer requesting them, whether or not any consumer is ready for them. In producer-driven systems, results are buffered or queued. An SN-RDBMS is an instance of a producer-driven system in which results are queued in structures called "table queues" (TQS).
As used herein, a "query" refers to a set of user commands for retrieving data from the stored database. The query language requires the return of a particular data set in response to a particular query but the method of query execution ("Query Execution Plan") employed by the RDBMS is not specified by the query. There are typically many different useful execution plans for any particular query, each of which returns the required data set. For large databases, the execution plan selected by the RDBMS to execute a query must provide the required data return at a reasonable cost in time and hardware resources. Most RDBMSs include a query optimizer to translate queries into an efficiently executable plan. According to the above-cited Date reference, the overall optimization process includes four broad stages. These are (1) casting the user query into some internal representation, (2) converting to canonical form, (3) choosing prospective implementation procedures, and (4) generating executable plans and choosing the cheapest of said plans.
For example, FIG. 1 shows a query translation process known in the art. Queries written in SQL are processed in the phases shown, beginning with lexing at step 13, parsing and semantic checking at step 14, and conversion to an internal representation denoted the Query Graph Model (QGM) 15, which is a command data-structure that summarizes the semantic relationships of the query for use by the query translator and optimizer components. A query global semantics (QGS) process 17 adds constraints and triggers to QGM 15. A QGM optimization procedure 16 then rewrites the query into canonical form at the QGM level by iteratively "rewriting" one QGM into another semantically equivalent QGM 15. Reference is made to U.S. Pat. No. 5,367,675 issued to Cheng et al., entirely incorporated herein by this reference, for a discussion of a useful QGM rewrite technique that merges subqueries. Also, reference is made to U.S. Pat. No. 5,276,870 wherein Shan et al. describe a QGM optimization technique that introduces a "view" node function to the QGM to permit base table references to "VIEWs" by other nodes. This conditions the QGM to permit the execution plan optimizer 18 to treat a view like a table.
QGM optimization procedure 16 rewrites QGM 15 to simplify the subsequent plan optimization process 18, which produces Query Execution Plans (QEPs). Plan optimization procedure 18 generates alternative QEPs and uses the best QEP 20 based on estimated execution costs. The plan refinement procedure 22 transforms optimum QEP 20 by adding information necessary at run-time to make QEP 20 suitable for efficient execution. Importantly, the QGM optimization step 16 is separate and distinct from the QEP optimization in step 18. Reference is made to U.S. Pat. No. 5,345,585 issued to Iyer et al., entirely incorporated herein by this reference, for a discussion of a useful join optimization method suitable for use in QEP optimization step 18. Reference is made to U.S. Pat. No. 5,301,317 issued to Lohman et al., entirely incorporated herein by the reference, for a description of an adaptive QEP optimization procedure suitable for step 18.
QGM 15 used in the Query Rewrite step 16 can be understood with reference to Pirahesh et al. ("Extensible/Rule-Based Query Rewrite Optimization in Starburst", Proc. ACM-SIGMOD Intl. Conf. on Management of Data, San Diego, Calif., pp. 39-48, June 1992). A useful QGM known in the art is now described in detail. FIG. 2 provides a QGM graphical representation of the following SQL query:
SELECT DISTINCT Q1.PARTNO, Q1.DESCR, Q2.PRICE PA1 FROM INVENTORY Q1, QUOTATIONS Q2 PA1 INSERT INTO DEPARTMENT PA1 UPDATE EMPLOYEE SET EMPNO=EMPNO+1 PA1 DECLARE CURSOR A PA1 OPEN A PA1 DO PA1 DELETE FROM DEPARTMENT PA1 CREATE TRIGGER NEWORDER PA1 AS UPDATE SALESREPS PA1 UPDATE PRODUCTS
WHERE Q1.PARTNO=Q2.PARTNO PA2 SELECT * FROM SIMILARDEPARTMENT PA2 WHERE EMPNO=&lt;5 PA2 SELECT * FROM DEPARTMENT PA2 FETCH A PA2 DELETE FROM DEPARTMENT WHERE CURRENT OF A PA2 WHERE DEPTNO NOT IN (SELECT REPORTS FROM EMPLOYEE PA2 ON ORDERS PA2 FOR INSERT PA2 SET SALES=SALES+INSERTED.AMOUNT PA2 FROM SALESREPS. INSERTED PA2 SET QTY.sub.-- ON.sub.-- HAND=INSERTED.QTY PA2 FROM PRODUCTS. INSERTED
AND Q2.PRICE&gt;100 PA3 WHERE EMPLOYEE.REPORTS=DEPARTMENT.DEPTNO) PA3 WHERE SALESREPS.EMPL.sub.-- NUM=INSERTED.REP PA3 WHERE PRODUCTS.MFR.sub.-- ID=INSERTED.MFR PA4 AND PRODUCTS.PRODUCT.sub.-- ID=INSERTED.PRODUCT
A SELECT box 24 is shown with a body 26 and a head 28. Body 26 includes data-flow arcs 30 and 32, which are also shown as the internal vertices 34 and 36. Vertex 36 is a set-former that ranges on (reads from) the box 38, which provides records on arc 32. Similarly, vertex 34 ranges on box 40, which flows records on data-flow arc 30. The attributes to be retrieved from the query, PARTNO 46, DESC 48 and PRICE 50, are in head 28. Boxes 38 and 40 represent the base tables accessed by the query, INVENTORY 42 and QUOTATIONS 44, respectively. Box 24 embraces the operations to be performed on the query to identify the PARTNOs that match in the two base tables, as required by the join predicate 52 represented as an internal predicate edge joining vertices 34 and 36. Vertex 34 also includes a self-referencing predicate 54 to identify prices of those PARTNOs that exceed 100.
For the purposes of this invention, note that each box or node (formally denominated "quantifier node") in FIG. 2 is coupled to one or more other nodes by data-flow arcs (formally denominated "quantifier columns"). For instance, base table node 38 is coupled to select node 24 by data-flow arc 32 and base table node 40 is connected to select node 24 by data-flow arc 30. The activities inside select node 24 produce a new stream of data records that are coupled to the TOP node 56 along a data-flow arc 58. TOP node 56 represents the data output table requested by the query.
The object of several known QGM optimization procedures is to merge one or more nodes where possible by eliminating (collapsing) data-flow arcs. For instance, the above-cited Pirahesh et al. reference describes a set of rules for merging any number of nodes into a single SELECT node, with certain restrictions on non-existential or non-Boolean factor subqueries, set operators, aggregates and user-defined extension operators such as OUTER JOIN. Thus those skilled in the art know that QGM optimization step 16 usually rewrites the QGM to eliminate numerous nodes and data-flow arcs even before considering useful query execution plans in plan optimization step 18 (FIG. 1). Also, most execution plans usually pipeline data along the data-flow arcs without waiting to complete execution of a node before flowing data to the next node.
This situation is the basis of several well-known problems in the RDBMS art. These problems are known variously as database "integrity" problems, which arise from "violations" of "integrity rules" arising from column constraints, key constraints and the like. In the above-cited reference, Date uses the term "integrity" to refer to the accuracy or correctness of the data in the database. Related database processing problems include "mutating table violations" of constraints such as "relational integrity" constraints. Although mutating table violations may not affect database integrity directly, they can cause unpredictable variation in query results.
For instance, one or more columns in a base table may be assigned a "key attribute", such as "primary key" or "foreign key". The primary key of a base table uniquely identifies each record (row) in the table. A foreign key establishes a relationship between the base table and another parent table designated by the foreign key column attribute. Thus, a parent/child relationship may be established between two tables by adding a foreign-key column attribute to one and relating it to the primary-key column attribute in another, as is well-known in the art. Such relationships may be added without limit and each key attribute may span several columns.
Disadvantageously, there are four types of database updates that can corrupt the referential integrity of the parent/child relationships in a database. These include (a) inserting a new child record, (b) updating the foreign key in a child record, (c) deleting a parent record, and (d) updating the primary key in a parent record.
When a new row is inserted into the child table, its foreign key value must match one of the primary key values in the parent table, according to the column attribute. If the foreign key value does not match any primary key, inserting the row corrupts the database because there is now a child without a parent. Such an INSERT is typically rejected by the RDBMS. Inserting a row in a parent table never gives this problem because it simply becomes a parent without children. Updating the foreign key in a child record is a different form of this problem. If the foreign key is modified by an UPDATE statement, the new value must match some primary key value in the parent table to avoid "orphaning" the updated record.
Upon deletion of a parent record that has one or more children, the child records are orphaned because their foreign key values no longer match any existing primary key value in the parent table. Such violations either cause the DELETE operation to be rejected, CASCADE to delete the orphaned child, or reset the orphaned child foreign key column to NULL (or DEFAULT). Deleting a record from the child table is no problem because the parent merely loses a child. Updating the primary key in a parent record is a different form of this problem. If the primary key of a parent record is modified, all existing children of that record become orphans because their foreign key values no longer match an existing primary key value.
For each parent/child relationship created by a foreign key, SQL provides for an associated delete rule and an associated update rule. The delete rule specifies DBMS action when a user tries to delete a parent record. Available rules include RESTRICT, CASCADE, SET NULL, and SET DEFAULT. The update rule specifies one of these RDBMS actions when the user tries to update the value of one of the primary key columns in the parent table. The usual RESTRICT rule merely rejects the attempted operation. The CASCADE rule automatically deletes or updates records from the children responsive to a command to delete or update a parent record. As is well-known, a CASCADE update may propagate across many base tables, converting each into a "mutating table" at times during the referentially-triggered actions. The base table object of the DELETE may also be mutating.
Practitioners in the art have proposed several classes of solutions to this referential integrity problem. For instance, insertion of a child record may be rejected with an error message because a search of the parent table does not find the child record key. UPDATE of a primary key in a parent record may be rejected or cascaded to change all related child keys to the same value (or to a NULL or DEFAULT value). Such a requirement is herein denominated a "referential constraint". A table that is in the process of being modified responsive to an UPDATE, DELETE or INSERT statement or because of the propagating enforcement of a referential constraint is herein denominated a "mutating table." As used herein, any table that is in the process of modification by a particular relational operation is said to be a "mutating table." In general, user-specified SQL operations that attempt to read or modify a mutating table are rejected as invalid at run-time. Proper semantics require the completion of a table-mutating operation before any other operation is allowed to read or modify the table.
Until now, other than "version-logging" of a mutating database, user query restrictions were the only methods known in the art for preventing mutating table violations of integrity related to referential constraints and triggers. Restrictions on the types of legal queries generally operate by raising a compile-time or a run-time error return responsive to either a potential or actual mutating table violation. In general, the user query restrictions known in the art may be grouped as (a) restrictions on queries, (b) restrictions on the number of records modified by query, and (c) restrictions on row-level triggers.
Reference is made to the above-cited Groff et al. reference at pp. 266-285 for a detailed discussion of SQL provisions for referential integrity and triggered actions. Reference is also made to Kevin Owens et al. ("Oracle7 Triggers: The Challenge of Mutating Tables", Database Programming and Design, 7(10): 47-55, October 1994) for a detailed description of SQL trigger applications and the resulting mutating table violations of integrity. Owens et al. conclude that row-level and statement-level triggers are advantageous tools but row-level triggers can cause mutating table errors if the trigger performs a read from the same table or uses a column from the same table in a join. They recommend avoiding the problem by loading the table at the row-level and operating on it at the after-statement level to ensure completion of row-level updates before reading.
FIGS. 3A-3G show several examples of mutating table integrity problems known in the art. FIG. 3A shows a self-referencing DEPARTMENT base table and FIG. 3B shows a SIMILARDEPARTMENT base table that is identical to the DEPARTMENT table. The mutating table integrity violation occurs when attempting to execute the following query:
The result of the execution of this query depends on the sequence in which the records (rows) of SIMILAR DEPARTMENT are accessed. If row 60 (FIG. 3B) is first read, the insert to DEPARTMENT fails because the ADNO column value of 1 does not exist in the primary key column DEPTNO of DEPARTMENT. If the row 62 is first read, the insert succeeds because the NULL value of the ADNO column does not violate the parent/child relationship between the DEPTNO and ADMINDNO columns. After insertion of row 62, a later attempt to insert row 60 succeeds because the DEPTNO primary key values now include the 1 from the earlier-inserted row 62.
It is appreciated by practitioners skilled in the RDBMS art that a query execution must succeed independent of the sequence of any possible execution plan. Also, the operation of the QGM Rewrite step 16 and QEP=Optimize step 18 (FIG. 1) together prevent any preconditioning of the execution sequence of the optimal QEP produced by the RDBMS. Accordingly, because record access sequence may not be predetermined, if a self-referencing table is the object of an INSERT statement with a subquery, the RDBMS must require that the subquery not return more than one row or that the INSERT statement be rejected. In practice, this violation is not detected until run-time, after completion of all query optimization activities. To avoid processing errors, the user is obliged in the art to "code around" this restriction to obtain the desired function.
The mutating table integrity violation associated with an attempt to update the primary key value of a parent record is exemplified by FIG. 3C in connection with the following SQL query:
Because EMPNO is the primary key of the EMPLOYEE table, the result of this query depends on the sequence in which the records are read from the EMPLOYEE base table. If the rows are accessed in primary key order, the update fails because the first attempt to update record 001 to 002 duplicates the primary key of record 002, thereby violating the primary key column restriction. If the table is accessed in reverse order, the update succeeds because key 005 is replaced with key 006, key 004 is replaced with key 005 subsequently, and so forth, thereby avoiding any primary key column restriction violations. Thus, when the columns of a primary key are updated, the update statement must be restricted by the predicate to no more than one record. This violation is also usually undetected until run-time, after completion of query optimization. This problem also arises in a cursor-controlled update ("positioned" UPDATE) on any of the primary key columns or on columns of a VIEW derived from a primary key. That is, the RDBMS must prohibit the above query in an updatable cursor because the number of affected rows in a cursor operation cannot be predeterminately restricted to one row.
FIG. 3D illustrates a potential integrity violation in a self-referencing table subject to a DELETE WHERE CURRENT OF operation (a "positioned" DELETE). The foreign key column ADMINDNO is constrained with a CASCADE delete rule in this example query:
The number of rows retrieved by this query depends on the sequence in which the rows are retrieved. If row 64 is first retrieved, the delete of primary key 1 cascades to row 66 and no more rows remain to be retrieved. If row 66 is first retrieved, the delete of primary key 2 does not cascade and row 64 is next retrieved.
A DELETE statement that has a subquery referencing a table that can be affected by the deletion of rows from the target table violates database integrity during table mutation and is accordingly prohibited in most RDBMSs. FIG. 3E illustrates an example of the potential problem arising from a DELETE with a subquery. For this example, another relationship has been set up between EMPLOYEE and DEPARTMENT. This relationship shows to which department the employee reports, which may differ from the one in which the employee works. FIG. 3F shows the records in DEPARTMENT and FIG. 3G shows the records in EMPLOYEE. This problem arises upon attempted execution of the SQL query:
The SQL query is attempting to delete any department having no employee reporting to it. The result depends on the sequence in which the DEPARTMENT records are read. If department A is read first, the NOT IN clause is true so department A is deleted. This cascades from FIG. 3F to REPORTS in FIG. 3G, deleting EMPNOs 1 and 2. When DEPTNO B is read, the NOT IN clause is again true because EMPNOs 1 and 2 were deleted. Therefore, DEPTNO B is now deleted, which eliminates EMPNO 3 from FIG. 3G. Finally, DEPTNO C is read and deleted. If DEPTNO B is first accessed, the NOT IN clause is false and department B is not deleted. Similarly, department C avoids deletion. Accordingly, SQL queries must avoid multi-row deletions by subquery and instead delete one record at a time using the primary key as a predicate, which is an inconvenient user restraint.
The concept of a "trigger" is well-known in the art, and triggers are explicitly included in the SQL-93 standard promulgated by the ISO. For any event that causes a change in contents of a table, a user may specify an associated action that the RDBMS must execute. The three events that can "trigger" an action are attempts to INSERT, DELETE or UPDATE records in the table. The action triggered by an event may be specified by a sequence of SQL statements. Reference is made to the above-cited Owens et al. reference and the above-cited Groff et al. reference for detailed examples of row-level and statement-level triggers. Consider the following example.
When a new order is added to the ORDERS table, the SALES column for the salesperson who took the order should be increased by the amount of the order and the QTY.sub.-- ON.sub.-- HAND amount for the product being ordered should be decreased by the quantity ordered. This requirement is known in the art as a "business rule". A SQL trigger, named NEWORDER, can be included in a RDBMS to automatically update these database tables:
The first part of the trigger definition conditions trigger invocation on the attempt of an INSERT on the ORDERS table. The remainder of the definition, following the keyword AS, defines the action of the trigger. In this example, the action is a sequence of two UPDATE statements, one for the SALESREPS table and one for the PRODUCTS table. The INSERTED pseudo-table refers within the UPDATE statements to the record to be inserted by the trigger. This particular example is a "for each statement" trigger that does not create a mutating table violation. As this example demonstrates, there are several non-standard versions of the SQL known in the art that extend the SQL-92 standard substantially to support triggers.
Triggers also provide a useful method for implementing the referential integrity requirements imposed on tables by foreign keys and primary keys and can implement extended forms of referential integrity. For instance, triggers may provide a CASCADE UPDATE feature in a SQL system that does not implement the CASCADE rule. Although the declarative referential integrity and table constraint language is preferred, triggers can be used to implement the transitional "business" rules that cannot be enforced with declarative statements.
In the above-referenced '576 patent, we disclosed a RDBMS which solved the foregoing problems by providing a system and method for optimizing the execution plan of a query in a manner that enforces database integrity. The concept of a data dam was introduced, whereby database integrity is enforced by reforming a query graph to restrict data flow paths between common referencing pairs of quantifier nodes so as to sequence the execution of such nodes. In a SN-RDBMS the foregoing system and method may not prevent all database integrity violations. A SN-RDBMS runs in a parallel processing environment in which queries are divided into distributed units of execution (known as distributed subsections or DSSs). Unlike serial processing databases, where query execution can be ordered (using subqueries) in a demand driven "top-down" fashion, a SN-RDBMS executes queries in a "bottom-up" fashion by sending DSSs (which may be subqueries of some containing query) out to the LPUs for processing. This distribution occurs at the commencement of query execution, i.e., the DSSs are processed in parallel. Lower level subqueries, such as reads, may therefore be processed prior to the execution of some other subquery, such as a write, even though a data dam has been placed to prevent premature execution of the lover level subqueries. In a SN-RDBMS, therefore, merely placing a data dam does not always guarantee that common referencing pairs of quantifier nodes will be executed in the desired order.
Accordingly, there is a clearly-felt need in the art for an automated process in a SN-RDBMS that enforces database integrity during table mutation in a query optimizer with full implementation of row-level and statement-level triggers. There is also a need in the art for an automated mutating table integrity enforcement process in a SN-RDBMS that eliminates run-time errors without SQL user query prohibitions. Modern query execution systems depend heavily on effective query optimization to reduce execution time and resource consumption and database integrity violation detection at run-time is too late for correcting the QEP to remove database integrity violation problems. There is in the art an unmet need for a query processing system in a SN-RDBMS that automatically restricts query execution plans to those plans having no mutating table integrity violations to avoid QEP failure at run-time. These unresolved problems and deficiencies are clearly felt in the art and are solved by this invention in the manner described below.