1. Field of the Invention
This invention relates generally to systems for automatic query optimization and execution in parallel relational database management systems and particularly to a system for efficiently executing complex queries across a plurality of nodes of a distributed relational database management system while maintaining database integrity.
2. Description of the Related Art
A relational database management system (RDBMS) is a computer-implemented database management system that uses relational techniques for storing and retrieving data. Relational databases are computerized information storage and retrieval systems in which data in the form of tables (formally denominated "relations") are typically stored for use on disk drives or similar mass data stores. A "table" includes a set of rows (formally denominated "tuples" or "records") spanning several columns. Each column in a table includes "restrictions" on the data contents thereof and may be designated as a primary or foreign key. Reference is made to C. J. Date, An Introduction to Database Systems, 6th edition, Addison-Wesley Publishing Co. Reading, Mass. (1994) for a comprehensive general treatment of the relational database art.
A RDBMS is structured to accept statements to store, retrieve and delete data using high-level query languages such as the Structured Query Language (SQL). The term "query" denominates a set of statements for retrieving data from a stored database. The SQL standard has been promulgated by the International Organization for Standardization (ISO) since 1986. Reference is made to the SQL-92 standard "Database Language SQL" published by ANSI as ANSI X3.135-1992 and by ISO as ISO/IEC 9075:1992 for the official specification of the 1992 version of the Structured Query Language. Reference is also made to James R. Groff et al. (LAN Times Guide to SQL, Osborne McGraw-Hill, Berkeley, Calif., 1994) for a lucid treatment of SQL-92, and to Don Chamberlin, Using the New DB2, Morgan Kaufmann Publishers, San Francisco, 1996, for a discussion of DB2, which implements SQL. Finally, reference is also made to several IBM technical publications, which include definitions for some of the terminology used herein:
IBM Systems Journal, Vol. 34, No. 2, 1995.
DB2 AIX/6000 Admin. Guide, SC09-1571.
DB2 Parallel Edition for AIX Admin. Guide and Reference, SC09-1982.
DB2 Parallel Edition for AIX Performance and Tuning Guide, SG24-4560.
The phenomenal growth rate and use of client/server networks and database systems, as well as recent technological advances in parallel computing have given rise to the use of commercial parallel database systems by many major corporations. Businesses and other users today require database management systems that can sustain the burgeoning demands of decision support, data mining, trend analysis, multimedia storage, and the use of complex queries in a variety of applications. Parallel processing systems provide the availability, reliability, and scalability that previously were available only from traditional mainframe systems.
There currently exist several hardware implementations for parallel computing systems, including but not necessarily limited to:
Shared-memory approach--processors are connected to common memory resources; all inter-processor communication can be achieved through the use of shared memory. This is one of the most common architectures used by systems vendors. Memory bus bandwidth can limit the scalability of systems with this type of architecture.
Shared-disk approach--processors have their own local memory, but are connected to common disk storage resources; inter-processor communication is achieved through the use of messages and file lock synchronization. I/O channel bandwidth can limit the scalability of systems with this type of architecture.
Shared-nothing approach--As shown in prior art FIG. 1, in a shared-nothing implementation 100, processors 102 have their own local memory 104 and their own direct access storage device (DASD) such as a disk 106; all inter-processor communication is achieved through the use of messages transmitted over network protocol 108. A given processor 102, in operative combination with its memory 104 and disk 106 comprises an individual network node 110. This type of system architecture is referred to as a massively parallel processor system (MPP). While this architecture is arguably the most scalable, it requires a sophisticated inter-processor communications facility to send messages and data between processors. One shared-nothing implementation of IBM's DB2 Parallel Edition approaches near-linear performance scale-up as nodes are added to a parallel system.
Exploiting MPP architecture
DB2 currently executes on at least IBM's AIX/6000 platform, and supports the execution of multiple database servers (or nodes) on a number of processors including the SP2 and on multiple RISC/6000 processors connected via basic LAN communication facilities. DB2 instances that run on the SP2 using the high performance switch have significant performance gains and broader processor scalability than instances running on RISC/6000 clusters. Both of these hardware architectures apply the multiple-instruction, multiple-data (MIMD) approach to parallelism.
MIMD-based Parallelism
This approach to parallelism calls for multiple instruction streams to be simultaneously applied to multiple data streams. The biggest challenges facing database systems that use the MIMD form of parallelism are:
a. implementing a distributed deadlock detection and multi-phase commit protocol;
b. handling inter-processor communications effectively and efficiently; and
c. breaking a processing request into manageable, independent components.
DB2 PE (Parallel Edition) was developed as an extension to the DB2 for AIX/6000 Version 1 product. The non-parallel version is now referred to as DB2 Standard Edition. The DB2 PE extension, hereafter referred to simply as DB2, constituted, in the most basic terms, a replication of the database and database server components into an array of nodes, each node holding a database partition and its own Database Manager. Data is distributed across one or more nodes in a parallel configuration, and the Database Manager at each node manages its part of the database independently of all other nodes. The ratio of processors to nodes can be one-to-one or one-to-many. The nodes defined in a one-to-many configuration are referred to as logical nodes. The parallel extension includes a function shipping strategy to send individual work requests to each of the nodes in a transaction.
The DB2 function shipping strategy involves decomposing a SQL statement into smaller parts, or sub-requests, and then routing the sub-requests to the nodes that are responsible for a portion of the table. Sub-requests are run in parallel; each node sends back qualifying rows to the initiator of the request which, in turn, builds the final answer set that is returned to the user or application.
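The function shipping flow just described can be sketched in miniature as follows. This is a toy model, not the DB2 implementation; the partition contents, the predicate, and all function names are illustrative assumptions:

```python
# Minimal sketch of function shipping: the coordinator decomposes a
# request into per-node sub-requests, each node scans only its own
# partition, and the coordinator merges the qualifying rows into the
# final answer set. All names here are illustrative.

partitions = {                      # node number -> this node's local rows
    0: [("A", 120), ("B", 80)],
    1: [("C", 150), ("D", 95)],
    2: [("E", 200)],
}

def run_subrequest(local_rows, predicate):
    """Executed 'at' a node: filter the local partition only."""
    return [row for row in local_rows if predicate(row)]

def coordinator_query(predicate):
    """Ship the same sub-request to every node, then build the
    final answer set from the rows each node returns."""
    answer = []
    for node, rows in partitions.items():
        answer.extend(run_subrequest(rows, predicate))
    return sorted(answer)

# e.g. SELECT * FROM T WHERE price > 100, shipped to all three nodes
print(coordinator_query(lambda row: row[1] > 100))
```

Because each node touches only its own partition, per-node work shrinks as nodes are added; the merge at the coordinator mirrors the answer-set construction described above.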
Flow diagrams of a SQL statement that has been fully decomposed or partitioned on both a parallel and a non-parallel DB2 server, as detailed in prior art FIG. 1, are shown at prior art FIGS. 2A and 2B. In a serial implementation of DB2, the query plan is executed on a single node, as shown in FIG. 2A. A parallel implementation of DB2 enables the decomposition, or partitioning, of a query plan into a plurality of subplans, also referred to as subqueries, which are executed on the several nodes of the network, as shown in FIG. 2B.
The monolithic query plan which executes a complex query on a single node (UNI) is shown at FIG. 3A as 300. In contrast, a shared nothing MPP query executing the same plan on a number of nodes is shown in FIG. 3B.
Parallel Transactions
Having reference to the latter figure, plan 300 is partitioned by the DBMS into a number of subqueries, also referred to as subplans, or sections, 302-308. The subplans are logically connected to one another by means of table queues (TQs), 402-408. Each subplan is executed on a specified node or nodegroup. A nodegroup consists of 1-n nodes.
Referring now to FIG. 4, subplan 306 is shown to include at least one, and typically a plurality of, operators 402-408, forming a strongly-connected component called a cycle, 410. It will be understood by those having ordinary skill in the art that not all strongly-connected components are cycles. This cycle is in the query flow, and must be completed on tables T1 and T2, and the results passed to subquery 304. Each of the logical data connections between the several subplans, as well as between the subplans and their respective base tables, is made by means of table queues (TQs). DB2 supports both types of SQL decomposition because it allows tables to be defined on one or on many nodes. The physical distribution of data by DB2 is completely transparent to a user or to an application that accesses the data via SQL.
SQL statements executed by a parallel DB2 server are decomposed and processed using a master-slave arrangement between processes and processors. A sample DB2 processing model created by an application that issues a simple query against a table with data spread out across several nodes is shown at prior art FIG. 5. When an application 500 initiates a connection to a parallel database at a node (the coordinating node, 502), a master process, or coordinating agent A1 504, is created to delegate work to subordinate processes or parallel agents A2, A3, A4 (506-510 respectively), located on both coordinating node 502 and other parallel nodes 512 and 514. Each node has access to a partition of the table, 516. The coordinating node and agent remain associated with an application until the application completes.
Coordinating Node
A database connection can emanate from any node in a parallel system, regardless of the location of tables processed by the application. The node at which a connection is started automatically becomes the coordinating node for that application. Consider the location and the frequency of database connects carefully when designing application systems. A large database system could experience bottlenecks at a coordinating node that is required to do heavy sorting when running queries. The coordinating node for a given application is a single node. Different applications executed on the same parallel system may have different designated coordinating nodes.
Distributing Data for Query Performance
DB2 allows you to divide tables across all of the nodes in a configuration or across subsets of nodes. This latter type of distribution is referred to as partial declustering. The rows of a table are distributed to the nodes defined in a nodegroup list specified on the CREATE TABLE statement. To distribute a row to the appropriate node, a hash partitioning strategy is used.
Each table in a multi-node nodegroup uses a partitioning key specified, or defaulted to, when the table was created. When a table row is inserted, a hashing algorithm is applied to the partitioning key value to produce an index into a partitioning map, which is an array of 4096 entries. Each entry in the partitioning map contains the node number where the row is to be stored.
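The partitioning-map mechanism can be sketched as follows. The map size of 4096 matches the text, but the hash function, the number of nodes, and the round-robin default layout are illustrative assumptions:

```python
# Toy model of a hash partitioning map (illustrative only; DB2's
# actual hash function and map initialization are not shown here).
import zlib

NUM_MAP_ENTRIES = 4096          # size of the partitioning map, per the text
nodes = [0, 1, 2, 3]            # hypothetical four-node nodegroup

# Default partitioning map: node numbers assigned round-robin.
partitioning_map = [nodes[i % len(nodes)] for i in range(NUM_MAP_ENTRIES)]

def node_for_row(partitioning_key: str) -> int:
    """Hash the partitioning key to an index into the map, then look
    up the node number stored in that entry."""
    index = zlib.crc32(partitioning_key.encode()) % NUM_MAP_ENTRIES
    return partitioning_map[index]

# Rows with equal partitioning key values always land on the same
# node, which is what makes collocated joins on that key possible.
assert node_for_row("PART-001") == node_for_row("PART-001")
```

The level of indirection matters: redistributing data requires changing only map entries, not the hash function.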
Choosing an appropriate partitioning key is important because it has a direct impact on performance. This has a direct bearing on the join strategies that the optimizer uses. The optimizer selects join strategies based on a lowest-cost strategy. Prior art FIGS. 6-8 provide an illustration of the different join methods used by DB2. The join methods, in lowest to highest computational cost order, are:
collocated--referring to FIG. 6, joined tables 604 and 604' are located on the same node 602 and each corresponding partitioning column is in an equal predicate; this type of join is done at the node-level and requires less communications overhead;
directed--referring to FIG. 7, joined tables 704 and 704' are sent from their nodes, 706 and 708 respectively, to another node, 710, where the local join is done;
repartitioned--referring to FIG. 8, rows from both tables are redistributed to a single node and repartitioned on the joining attributes; and
broadcast--again referring to FIG. 8, all of the rows of one table are broadcast to the nodes containing rows of the other table; this type of join incurs a high amount of communications overhead and should be avoided if possible.
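The cost ordering above suggests a simple decision rule, which can be sketched as follows. The boolean predicates are simplifying assumptions; the actual optimizer chooses among these methods by cost estimation, not by a fixed rule:

```python
# Illustrative decision sketch for the four join methods, ordered
# from lowest to highest communications cost as described above.

def choose_join_method(collocated_partitions: bool,
                       a_partitioned_on_join_col: bool,
                       b_partitioned_on_join_col: bool,
                       equijoin: bool) -> str:
    """Pick the cheapest join method the data placement allows."""
    if not equijoin:
        # without an equality predicate on the partitioning columns,
        # one table's rows must be broadcast to the other's nodes
        return "broadcast"
    if (collocated_partitions and a_partitioned_on_join_col
            and b_partitioned_on_join_col):
        return "collocated"     # each node joins its partitions locally
    if a_partitioned_on_join_col or b_partitioned_on_join_col:
        return "directed"       # send rows of one table to the other's nodes
    return "repartitioned"      # rehash both tables on the join attributes

print(choose_join_method(True, True, True, True))
```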
As used herein, a "query" refers to a set of statements for retrieving data from the stored database. The query language requires the return of a particular data set in response to a particular query but the method of query execution ("Query Execution Plan") employed by the DBMS is not specified by the query. There are typically many different useful execution plans for any particular query, each of which returns the required data set. For large databases, the execution plan selected by the DBMS to execute the query must provide the required data return at a reasonable cost in time and hardware resources. Most RDBMSs include a query optimizer to translate queries into an efficiently executable plan. According to the above-cited Date reference, the overall optimization process includes four broad stages. These are (1) casting the user query into some internal representation, (2) converting to canonical form, (3) choosing prospective implementation procedures, and (4) generating executable plans and choosing the cheapest, or most computationally efficient of these plans.
For example, prior art FIG. 9 shows a query translation process known in the art. Queries written in SQL are processed in the phases shown, beginning with lexing at step 913, parsing and semantic checking at step 914, and conversion to an internal representation denoted the Query Graph Model (QGM) 915, which is a command data-structure that summarizes the semantic relationships of the query for use by the query translator and optimizer components. A query global semantics (QGS) process 917 adds constraints and triggers to QGM 915. A QGM optimization procedure 916 then rewrites the query into canonical form at the QGM level by iteratively "rewriting" one QGM 915 into another semantically equivalent QGM 915. Reference is made to U.S. Pat. No. 5,367,675 issued to Cheng et al., entirely incorporated herein by this reference, for a discussion of a useful QGM rewrite technique that merges subqueries. Also, reference is made to U.S. Pat. No. 5,276,870 wherein Shan et al. describe a QGM optimization technique that introduces a "view" node function to the QGM to permit base table references to "VIEWs" by other nodes. This conditions the QGM to permit the execution plan optimizer 918 to treat a view like a table. Finally, reference is made to U.S. Pat. No. 5,546,576 for a discussion of a Query Optimizer System which detects and prevents mutating table violations of database integrity in a query before generation of an execution plan.
A useful QGM known in the art, and described in the '576 reference is now described in detail. FIG. 15 provides a QGM graphical representation of the following SQL query:
SELECT DISTINCT Q1.PARTNO, Q1.DESCR, Q2.PRICE
FROM INVENTORY Q1, QUOTATIONS Q2
WHERE Q1.PARTNO = Q2.PARTNO AND Q2.PRICE > 100
A SELECT box 1524 is shown with a body 1526 and a head 1528. Body 1526 includes data-flow arcs 1530 and 1532, which are also shown as the internal vertices 1534 and 1536. Vertex 1536 is a set-former that ranges on (reads from) the box 1538, which provides records on arc 1532. Similarly, vertex 1534 ranges on box 1540, which flows records on data-flow arc 1530. The attributes to be retrieved by the query, PARTNO 1546, DESCR 1548 and PRICE 1550, are in head 1528. Boxes 1538 and 1540 represent the base tables accessed by the query, INVENTORY 1542 and QUOTATIONS 1544, respectively. Box 1524 embraces the operations to be performed on the query to identify the PARTNOs that match in the two base tables, as required by the join predicate 1552 represented as an internal predicate edge joining vertices 1534 and 1536. Vertex 1534 also includes a self-referencing predicate 1554 to identify prices of those PARTNOs that exceed 100.
For the purposes of this invention, note that each box or node (formally denominated "quantifier node", or QTB) in FIG. 15 is coupled to one or more other nodes by data-flow arcs (formally denominated "quantifier columns" or QUNs). For instance, base table node 1538 is coupled to select node 1524 by data-flow arc 1532 and base table node 1540 is connected to select node 1524 by data-flow arc 1530. The activities inside select node 1524 produce a new stream of data records that are coupled to the TOP node 1556 along a data-flow arc 1558. TOP node 1556 represents the data output table requested by the query.
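The boxes and data-flow arcs just described can be modeled with a minimal data structure. This is a sketch only; the field names are illustrative and do not reflect the Starburst/DB2 internals:

```python
# Minimal sketch of a Query Graph Model: boxes (quantifier nodes)
# connected by data-flow arcs, here modeling the FIG. 15 query.
from dataclasses import dataclass, field

@dataclass
class Box:
    kind: str                   # "BASE", "SELECT", or "TOP"
    name: str
    head: list = field(default_factory=list)       # output columns
    predicates: list = field(default_factory=list)

@dataclass
class Arc:
    source: Box                 # box the records flow from
    target: Box                 # box the records flow into

inventory  = Box("BASE", "INVENTORY")
quotations = Box("BASE", "QUOTATIONS")
select = Box("SELECT", "join",
             head=["PARTNO", "DESCR", "PRICE"],
             predicates=["Q1.PARTNO = Q2.PARTNO", "Q2.PRICE > 100"])
top = Box("TOP", "result")

boxes = [inventory, quotations, select, top]
arcs = [Arc(inventory, select), Arc(quotations, select), Arc(select, top)]

# A QGM rewrite that merges two boxes collapses the arc between them;
# here we only verify that every arc connects boxes in the graph.
assert all(a.source in boxes and a.target in boxes for a in arcs)
```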
The object of several known QGM optimization procedures is to merge one or more nodes where possible by eliminating (collapsing) data-flow arcs. For instance, the above-cited Pirahesh et al. reference describes a set of rules for merging any number of nodes into a single SELECT node, with certain restrictions on non-existential or non-Boolean factor subqueries, set operators, aggregates and user-defined extension operators such as OUTER JOIN. Thus those skilled in the art know that QGM optimization step 916 usually rewrites the QGM to eliminate numerous nodes and data-flow arcs even before considering useful query execution plans in plan optimization step 918 (FIG. 9). Also, most execution plans usually pipeline data along the data-flow arcs without waiting to complete execution of a node before flowing data to the next node.
QGM optimization procedure 916 rewrites QGM 915 to simplify the subsequent plan optimization process 918, which produces Query Execution Plans (QEPs). Plan optimization procedure 918 generates alternative QEPs and uses the best QEP 920 based on estimated execution costs. The plan refinement procedure 922 transforms optimum QEP 920 by adding information necessary at run-time to make QEP 920 suitable for efficient execution. Importantly, the QGM optimization step 916 is separate and distinct from the QEP optimization in step 918. Reference is made to U.S. Pat. No. 5,345,585 issued to Iyer et al., entirely incorporated herein by this reference, for a discussion of a useful join optimization method suitable for use in QEP optimization step 918. Reference is made to U.S. Pat. No. 5,301,317 issued to Lohman et al., entirely incorporated herein by the reference, for a description of an adaptive QEP optimization procedure suitable for step 918.
QGM 915 used in the Query Rewrite step 916 can be understood with reference to Pirahesh et al. ("Extensible/Rule-Based Query Rewrite Optimization in Starburst", Proc. ACM-SIGMOD Intl. Conf. on Management of Data, San Diego, Calif., pp. 39-48, June 1992).
The concept of a "trigger" is well-known in the art, although triggers are not explicitly included in the SQL-92 standard promulgated by the ISO; they are instead part of the proposed SQL-3 standard. For any event that causes a change in contents of a table, a user may specify an associated action that the DBMS must execute. The three events that can "trigger" an action are attempts to INSERT, DELETE or UPDATE records in the table. The action triggered by an event may be specified by a sequence of SQL statements. Reference is made to the Owens et al. reference and the above-cited Groff et al. reference for detailed examples of row-level and statement-level triggers. Reference is made to Chamberlin, previously cited, for a discussion of triggers, and to "Integrating Triggers and Declarative Constraints in SQL Database Systems", Proc. of the 22nd Int. Conf. on Very Large Databases (1996) (Cochrane, et al.), for a discussion of the basis for the model of SQL-3 triggers.
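A concrete row-level trigger can be demonstrated in any SQL dialect that supports the mechanism; the following uses SQLite via Python for portability (the document's context is DB2, and the table and trigger names here are illustrative):

```python
# Row-level trigger demonstration: an UPDATE event "triggers" an
# action, specified as an SQL statement, for each affected row.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE inventory (partno TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE audit_log (partno TEXT, old_qty INTEGER, new_qty INTEGER);

    -- the UPDATE event on inventory triggers the logging action
    CREATE TRIGGER log_qty_change
    AFTER UPDATE OF qty ON inventory
    FOR EACH ROW
    BEGIN
        INSERT INTO audit_log VALUES (OLD.partno, OLD.qty, NEW.qty);
    END;
""")
con.execute("INSERT INTO inventory VALUES ('P1', 10)")
con.execute("UPDATE inventory SET qty = 7 WHERE partno = 'P1'")
print(con.execute("SELECT * FROM audit_log").fetchall())  # [('P1', 10, 7)]
```

On a shared-nothing system, the difficulty described later in this section is that the trigger's statements must run once per row, yet the rows are spread across nodes.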
The single overriding problem with implementing some complex queries on shared nothing parallel database systems (MPP) is that it is very difficult to perform actions that require local computation or local coordination of computation on data that is distributed among many nodes. Examples of such actions include, but are not necessarily limited to: recursive statements; the assignment of values to variables; the execution of the several statements of a row-level trigger; correlations to common subexpressions; the performance of any function that cannot be run in parallel (including those functions which must be run at the catalog node because they access catalog structures); the use of special registers in buffered inserts; the provision for scrollable cursors which are a result of queries on distributed data; and the checking of unique indexes which are distributed in MPP. These problems are briefly expounded below:
Recursion
By its very nature, recursion has an implied control flow that implements "until no more data, continue computing the query". This control flow is difficult to coordinate among nodes in a distributed system.
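The fixpoint control flow can be sketched on a single node as follows, using transitive closure as an illustrative recursive query. What the text identifies as difficult is distributing the termination test: every node must agree that no node anywhere produced new rows before any node may stop.

```python
# The "until no more data, continue computing" control flow of a
# recursive query, as a naive single-node fixpoint loop.
edges = {("a", "b"), ("b", "c"), ("c", "d")}   # illustrative base table

def transitive_closure(edges):
    reachable = set(edges)
    while True:
        # join the rows found so far back against the base table
        new_rows = {(x, w) for (x, y) in reachable
                           for (z, w) in edges if y == z} - reachable
        if not new_rows:        # no more data: the recursion terminates
            return reachable
        reachable |= new_rows

print(sorted(transitive_closure(edges)))
```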
Assignment Statements
The assignment of a value to a variable must be performed on the node on which the target buffer location exists and will be used in the future. Further, a solution to this particular problem should allow several instances of the operations to run in parallel as long as a given assignment remains in the same node with its associated buffer.
Row-level Triggers
The statements of a row-level trigger must be performed for one row at a time. However, as long as there are not any semantic conflicts between the statements in a trigger, the statements of the trigger can be executed in parallel for each row.
Correlation to Common Subexpression
If a common subexpression is correlated, then all individual consumers of the expression must be collocated. Otherwise, the correlation values would be arriving at the common subexpression from diverse paths and it would be difficult to coordinate these correlation values.
Non-parallelizable and Catalog Functions
Any function that cannot be run in parallel (as dictated by the user during the CREATE FUNCTION) must be run on a single node. A special subclass of non-parallelizable functions must be run at the catalog node because they access catalog structures.
Special Register use in Buffered Insert
A discussion of the general topic of buffered inserts will be found in IBM Technical Publication DB2 Parallel Edition for AIX Admin. Guide and Reference, SC09-1982. Since special register values exist in each section, these values are normally picked up directly from the section that uses them. However, for a buffered insert, these values must come from a single node, which is the coordinator. The reason is that each SQL statement creating a row to be inserted may have new values for special registers. Therefore, the value of special registers must be sent, via a TQ, to the nodes that execute the buffered insert.
Scrollable Cursors
In order to provide scrollable cursors that are a result of queries on distributed data, the results of the query must be shipped and stored at the coordinator node.
Deferred Unique
For FIPS, unique index checking is only performed at the end of a statement. This is after all other constraints are enforced, in particular cascaded referential integrity constraints. The check is verified by calling a data manager routine to perform the check for unique preservation on the index. However, indexes are distributed in MPP, and hence this function must be evaluated on each node where the potentially violated indices reside.
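The end-of-statement check described above can be sketched as follows. This is a simplified model with illustrative names, not the data manager routine itself; it relies on the fact that hash partitioning on the key sends all duplicates of a value to the same node, so per-node checks suffice:

```python
# Sketch of deferred unique checking across partitioned indexes:
# apply all of a statement's changes first, then ask each node to
# verify uniqueness over its own index partition.
from collections import Counter

def check_unique_partition(local_index_values):
    """Run at one node: report duplicated key values in its partition."""
    counts = Counter(local_index_values)
    return [value for value, n in counts.items() if n > 1]

def deferred_unique_check(partitions):
    """At the end of the statement, every node checks its own index
    partition; the statement fails if any node reports a violation."""
    violations = []
    for node, values in partitions.items():
        violations.extend(check_unique_partition(values))
    return violations

# node 1 holds a duplicate key, detected only at end of statement
partitions = {0: ["P1", "P2"], 1: ["P3", "P3"]}
print(deferred_unique_check(partitions))
```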
Each of the preceding complex query component issues illustrates a common problem in the processing of complex queries across the several nodes of a shared nothing parallel database system (MPP): it is very difficult to perform actions that require local computation or coordination of computation on data that is distributed among many nodes. In performing complex queries which incorporate any of the previously discussed computational actions, a user has heretofore been faced with a difficult choice: either limit the scope of the allowed query to preclude those actions which require local computation and thus enable use of the several nodes of the network; or enable those actions, but only on tables that are not partitioned across nodes, thereby losing the computational power of the distributed system.
What is needed then is a methodology, and an apparatus for practicing the methodology, which enables the power and flexibility inherent in shared nothing parallel database systems (MPP) to be utilized on complex queries which have, heretofore, contained query elements requiring local computation or local coordination of data computation performed across the nodes of the distributed system.
What is further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries which include recursive statements.
What is also needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries which include the assignment of value to a variable.
What is yet further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries which include the use of row-level triggers.
What is still further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries which include correlation to a common subexpression.
What is moreover further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries containing non-parallelizable functions and catalog functions.
What is still further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries containing special registers useable in buffered inserts.
What is yet further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of scrollable cursors resulting from queries on distributed data.
What is finally further needed is a methodology which enables, on shared nothing parallel database systems (MPP), the performance and execution of complex queries which enable the deferral of unique checking until the logical end of a statement is completed.
These unresolved problems and deficiencies are clearly felt in the art and are solved by this invention in the manner described below.