Database management systems (DBMS) support the definition, retrieval, and updating of data stored in a database. A relational database management system (RDBMS) is a particular form of a database system where data is stored in a tabularized form. The database tables consist of sets of rows which share common characteristics.
The database is physically stored as pages of data on nonvolatile storage devices such as direct access storage devices (DASD). An index can also be stored on DASD listing a directory for locating specific data which aids in the retrieval of that data.
Structured query languages, such as the SQL language, have been developed for relational database systems to access the data in a database. The relational database management systems implement the functions of the SQL language.
An example of a table stored in a relational database system is an EMPLOYEE.sub.-- SALARY table which stores salary information for all employees of a company or enterprise as follows:
______________________________________ emp.sub.-- no emp.sub.-- name dept.sub.-- no salary ______________________________________ 1001 Sinnite, J D01 20000 1002 Williams, J D01 33000 1003 Crane, P D02 25000 ______________________________________
The EMPLOYEE.sub.-- SALARY table has columns for employee number (emp.sub.-- no), employee name (emp.sub.-- name), department number (dept.sub.-- no) and salary. Within the table, each row lists an employee information set called a tuple comprising the employee number, name, department number and salary for a given employee.
Computer programmers write application programs to access and maintain the data in the database. The application programs are executed by the database management system. The application programs need to be processed by the central processing unit (CPU) of the computer system for execution by the CPU. There are three stages in processing an application program--the precompilation, the compilation (also referred to as bind time) and the execution (also referred to as execution time or run time). During precompilation, all SQL statements are extracted from the application program. The SQL statements are used to access the data from the database.
SQL statements specify what data is wanted but not how to get the data. The relational database management system determines the optimal method for accessing and retrieving the requested data. A strategy is deemed optimal in terms of its minimizing the resource utilization costs. Typically, a query optimizer process is performed, during bind time, to determine an optimal strategy. The optimal query processing strategy is transformed, also during bind time, into a query execution plan which consists of a series of steps executable by the CPU to access the requested data and satisfy the query. During execution time, the execution plan as constructed during bind time is executed. The query execution plans can be stored and executed or re-executed as needed.
The application program can include host variables which can change the execution of the embedded SQL statement. A host variable is a PL/I variable, a C variable, a FORTRAN variable, a COBOL data item, or an assembly language storage area referenced in a SQL statement. During execution time, when the query execution plan is activated and executed, values for the host variables are passed from the application program to the query execution plan.
An example of a SQL statement using a host variable is as follows:
Select MAX(SALARY) PA1 From EMPLOYEE.sub.-- SALARY PA1 Where DEPT.sub.-- NO=:dept.sub.-- no;
The host variable in the preceding Select statement is ":dept.sub.-- no". The SQL Select statement performs the operation of retrieving the maximum salary for employees in the department specified by the host variable. An EMPLOYEE.sub.-- SALARY table is accessed to satisfy the preceding query statement. One of the columns (attributes) of the table is dept.sub.-- no (the employee's department number). The MAX function identifies the maximum salary for the employees in the given department. The host variable dept.sub.-- no allows the department number for the query to be specified by the application program. During bind time, the value of the host value variable is unknown.
In order to satisfy the preceding query, the query access strategy needs to specify how the EMPLOYEE.sub.-- SALARY table will be accessed and how the MAX function will be evaluated. The EMPLOYEE.sub.-- SALARY table can be accessed either using a sequential scan of all the pages of records stored in the DASD or by using an index for the table.
In order to improve the RDBMS performance in evaluating and satisfying queries, it is desirable to exploit the inherent parallelism in multiple CPUs or I/O devices available in the computer system during execution. For example, when performing a sequential table scan of a table that is stored across multiple I/O devices, the table scans on the separate I/O devices can be performed at the same time to reduce I/O time by utilizing the concurrency of multiple asynchronous I/O operations on the devices. Parallelism can also be exploited by using multiple CPUs to evaluate the data according to criteria provided by a query, so that total CPU time is lowered. A more complex parallelism operation involves partitioning the query execution plan among CPUs and executing operations in parallel. The query optimizer needs to consider whether a parallel strategy should be invoked when determining the optimal strategy that minimizes CPU time and resource utilization costs.
A parallel execution plan expresses the degree and unit of parallelism. The degree of parallelism is the number of CPUs and I/O streams (processes) used for each unit of parallelism. The unit of parallelism refers to the group of operations (executable steps in an execution plan) assigned to the same process for execution where there is no blocking for operations within a unit. Blocking occurs when an operation has to wait for another operation to complete. For example, blocking occurs when an operation needs to wait for an intermediate result to be stored in a temporary table.
Finding the optimal parallel execution strategy for a query is a difficult problem because of the large number of different execution strategies available for processing a query (i.e. the large solution space). For a given query, the optimizer may need to decide the sequence of the join operations, the methods of implementing the join operations, the methods for accessing tables and which, if any, indexes to use to access the data on the storage device and to sort the data. In an execution environment (computer system) where parallelism is available, the query optimizer also needs to decide which, if any, operations could be optimized through parallel processing. The optimizer must then further decide how to perform each such operation in a parallel mode. Therefore, finding the optimal parallel execution strategy for a query is considerably more difficult and CPU intensive than determining a sequential execution strategy.
One approach to optimizing queries using parallel execution is to first determine the best sequential query execution strategy. The parallel strategies are then identified based on the best sequential plan. In that way, the time to produce a parallel query execution strategy is reduced by having limited the solution space. The parallel strategy based on the best sequential strategy may not be the optimal strategy compared with all feasible parallel strategies. However, this solution is generally considered to be a good compromise considering the time it would take to find the real optimal parallel strategy.
A parallel execution strategy was proposed by Hong and Stonebraker in "Optimization of Parallel Query Execution Plans in XPRF", Proceedings of the First International Conference on Parallel and Distributed Information Systems, 1991, pp. 218-225. This parallel execution strategy makes a division between the optimization procedures performed during compile or bind time and the optimization strategy that is performed during execution time, with the majority of the work being done during execution time. During bind time, the optimal sequential strategy is determined. Also during bind time, choose nodes are inserted into the sequential plan to enable the modification of the sequential plan during execution time with regard to the join method and the other similar strategies. The actual parallel execution strategy is generated during execution time. Both the unit and degree of parallelism are determined during execution time.
The more work that is done towards generating the execution strategy prior to execution time, the greater the amounts of CPU time and other run time resources are conserved, particularly for queries that are compiled and then re-executed a number of times.
However, the parallel plan should not be entirely determined during bind time. An optimal parallel strategy may seem optimal at bind time but may become sub-optimal during execution time due to the execution time environment variables referred to as "run time" variables, such as the amount of free adjustable buffer pool space, the number of free CPUs and I/O streams, and the values of the run time host variables.
There is a need for a process for determining a parallel execution strategy during bind time to minimize the consumption of execution time while allowing some flexibility to make changes to the plan in response to the run time environment. Particularly for large enterprises running numerous queries on a database, it is important that the compiled logic that is stored for a query be maximized so that only a minimal but sufficient number of decisions are made during execution time. It is also desirable to be able to execute the best sequential plan without modification when the runtime environment is such that the sequential plan would be optimal.
One or more of the foregoing problems are overcome and one or more of the foregoing needs are satisfied by the present invention.