1. Field of the Invention
This invention relates in general to computer-implemented database systems, and more particularly to a method, apparatus and program storage device for determining an optimal number of tasks during reorganization of a database system with memory and processor constraints.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. Relational databases are organized into tables that consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple tuples and multiple columns. The tables are typically stored on direct access storage devices (DASD) such as magnetic or optical disk drives for semi-permanent storage.
A table can be divided into partitions, with each partition containing a portion of the table's data. By partitioning tables, the speed and efficiency of data access can be improved. For example, partitions containing more frequently used data can be placed on faster data storage devices, and parallel processing of data can be improved by spreading partitions over different DASD volumes, with each I/O stream on a separate channel path. Partitioning also promotes high data availability, enabling application and utility activities to progress in parallel on different partitions of data.
Indexing is a technique used by most current database management systems to speed up particular kinds of queries (usually by internally generating and storing redundant information to more quickly locate table entries). An index is an ordered set of references to the records or rows in a database file or table. The index is used to access each record in the file using a key (i.e., one of the fields of the record or attributes of the row).
Over time and with frequent use, databases often become disorganized. Constant additions, deletions and updates cause data to become disorganized. When that happens, retrieving data involves extensive CPU, I/O and elapsed-time, costing time, money and end-user productivity. In turn, customers are lost and revenue decreases. To reduce these costs, data must be reorganized.
Accordingly, numerous attempts have been made to reorganize such databases. Reorganization of a database includes changing some aspect of the logical and/or physical arrangement of the database. Most database management systems (DBMS's) today provide some type of reorganizer utility that provides online reorganization capability.
Reorganizing a database can require significant time and usually involves taking the database offline. Typically, the database may be unloaded to tape or DASD and then reloaded to restore data clustering. Secondary indices for the reorganized database must be reorganized or rebuilt. Usually, a backup image copy of the reorganized database is made. But, because many applications require 24×7 availability of databases, there has been increasing demand to minimize any outages caused by utilities such as reorganizers. An online reorganizer minimizes outages by performing most of its processing while other applications still have access to the database.
In an attempt to speed up the loading of data, various approaches have been tried involving the use of parallel processing. Parallel processing exploits the multiprocessor capabilities of modem high-speed computers and refers to the use of several processors. Parallel execution in reference to database operations applies to the many functions performed on a database, including joins, sorts, load balancing, reorganization, data load, index creation, indexed access, backup, and restore functions. Some or all of such functions may be performed on all database partitions simultaneously. For example, a query may be broken into a number of pieces that can be executed in parallel through intelligent database partitioning. When a database is partitioned, the data may be distributed across database partitions, or subsets of the database, which can reside on multiple servers or within a large SMP server. A partition map facilitates the management of the distribution and redistribution of the data as required.
Nevertheless, such parallelism adds a level of complexity to a database system. For example, loading techniques that rely upon parallel execution of subtasks may need to be optimized to decrease processing times. One such system is disclosed in co-pending patent application by Garth et al. having Ser. No. 09/501493, filed Feb. 9, 2000 and entitled “TECHNIQUE FOR DETERMINING AN OPTIMAL NUMBER OF TASKS IN A PARALLEL DATABASE LOADING SYSTEM WITH MEMORY CONSTRAINTS”, which patent application is assigned to the same assignee as here in, and is hereby incorporated by reference in its entirety. In the Garth et al. application, the optimum number of tasks to be processed by the system for load and sort processes is determined by identifying the memory constraints of the system, identifying available processing capabilities, and determining a number of load and sort processes to be started in parallel based on the identified memory constraints and processing capabilities. However, in contrast to a database system being reorganized, the Garth et al. application does not take into consideration the number of data and index unload subtasks in its taskset calculation.
In addition, during reorganization unload, reload, sort, and build processes may all be active at the same time, so the amount of taskset calculation is further constrained by the internal distribution of processes in addition to the memory and processor constraint.
It can be seen that there is a need for a method, apparatus and program storage device for determining an optimal number of tasks during reorganization of a database system with memory and processor constraints.