1. Field of the Invention
The invention relates to reorganizing database data and index files, particularly DB2 tablespaces, into key order without utilizing conventional sorting procedures, while allowing the tablespaces to be viewed during reorganization and allowing prompt recovery or restarting of the process if interrupted before completion.
2. Description of the Related Art
Databases are used on computers for a myriad of reasons. In many cases the databases are extremely large, having entries in the millions. When databases reach this size, and the information is needed in a transactional or real time basis, mainframe computers are utilized. International Business Machines Corp. (IBM) has developed a database environment referred to as DB2 for use on its mainframes. Given IBM's high market percentage in mainframes, DB2 is clearly the leading database system.
A tablespace is the DB2 term used to identify a database. Tablespaces can be simple, segmented or partitioned. In a simple tablespace, the data is kept in one file and there may be a clustering index and other indices. A clustering index is the index where the keys are kept in sequential order and the data is preferably kept in this same order. In a segmented tablespace, many different logical data files or tables are kept in a single file and there may be a clustering index and other indices for each logical data file. There are no indices directed to multiple logical files. In a partitioned tablespace, the data is kept in different files, but there is a clustering index for each file. There may be additional indices directed to all of the partitions.
DB2 uses a balanced tree index structure. In this structure, root, tree and leaf pages are used, with each page at each level containing the same number of entries, except of course the last one. The leaf pages are the lowest level and each contains a number of entries referring to the actual data records contained in the DB2 data tables. Each leaf page is maintained in internal logical order automatically by DB2. Tree pages are the next level up, and are used to indicate the logical order of the leaf pages. For large databases, there may be many several layers of tree pages, a higher level of tree pages referencing a lower level of tree pages. Ultimately the number of tree pages is reduced such that all the entries or references fit into a single page referred to as the root page. As in leaf pages, within each tree or root page the entries are kept in logical order by DB2.
The data tables or files are organized into pages. The first page of each table is a header page. The second page and certain pages thereafter are referred to as space map pages. The header page contains information relating to the entire table while the space map pages include information relevant to free space in a following number of data pages. The actual frequency of the space map pages is based on tablespace characteristics and is well known. The remaining pages are data pages. Up to 255 rows or records of data may be present in a single page. The actual number of rows depends on the size of the row and the size of the page. Each row receives a unique identifier, referred to as the RID, which identifies the page number and relative row number in that page.
One problem with any database is that the physical location of the various pages often becomes quite scattered. This is true for the data pages in the tables and the leaf pages in the indices. A disorganization also develops between the clustering index and the data, so that the table data is no longer physically in its intended logical order. This scattering results in reduced performance as now the storage device must move between widely scattered physical locations if logically sequential operations are to be performed. This is true of whatever type of Direct Access Storage Device (DASD) is used to store the file. Therefore the files need to be reorganized periodically so that the logical and physical ordering of the pages better correspond, thereby improving performance of operations.
IBM provides utilities with DB2 to reorganize the entire tablespace and just the index files. Several other third-party DB2 utility providers also have tablespace and index reorganization packages. These packages usually operate in the same manner. First, the entire file, either tablespace or index, is read in physical order. Each page in the file is then separated into its component record entries. Next, the record entries are sorted by key value using a conventional sort package. Finally, the sorted records are rewritten back into the file. While this process may sound simple, it must be understood that quite often there are hundreds of thousands to millions of entries in the file. When this number of entries is considered, then the process becomes quite time consuming, particularly the sorting step. The third party packages are faster than IBM's utility, but primarily because the sort packages used are more efficient and also because they use standard available sort package facilities, such as sort exits to reduce intermediate file I/O. So even in those cases the process is quite tedious and is done less frequently than desirable, so overall database performance suffers. Therefore it is desirable to have a significantly faster DB2 table and index reorganization method, so that the tables and indices can be reorganized more frequently and overall operations made more efficient.
Additionally, because the reorganization procedures often take large amounts of time, it would be desirable to access the files for viewing purposes, if not for updating purposes, during the reorganization. Further, should for some reason the process be stopped before completion, it would be desirable to have several alternatives to become fully operational without redoing the entire process.