1. Field of the Invention
This invention relates to database management systems, methods, and programs, and more specifically to the reorganization of hierarchical data involving data that is logically related and data that is the target of indexes.
2. Description of the Related Art
Periodically, databases need to be reorganized. Database performance can be maximized if logically related data is clustered together on the storage device. Clustering is intended to reduce disk traffic because a user often accesses logically related data in temporal proximity. However, as users add and delete data elements, the data can become disorganized such that logically related data is no longer clustered physically together.
For some databases, such as the full function database in IBM's IMS system referred to as IMS DL/I, a user can distribute free space throughout the data set at the time of database creation. This allows for the fact that databases typically tend to grow: more data is added through usage than is deleted. Distributed free space makes it possible to add logically related data elements throughout the data set while keeping them physically clustered together. However, databases typically tend to grow unevenly. In certain areas, more logically related data may be added than there is storage space for. To find space for data that should be clustered with existing data, the data is placed at the end of the data set. This is referred to as overflow, and reaching that data may require maximum arm movement on the DASD storage device. As a result, data access performance suffers, the I/O access rate goes up, and response time goes up, perhaps to such a degree that the degraded performance becomes noticeable to a user. A user may thereby become aware that the data is becoming disorganized.
A database can become disorganized to the point where a different organization of the data elements may result in more efficient operations on the data, more efficient use of data storage, and increased data capacity. The solution then is to perform a reorganization on the data. Reorganization of a database includes changing some aspect of the logical and/or physical arrangement of the database. Any database management system (DBMS) will require some type of reorganization in order to restore a given level of performance and to improve the degraded capacity of the database. One type of reorganization involves the restoration of clustering and removal of overflows as described above.
Another type of reorganization involves splitting up a full partition of data. Since a 32-bit relative byte address can address at most 4 gigabytes, a database that exceeds 4 gigabytes of data has to be partitioned into two or more physical data sets. When a partition is full, it has to undergo a reorganization to get more space.
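The 4 gigabyte limit follows directly from the address width. The arithmetic can be sketched as follows; the function name `data_sets_needed` is hypothetical and the sketch assumes one address per byte and binary gigabytes:

```python
RBA_BITS = 32  # width of the relative byte address discussed above

# A 32-bit RBA can name 2**32 distinct byte offsets within one data set.
MAX_DATA_SET_BYTES = 2 ** RBA_BITS


def data_sets_needed(database_bytes: int) -> int:
    """Smallest number of data sets that can hold the database (ceiling division)."""
    return max(1, -(-database_bytes // MAX_DATA_SET_BYTES))


GIB = 2 ** 30
print(MAX_DATA_SET_BYTES // GIB)   # size of one data set in binary gigabytes
print(data_sets_needed(3 * GIB))   # fits in a single data set
print(data_sets_needed(5 * GIB))   # must be split across partitions
```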
However, during most types of reorganization, the area being reorganized is typically offline and unavailable to users. The duration of data unavailability for reorganization activities can sometimes be measured in weeks. It is undesirable to have the database go offline for significant periods. It is desirable to reduce the amount of time the database is unavailable by reducing the number of steps involved in a database reorganization operation, and more specifically, by reducing the number of steps involved in reorganizing a hierarchical database.
A hierarchical database management system, such as IBM's IMS DL/I, manages data in a tree structure. Each data element is called a segment, and the first data element is called the root segment of the structure. The root segment is the top of the tree, and each subordinate segment is a child of the root, a child of a child of the root, or a child somewhere further down the lineage of the tree. One database is analogous to a forest containing many trees, all of which share the same defined segment structure.
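The tree structure described above can be illustrated with a minimal sketch. The class `Segment`, the function `hierarchical_order`, and the segment type names ("PART", "NAME", "STOCK") are hypothetical and are not taken from IMS DL/I itself:

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Segment:
    seg_type: str                      # hypothetical segment type name
    key: str
    children: List["Segment"] = field(default_factory=list)

    def add_child(self, child: "Segment") -> "Segment":
        self.children.append(child)
        return child


def hierarchical_order(segment: Segment) -> Iterator[Segment]:
    """Visit a segment, then each of its subtrees, top to bottom and left to right."""
    yield segment
    for child in segment.children:
        yield from hierarchical_order(child)


# One database record: a root segment with its dependent segments.
root = Segment("PART", "P100")
name = root.add_child(Segment("NAME", "bolt"))
name.add_child(Segment("STOCK", "bin-7"))
root.add_child(Segment("NAME", "fastener"))

visited = [(s.seg_type, s.key) for s in hierarchical_order(root)]
```

Reading segments in this top-down order is one way to picture the "logical order" in which a database record is traversed; a full database would hold many such roots sharing the same segment structure.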
IMS DL/I also has keyed access directly to any segment in the database. Every segment could be pointed to by an index. An index is similar to an index in a book that provides readers with one relatively quick means of locating a section of interest. An index is a listing that can derive the location of a desired element, eliminating the need for a "brute-force" sequential search through the collection of elements, thereby providing an alternative form of data access. For example, a root segment could be a serial number, part number, an account number, etc., and the first dependent segment of the root might be a name. An index on the name provides direct access to specific data without first knowing the identification of the root segment or segments that have dependent segments with a given name.
Indexing data by using direct pointers to, and between, data elements is common in databases managed by Database Management Systems (DBMS). It is well known that one can point indirectly to something through an index that can be pointed to directly. However, indexing data is usually the result of defining, by the users of such systems, either beforehand or dynamically, the data for which an index is used.
Another use of direct pointers is for secondary indexes. A secondary index is an index on the data in a sequence other than its prime sequence, which gives an alternate path to the data. Secondary indexing on hierarchical data, such as by users of IMS DL/I, is known. A secondary index is a keyed sequential data set (KSDS). For example, if an element (e.g., type of car) of a car manufacturer's database has different values (e.g., different colors), a secondary index by color can be defined that shows all the different types of cars that come in that color.
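The car color example can be sketched as a sorted list of (key, pointer) entries, which loosely mimics a keyed sequential organization. The data values, the RBAs, and the function `car_types_in_color` are hypothetical, and a real KSDS is considerably more involved:

```python
import bisect

# Hypothetical primary data: RBA -> (car type, colors the type comes in).
cars = {
    0: ("sedan", ["red", "blue"]),
    128: ("coupe", ["red"]),
    256: ("wagon", ["blue", "green"]),
}

# Secondary index: one (color, RBA) entry per color value, kept in key sequence.
secondary_index = sorted(
    (color, rba) for rba, (_, colors) in cars.items() for color in colors
)


def car_types_in_color(color):
    """Follow the direct pointers stored under one secondary key."""
    start = bisect.bisect_left(secondary_index, (color,))
    types = []
    for key, rba in secondary_index[start:]:
        if key != color:
            break
        types.append(cars[rba][0])
    return types


red_types = car_types_in_color("red")
```

Because each index entry stores an RBA, every entry that points at a moved record becomes invalid when the record is relocated.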
A further difficulty in managing hierarchical data is that logical relationships may exist between data elements in the same database or between specific data elements in different databases. For example, an employee database may have a relationship to a salary segment in a payroll database. Logical relationships in the data can be defined such that it is possible for an application program to easily find all employees with a given salary or to find the salary of all employees with the same name.
Direct pointing between segments in the same or different databases (in this case, the relative byte address within the data set(s)) is easily managed. This direct relative byte address is meaningful within the context of the definition for the data relationship. The definition of this relationship provides the mechanism for using a relative byte address (RBA) in the correct database data set to find the related data.
When data is moved, as would be done when a database is reorganized, all indexes and logically related segments that have direct relative byte pointing into the data being moved must be updated with the new or current relative byte address for the targeted data before the data can be used again.
Presently, it takes a significant amount of time, and multiple steps, to reorganize a database that uses direct pointers to other data elements in the database, since the direct pointers have to be updated after a reorganization. The database management system does not know in advance which data element a pointer is going to point to, so the database management system has to go back later and update the pointers after the reorganization. Reorganization of such a database is a multi-step process and is very time consuming.
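Why the pointers must be revisited can be seen in a small sketch. Here a data set is modeled as a dictionary from RBA to record, and `reorganize`, `update_pointers`, and `RECORD_SIZE` are hypothetical names chosen for illustration:

```python
RECORD_SIZE = 100  # assumed fixed record length, for illustration only

# Data set before reorganization: records are out of key order.
dataset = {0: {"key": "C"}, 100: {"key": "A"}, 200: {"key": "B"}}

# An index holding direct pointers (RBAs) into the data set.
index = {rec["key"]: rba for rba, rec in dataset.items()}


def reorganize(data):
    """Rewrite records in key order; return the new data set and an old->new RBA map."""
    new_data, moved = {}, {}
    next_rba = 0
    for old_rba, record in sorted(data.items(), key=lambda item: item[1]["key"]):
        new_data[next_rba] = record
        moved[old_rba] = next_rba
        next_rba += RECORD_SIZE
    return new_data, moved


def update_pointers(pointers, moved):
    """Replace every stale RBA with the record's new RBA."""
    return {key: moved[rba] for key, rba in pointers.items()}


new_dataset, moved = reorganize(dataset)
stale_hit = new_dataset[index["A"]]["key"]      # old pointer now lands on the wrong record
fixed_index = update_pointers(index, moved)
```

The old-to-new map only exists once the data has been rewritten, which is why the pointer update is a later pass in this sketch.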
Even though a database having direct pointers creates a time consuming reorganization process, the actual execution time for a database having direct pointers is greatly enhanced. As such, there is a great need, by customers, to be able to use direct pointing to other data elements especially in hierarchical databases such as IBM's IMS DL/I system. Customers greatly desire the ability to express logical relationships between data elements by using direct pointers. Direct pointers give high execution performance.
However, as alluded to above, using direct pointers results in a multi-step reorganization process which causes the database to be gone through multiple times before the database is ready to be used after a reorganization. These multiple steps include 1) running a pre-reorganization utility that examines the structures and determines what has to be done, 2) running a scan utility for data that is not being reorganized but is related to data that is being reorganized, including a reload operation, and 3) running a prefix resolution utility and a prefix update utility.
More specifically, the pre-reorganization utility looks at the database definitions and makes a list of all data elements that are impacted by the planned reorganization of one or more databases. The necessary steps are then scheduled. To reorganize the database, the database is unloaded. The data is read in logical order and collected, in effect, in a clustered format. As a result, reloading is a high speed operation that puts the data back into the database in a clustered format with appropriate free space. Then, a work file is created having information about the old location, the new location, and other information of interest. In addition, scan utilities are run on one or more related databases which are not being reorganized but which are being pointed into or from the database being reorganized. The scan operation also creates work files. The work files are combined as input to a sort. The first sort is a prefix resolution, in which everything is sorted by old location. When all of the data elements from the same old location are together, the new location can be added. Another sort orders the entries by their physical location in the database after the reorganization, so that the pointer updates for the prefix are in the appropriate order for the database. This is the prefix update step. Prefix resolution and prefix update each run as a sort exit, so they are not separate steps. In each case, the entire database is gone through, multiple times, in order to achieve a reorganization. It is these multiple operations on the database that keep the database unavailable for a longer period of time.
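The two sorts described above can be sketched in simplified form. The work file layouts used here, (old RBA, new RBA) pairs from the reload and (pointer location, old target RBA) pairs from the scan, are assumptions made for illustration and do not reflect the actual utility record formats:

```python
def prefix_resolution(reload_work, scan_work):
    """First sort: bring entries with the same old location together, then add the new location."""
    combined = [(old, 0, new) for old, new in reload_work]          # tag 0: relocation entry
    combined += [(old, 1, ptr_loc) for ptr_loc, old in scan_work]   # tag 1: pointer entry
    combined.sort()
    resolved, current_old, current_new = [], None, None
    for old, tag, value in combined:
        if tag == 0:
            current_old, current_new = old, value
        elif old == current_old:
            resolved.append((value, current_new))   # (pointer location, new target RBA)
    return resolved


def prefix_update(resolved, pointers):
    """Second sort: apply the updates in physical order of the pointer locations."""
    for ptr_loc, new_target in sorted(resolved):
        pointers[ptr_loc] = new_target
    return pointers


reload_work = [(300, 0), (100, 64), (200, 128)]       # old RBA -> new RBA
scan_work = [(9000, 200), (8000, 300), (8500, 100)]   # pointer location -> old target
pointers = {8000: 300, 8500: 100, 9000: 200}

resolved = prefix_resolution(reload_work, scan_work)
pointers = prefix_update(resolved, pointers)
```

Sorting by old location lets each pointer entry pick up its new target in a single sequential pass, and sorting by pointer location lets the updates be written in physical order.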
High speed reorganization techniques are known. However, these techniques have merely optimized some of the above utility steps and have not eliminated any of these steps nor have they altered the basic reorganization process as described above. The high speed reorganization techniques make the utilities run faster, such as through I/O techniques and parallel multitasking.
One way to avoid this multiple step reorganization process for direct pointers is not to use direct pointing, but rather use symbolic pointing. If symbolic pointing is used instead of direct pointing, then there is no need to go back and update the symbolic pointing after reorganization because the symbolic name does not change. However, symbolic pointing has low performance during execution time.
In a symbolic link to data, the data is referenced through its current index by using its symbolic name and a hash table or index that points to the current value. As part of a reorganization, this index is always recreated. As such, it is always possible to relate to data elements regardless of a reorganization. The problem is that symbolic links incur extra I/O accesses in order to go to the B-tree, go through the index, find the pointer, and finally get to the data. Symbolic pointing is therefore slow at execution time because of this extra I/O.
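The extra access can be made visible with a sketch that counts reads. The class `CountingStore` and the functions `follow_direct` and `follow_symbolic` are hypothetical, and the index lookup is modeled as a single read even though a real B-tree traversal may take several:

```python
class CountingStore:
    """A dictionary-backed store that counts how many reads it serves."""

    def __init__(self, records):
        self.records = dict(records)
        self.reads = 0

    def read(self, location):
        self.reads += 1
        return self.records[location]


def follow_direct(data, rba):
    """Direct pointing: one read, straight to the stored RBA."""
    return data.read(rba)


def follow_symbolic(index, data, name):
    """Symbolic pointing: read the index to find the RBA, then read the data."""
    return data.read(index.read(name))


data = CountingStore({100: "salary record for E42"})
index = CountingStore({"E42": 100})

direct_value = follow_direct(data, 100)
reads_for_direct = data.reads + index.reads

symbolic_value = follow_symbolic(index, data, "E42")
reads_for_symbolic = data.reads + index.reads - reads_for_direct
```

In this sketch the symbolic name "E42" stays valid if the record moves, provided the index is rebuilt, whereas the stored RBA 100 would not.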
For direct pointing and symbolic pointing, there is a trade-off between reorganization performance and execution time performance. Therefore, there is a need to use direct pointing for high database performance without having to go back and update the pointers after the reorganization.
One type of reorganization, called a fuzzy reorganization, involves reorganization by copying. In this type of reorganization, a reorganizer (the process that performs a reorganization) records the current relative byte address (RBA) of a log. A log consists of a sequence of entries in a file (a region of storage), recording the changes that occur to a database. An RBA is a position in the log where a log entry can be written. At any time, the "current" RBA of the log is the position where the next log entry is written. An RBA is sometimes called a log sequence number (LSN). The reorganizer then copies data from an old (original) area to a new area in reorganized form. Concurrently, users can use the DBMS's normal facilities to read and write the old area, and the DBMS uses its normal facilities to record the writing in a log. The reorganizer then switches the users' access to the new area. In many DBMSs, however, each entry in the log identifies a record by the record's record identifier (RID). As an inherent part of reorganization, the RIDs change. When applying the log (which uses old RIDs) to the new area (which uses new RIDs), techniques for overcoming problems of identification have to be used.
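The identification problem can be illustrated with a simplified sketch. Keeping an old-to-new RID map is only one assumed way of overcoming it and is not necessarily the technique used by any particular reorganizer; the sketch also handles updates only, not inserts or deletes, and all names are hypothetical:

```python
def copy_reorganized(old_area):
    """Copy records to a new area in key order; return the new area and an old->new RID map."""
    new_area, rid_map = {}, {}
    ordered = sorted(old_area.items(), key=lambda item: item[1])
    for new_rid, (old_rid, record) in enumerate(ordered):
        new_area[new_rid] = record
        rid_map[old_rid] = new_rid
    return new_area, rid_map


def apply_log(new_area, log, start, rid_map):
    """Replay log entries written at or after `start`, translating old RIDs to new RIDs."""
    for old_rid, record in log[start:]:
        new_area[rid_map[old_rid]] = record


# Records are (key, version) tuples; the log holds (old RID, new record) entries.
old_area = {7: ("b", 1), 3: ("a", 1), 9: ("c", 1)}
log = [(7, ("b", 1))]              # an earlier change, already reflected in the old area

start = len(log)                   # reorganizer records the current log position
new_area, rid_map = copy_reorganized(old_area)

old_area[9] = ("c", 2)             # a user writes to the old area during the copy
log.append((9, ("c", 2)))          # the DBMS logs the write using the old RID

apply_log(new_area, log, start, rid_map)
```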
One method for finding data moved by a reorganization process uses the fully concatenated key of the target of a logical relationship or secondary index. However, this method requires unique keyed data. A method is needed that allows non-unique and un-keyed data to exist in the database.
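The uniqueness requirement can be seen in a sketch in which a fully concatenated key is modeled as the sequence of keys from the root segment down to the target segment. The nested dictionary layout and the function `find_by_concatenated_key` are hypothetical:

```python
def find_by_concatenated_key(roots, concatenated_key):
    """Return every segment reached by following the keys from the root downward."""
    candidates = roots
    matches = []
    for key in concatenated_key:
        matches = [seg for seg in candidates if seg["key"] == key]
        candidates = [child for seg in matches for child in seg["children"]]
    return matches


roots = [
    {
        "key": "P100",
        "children": [
            {"key": "bolt", "children": [], "note": "first twin"},
            {"key": "bolt", "children": [], "note": "second twin"},
            {"key": "nut", "children": [], "note": "unique"},
        ],
    }
]

unique_hit = find_by_concatenated_key(roots, ("P100", "nut"))
ambiguous_hit = find_by_concatenated_key(roots, ("P100", "bolt"))
```

With unique keys the lookup yields exactly one segment regardless of where the reorganization placed it. With the duplicated key "bolt" it yields two candidates, so the key alone cannot say which segment a pointer originally targeted.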
It is also desirable to reduce the contention between other parts of the system that are not being impacted by the reorganization directly, such as a secondary index. In general, when a database is being reorganized that has alternate, i.e., secondary, indexes associated with a data element being moved, at the time the data element is being moved from the old location to the new location, there is an ability to update all of the indexes. This is because they are known at that time, and the index that is indexing into that point is based on the data itself that is being moved, i.e., contained in effect within the database record. In general, the secondary index is based on data values in that data element. In the IMS DL/I database, the database element can be indexed either on a data value within that element or any element in its dependency tree. During reload, all of the information needed is available to directly update each secondary index at the time of reload. However, updating a secondary index at this time creates recoverability difficulties if a reorganization fails. These recoverability difficulties are reduced if the index is updated after the relocation of the data elements has been successfully completed.
It is desirable to provide for recoverability and data integrity, such as in the event a reorganization fails. Database techniques exist for data integrity and recoverability. For example, changes to a database can be logged; and changed records can be locked making them unavailable to users until the operation is successful, the records are verified to be valid, and the records are then unlocked. If there is a failure, and the operation does not complete, the changes are backed out and the records unlocked. It is desirable to reduce the system management overhead by eliminating the need of performing the steps of logging, locking, and backing out changes in the event of an operation failure.
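The conventional lock, log, and back-out sequence whose overhead is described above can be sketched as follows. The function `guarded_update` and its parameters are hypothetical, and the undo log is kept in memory purely for illustration:

```python
def guarded_update(records, locks, changes, validate):
    """Lock and log each change, then either keep the changes or back them out."""
    undo_log = []
    for rid, new_value in changes.items():
        locks.add(rid)                          # record is unavailable to users
        undo_log.append((rid, records[rid]))    # log the before-image
        records[rid] = new_value
    succeeded = validate(records)
    if not succeeded:
        for rid, old_value in reversed(undo_log):
            records[rid] = old_value            # back out in reverse order
    locks.difference_update(changes)            # unlock in either case
    return succeeded


records = {1: "a", 2: "b"}
locks = set()

kept = guarded_update(records, locks, {1: "x"}, lambda recs: True)
backed_out = guarded_update(records, locks, {2: "y"}, lambda recs: False)
```

Each changed record costs a lock, a log entry, and possibly a back-out step in this sketch, which is the kind of system management overhead the paragraph above seeks to avoid.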