This invention relates to databases and database management systems and, in particular, to hierarchical database maintenance.
Ideally, when data is stored in a database, it should be stored in physical proximity to other data to which it is related. Such proximal storage reduces disk traffic and I/O access frequency. Over time, however, as data is deleted from and added to the database, data that should be physically proximal or "clustered" becomes dispersed across the database and the storage vehicles (DASD, for example) on which the database is resident.
Some database systems, such as IBM's Information Management System ("IMS DL/I" or alternatively, "IMS"), allow construction of data sets with free space distributed through the storage space. IMS provides the ability to specify, during the initial load or reorganization of a database, that a portion of each block or control interval be reserved as free space. Every n-th block may also be reserved in its entirety. There are two free space parameters: one specifies the percentage of free space for each block and the other specifies the frequency of completely free blocks.
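The effect of the two free space parameters can be illustrated with a minimal sketch. The function name `plan_free_space` and its parameter names are hypothetical and chosen only for illustration; block headers and other per-block overhead are ignored.

```python
def plan_free_space(num_blocks, block_size, free_pct, free_block_every):
    """Return the bytes that may be filled in each block at load time.

    free_pct         -- percentage of every block reserved as free space
    free_block_every -- every n-th block is left completely free (0 = never)
    Both names are illustrative assumptions, not IMS keywords.
    """
    plan = []
    for block_no in range(1, num_blocks + 1):
        if free_block_every and block_no % free_block_every == 0:
            plan.append(0)  # this block is reserved in its entirety
        else:
            plan.append(block_size * (100 - free_pct) // 100)
    return plan


# Four 4096-byte blocks, 25% free per block, every 4th block entirely free.
layout = plan_free_space(4, 4096, 25, 4)
```

In this example three blocks may each be loaded with 3072 bytes and the fourth is left empty, which shows how both parameters trade disk space for room to fit later additions near related data.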
Free space can be helpful or harmful. It will increase the amount of disk space required and may result in extra I/Os. The challenge is to allocate the right amount of free space during database design so that disk space is minimized while the likelihood of fitting additions in the optimum block is maximized. The volume of additions must be estimated as well as the distribution of those additions. Too much free space is an inefficient use of resources, and too little results in increased seek time and increased I/O operations.
Databases express relationships between units of data. In a hierarchical database system, such as IBM's IMS, data is organized in a tree-like structure. Each unit of data is known as a segment and related segments are together known as a record. From a root segment, all other segments in the record bear a direct or indirect subordinate relationship. The root segment of a record is established by the database description or definition process ("DBD"). A segment which depends immediately from the root is a child segment and a child segment may be a parent to segments further from the root.
Over time, databases tend to enlarge unevenly so that some groups or "clusters" of related data increase in population more quickly than others. When data is inserted in an IMS data base, IMS uses a documented strategy that tries to place a segment to be inserted as close as possible to segments to which it is related. IMS first tries to place the segment into the block where related segments reside. If that is not possible, IMS tries to place the segment at least in the same track as related segments. If that is not possible, placement in the present, previous or next cylinder is attempted, and so on until it has searched for room both ahead of and behind the placement area. The available placement area is defined by a "SCAN cylinders" statement specified when the data base is generated during the DBD process. If still there is no available room, the segment is placed at the end of the data set in an area known as "overflow." The overflow area is not contiguous with the root addressable area ("RAA"). If overflow becomes full, IMS will attempt to place the segment anywhere in the data base that room can be found. If there is insufficient free space early in the placement process, data becomes physically dispersed from the data to which it should be proximal. As data becomes dispersed, the disk read head must travel further to access that data and wait longer to complete the random seek on a particular track. Consequently, periodic rearrangement of the no longer clustered data in the database can result in significant improvement in database performance including increased storage efficiency and improved operational speed. Such rearrangement is known in the art as "reorganization."
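The placement order described above may be sketched as follows. This is a simplified model under stated assumptions, not the IMS space search algorithm itself: each block is represented only by a count of free bytes, a cylinder is assumed to hold a whole number of tracks, and the names `candidate_blocks` and `place_segment` are hypothetical.

```python
def candidate_blocks(home, blocks_per_track, blocks_per_cyl, scan_cyls, raa_blocks):
    """Yield root addressable area block numbers in a simplified search order."""
    yield home  # 1. the block where related segments reside
    track_start = home - home % blocks_per_track
    for b in range(track_start, min(track_start + blocks_per_track, raa_blocks)):
        if b != home:
            yield b  # 2. other blocks on the same track
    home_cyl = home // blocks_per_cyl
    cylinders = [home_cyl]
    for distance in range(1, scan_cyls + 1):
        # 3. present cylinder, then previous and next, widening up to SCAN cylinders
        cylinders += [home_cyl - distance, home_cyl + distance]
    for cyl in cylinders:
        for b in range(cyl * blocks_per_cyl, (cyl + 1) * blocks_per_cyl):
            if 0 <= b < raa_blocks:
                yield b


def place_segment(free, home, seg_len, blocks_per_track, blocks_per_cyl,
                  scan_cyls, raa_blocks):
    """Place a segment and return its block number, or None if nothing fits.

    free[b] is the free byte count of block b; blocks numbered
    raa_blocks and above are treated as the overflow area.
    """
    search = list(candidate_blocks(home, blocks_per_track, blocks_per_cyl,
                                   scan_cyls, raa_blocks))
    search += range(raa_blocks, len(free))  # 4. overflow at the end of the data set
    search += range(raa_blocks)             # 5. anywhere room can be found
    for b in search:
        if free[b] >= seg_len:
            free[b] -= seg_len
            return b
    return None


# Eight RAA blocks (two cylinders of two tracks each) plus two overflow blocks.
free = [0, 0, 50, 0, 0, 0, 0, 0, 100, 100]
first = place_segment(free, 0, 40, 2, 4, 1, 8)   # fits in the home cylinder
second = place_segment(free, 0, 40, 2, 4, 1, 8)  # falls through to overflow
```

The second placement illustrates the dispersion problem: once nearby blocks are full, a segment related to block 0 lands in overflow, far from its neighbors.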
Basic IMS access techniques such as Hierarchical Sequential Access Method (HSAM) use sequential access to find a particular segment. The access request starts at the first root, then examines each root sequentially until the destination root is found and then searches up the tree according to certain rules until the target segment is found. Later IMS access techniques developed as part of IMS Version II introduced the hierarchical direct (HD) access methods. Hierarchical direct access methods such as the Hierarchical Indexed Direct Access Method (HIDAM), for example, allow indexed access to any root segment by mapping its "key" to the offset, measured from the beginning of the data set, of the prefix of the root segment of the target record. This requires that a segment in an HD database never move within a dataset until the data base is reorganized.
Even though physical adjacency between logically related segments improves database efficiency, the functional or logical relationship between segments in an HD access IMS database is not expressed through the physical adjacency of those segments in the database. The segments within a data base record in an HD IMS data base are connected using four-byte Relative Byte Address ("RBA") pointers. An RBA pointer is a four-byte field in a segment that designates the starting position of the destination segment relative to the beginning of the dataset. Fixing segment location makes it feasible to use pointers from one segment to other specific segments in other data bases or partitions and from secondary indexes. Pointer use in segments is also valuable within a data base to connect a parent segment to the first or first and last occurrence of each segment type. Pointers can also be used to establish secondary indexes through which an alternative organizational hierarchy perspective or an entry point for the record alternative to the root can be constructed.
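The role of a four-byte RBA pointer can be illustrated with a short sketch. The segment layout used here (a single big-endian pointer followed directly by data) is a deliberately simplified assumption and not the actual IMS segment prefix layout; the use of zero to mean "no pointer" and the names `make_segment` and `follow` are likewise illustrative.

```python
import struct


def make_segment(pointer_rba: int, data: bytes) -> bytes:
    """Build a toy segment: a four-byte RBA pointer followed by the data."""
    return struct.pack(">I", pointer_rba) + data


def follow(dataset: bytes, rba: int) -> int:
    """Read the four-byte pointer stored in the segment that starts at rba."""
    (target,) = struct.unpack_from(">I", dataset, rba)
    return target


# A parent at RBA 0 pointing to a child at RBA 12 (4 pointer bytes + 8 data bytes).
parent = make_segment(12, b"ROOT....")
child = make_segment(0, b"CHILD...")  # zero is used here to mean "no pointer"
dataset = parent + child

child_rba = follow(dataset, 0)
child_data = dataset[child_rba + 4:child_rba + 12]
```

Because the pointer holds an absolute offset from the start of the dataset, moving the child to any other position would leave the parent pointing at the wrong bytes, which is why segment locations must stay fixed between reorganizations.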
Logical relationships can be established to logically link two segments which exist in separate physical databases, partitions or data sets. A logical child is used to construct the logical linkage between the two segments intended to be related. Multiple logical relationships can be constructed to create a hierarchical structure consisting of segments from multiple physical databases to create an alternative logical view of related data which can be seen by an application as a hierarchical database.
In the two segments to be related, the logical child has two parents: a physical parent and a logical parent. The leftmost field in the logical child contains the concatenated key of the logical parent, which gives a symbolic address for the logical parent. An optional direct RBA pointer can be contained in the segment prefix. Thus, if an access request seeks the logical parent, but knows only the location of the physical parent, the path to the logical child (which is the child of the physical parent) is taken where, upon arrival at the logical child, the address of the logical parent is found through the key or pointer in the logical child.
Thus, many useful, logically-ruled organizational structures are dependent upon pointers amongst and between data elements to maintain logical interrelationships and indexes which, although they differ from the physical relationships of the data, depend for their continuance upon the awareness of the physical siting of any data into which pointers direct the process flow. Further, pointers allow entry to a data base at any level of the hierarchy or any instance of a segment type without traversal of the hierarchical path. If a data segment which had been pointed to by the relative byte pointer in another segment is physically moved, established secondary indexes and logical relationships are destroyed unless the new location of the moved target data can be determined. Consequently, two countervailing trends contend in IMS reorganization. The need for operational efficiency dictates periodic reestablishment of physical data clustering. But, because reorganization moves data to reestablish physical grouping and data movement is time-consuming, the advantages of reestablished physical order come at a concomitant data base downtime price.
In conventional reorganization of an IMS database, multiple time-consuming steps are required to resolve the logical remapping required by the physical segment movement implicit in reorganization. For example, current reorganization technology does not determine the new RBA for a reloaded segment until that segment is actually reloaded into the new dataset. Such RBA determination in the multi-step process of prior art reorganization results in significant subsequent time-consuming RBA resolution overhead.
Initially, in conventional reorganization, the data base to be reorganized (target) is unloaded. As the data is then loaded into a new data set to restore physical order, a record is written for each segment to a work file (a WF1 type file, for example) which notes the existence of the segment and its RBA in the new data set. The work file may, in some cases, also note secondary relationships.
Data bases or independent partitions which contain segments to which segments of the target data base are related are scanned by another utility such as DB Scan for example, to determine the presence and position of any such logically related segments. This information is written to a work file similar to the one generated by the load process. Similar scans are run against any other data bases which include segments to which segments of the reorganized data base bear a logical relationship.
After all databases being reorganized have been reloaded and any other databases participating in logical relationships, but not being reorganized, have been scanned, the typically lengthy process of prefix resolution can begin. This is sometimes done serially or in parallel groups of operations. All the work files from the various load and scan processes, such as the WF1 files, are input to the prefix resolution process and sorted. After sorting, logically related segments from the respective databases are matched and yet another work file is created that will be used to update the segment prefixes and pointers in a subsequent prefix update step.
Segment prefixes are updated with the new RBA of their counterparts in related databases. Items updated are logical parent counters and, if virtual pairing is used, "logical child first and last pointers", the logical child's logical parent pointers and, when virtual pairing is used, the logical twin forward and backward pointers. This process is run for each database in the relationship.
When a database is reorganized, the area being reorganized becomes unavailable and, therefore, the data resident in the area under reorganization becomes unavailable. As the multiple steps conventionally required for reorganization are executed, the area under reorganization can be unavailable for lengthy periods which can, on occasion, last for days. Consequently, techniques for rapid reorganization of databases have significant practical and financial value. Therefore, what is needed is a system and method for more rapid database reorganization.
The present invention provides a system and methods for rapid unloading and reorganization of hierarchical databases. The system and method of the present invention may be used in unloading segments to an external file, for example. Another method of the present invention includes calculation of the RBA for a segment before it is reloaded into the new dataset. The characteristics of the output datasets are known before the first segment is actually moved from the dataset to be reorganized. The reorganization step known as "prefix resolution" is, therefore, eliminated with a consequent significant reduction in reorganization elapsed time.
In a preferred embodiment, all overflow and a window that is a DBD defined "SCAN cylinders" of blocks are read into memory. After this, unloading of database record segments by RBA may commence. As unloading proceeds, the window moves ahead while expanding until, in a preferred embodiment, it has expanded to include the block from which the unload is proceeding plus a DBD defined SCAN cylinders of blocks forward from that point as well as a SCAN cylinders of blocks behind that point. For the following exposition, as the unload is underway, a "scan cylinders" window of blocks refers to this entire window. As the database is unloaded, most of the RBAs of the segments unloaded resolve to the areas where IMS normally places these segments, i.e., a block already read from the dataset, a block in the scan cylinders window, or the overflow area. Therefore, segments unloaded will have been read into memory in the present invention. In the rare instance where IMS has placed a segment to be unloaded in a location other than dataset overflow or within the scan cylinders window, a random I/O can be performed to read that segment's block so that as such a segment is unloaded, that segment has been read into memory. Preferably, the reading of sequential blocks stays about scan cylinders ahead of the unload. This limits both real memory over-commitment and waits for blocks to be read.
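The window arithmetic described above may be sketched as follows. This is a minimal model under stated assumptions: the window is measured in blocks rather than cylinders, blocks numbered `raa_blocks` and above are taken to be the overflow area already held in memory, and the function names are hypothetical.

```python
def resident_window(unload_block, scan_blocks, raa_blocks):
    """Return (low, high) block numbers of the fully expanded window."""
    low = max(0, unload_block - scan_blocks)
    high = min(raa_blocks - 1, unload_block + scan_blocks)
    return low, high


def needs_random_io(rba, block_size, unload_block, scan_blocks, raa_blocks):
    """Decide whether the block holding rba must be fetched by a random I/O."""
    block = rba // block_size
    if block >= raa_blocks:
        return False  # overflow was read into memory before unloading began
    low, high = resident_window(unload_block, scan_blocks, raa_blocks)
    return not (low <= block <= high)


window = resident_window(10, 3, 100)
```

With the unload at block 10 and a window of 3 blocks in each direction, blocks 7 through 13 are resident; a pointer into block 50 would be the rare case that requires a random read.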
As a segment is unloaded, its space is converted to free IMS space and, when appropriate, combined with adjacent free space already in the block. Thus, about "scan cylinders" behind the unload point in the data base, all of the segments in a block will have been converted to free space, making the block one unit of free space. There will then be no further references to this block and it may be page released back to the OS memory management. Thus no paging subsystem I/O occurs. In those instances where data remains in the block at the conclusion of the unload, an error is noted that would otherwise have gone unnoticed. In other instances, when an attempt to unload a segment residing in free space is made, another type of error that would heretofore have gone unnoticed is found, namely, an RBA pointer loop.
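The free space bookkeeping described above can be sketched with a small class. This is an illustrative model only: free space is tracked as a list of byte extents rather than in the IMS free space element format, and the class and method names are hypothetical.

```python
class Block:
    """Toy model of one block whose unloaded segments become free space."""

    def __init__(self, size):
        self.size = size
        self.free = []  # sorted, non-overlapping (start, end) free extents

    def release(self, start, length):
        """Convert a segment's space to free space and merge with neighbors."""
        end = start + length
        for s, e in self.free:
            if start < e and s < end:
                # The segment lies in space that is already free, so it has
                # been unloaded before: a symptom of an RBA pointer loop.
                raise ValueError("segment already free: possible RBA pointer loop")
        merged = []
        for s, e in self.free:
            if e == start:
                start = s      # absorb the free extent just before the segment
            elif s == end:
                end = e        # absorb the free extent just after the segment
            else:
                merged.append((s, e))
        merged.append((start, end))
        merged.sort()
        self.free = merged

    def is_one_free_unit(self):
        """True when the whole block is a single unit of free space."""
        return self.free == [(0, self.size)]


block = Block(30)
block.release(0, 10)
block.release(20, 10)
partly_unloaded = block.is_one_free_unit()
block.release(10, 10)  # merges with the free extents on both sides
```

A block for which `is_one_free_unit()` is true can be released from memory, and a block for which it is still false at the end of the unload signals the leftover data error described above.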
When a new database is populated with segments from a disorganized database, the invention provides methods for advance calculation of what the segment RBA is going to be in the database to be reloaded. The space search algorithm used in the actual load of the new data set is used in a proxy load of a proxy dataset. The proxy dataset consists of proxy blocks. Each proxy block in the proxy data set is represented by a counter that denotes the space available in the proxy block.
Segments are unloaded in an algorithmic order that corresponds to a hierarchical relationship in the database. This corresponds to the state of initial load when segments within a record are physically stored in hierarchical sequence. In alternative embodiments, alternative algorithms representative of other logical hierarchies may also be used. As the segments are unloaded, rather than a literal load of the proxy dataset, the length of each segment is sequentially deducted from the proxy block counter. Alternative embodiments may use counters that can be accumulated to contemplate the size of the segments. Because, in a preferred embodiment, the proxy load uses the same algorithm that will be used to actually populate the new reorganized dataset, at each proxy segment load, the counter may be used to calculate the RBA the segment will exhibit in the reorganized dataset.
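The counter-based proxy load may be sketched as follows. The source does not spell out the space search algorithm used by the actual load, so a plain sequential fill stands in for it here as an assumption; block overhead is ignored, and the name `proxy_load` and its parameters are hypothetical.

```python
def proxy_load(segment_lengths, block_size, usable_per_block=None):
    """Return the future RBA of each segment without moving any data.

    segment_lengths  -- segment sizes in hierarchical unload order
    usable_per_block -- bytes that may be filled per block (the remainder is
                        left as free space); defaults to the whole block
    """
    usable = block_size if usable_per_block is None else usable_per_block
    remaining = usable  # the proxy block's counter of available space
    block = 0
    future_rbas = []
    for length in segment_lengths:
        if length > usable:
            raise ValueError("segment larger than the usable space of a block")
        if length > remaining:
            block += 1          # move on to the next proxy block
            remaining = usable  # and reset its counter
        offset = usable - remaining
        future_rbas.append(block * block_size + offset)
        remaining -= length     # deduct the segment length from the counter
    return future_rbas


# Three segments, 100-byte blocks of which 80 bytes may be filled.
rbas = proxy_load([50, 20, 50], 100, 80)
```

Only a counter per proxy block is maintained, so the future RBA of every segment is known during the unload pass, provided the later real load follows the same placement rules.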
The indicated future or new RBA is recorded. In a preferred embodiment, the future RBA is stored to a table. Also stored in the table is the segment's current RBA. The table is indexed by hashing (preferably) or sorted (alternatively) by current (soon to be prior) RBA. For databases that contain segments logically related to segments in databases to be reorganized, a scan parses other datasets or databases for segments that participate in logical relationships with segments in the dataset under reorganization. The logical parent or logical child RBA pointer of such segments is used to search the RBA table. When a match is found, the RBA in the segment's prefix is replaced with the corresponding new RBA found in the table. For segments in databases being reorganized, the RBA for segments in logical relationships is used to search the RBA table. When a match is found, the new RBA is placed in the segment pointer field in place of the old or prior RBA.
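The table lookup described above can be sketched with a hash-indexed mapping. This is a minimal illustration: a Python `dict` stands in for the hashed RBA table, pointers are modeled as plain integers rather than fields in a segment prefix, and the function names are hypothetical.

```python
def build_rba_table(pairs):
    """Build a table keyed by current (soon to be prior) RBA.

    pairs -- iterable of (current_rba, future_rba) recorded during unload
    """
    return dict(pairs)


def update_pointers(pointers, table):
    """Replace each pointer that matches a table entry with its new RBA.

    Returns the updated pointer list and the number of matches found.
    Pointers with no match are left unchanged.
    """
    updated = []
    matches = 0
    for rba in pointers:
        if rba in table:
            updated.append(table[rba])
            matches += 1
        else:
            updated.append(rba)
    return updated, matches


table = build_rba_table([(4096, 0), (12288, 50), (300, 100)])
new_pointers, matched = update_pointers([12288, 7777, 300], table)
```

Each pointer is resolved by a single hashed lookup, which is the reason no sort and match pass over work files is needed in this sketch.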