1. Field Of The Invention
The present invention is directed generally to data storage in a computerized data processing system, and more particularly, to a recycling process in which valid data sets remaining on old data storage volumes are consolidated to new data storage volumes in order to reuse the old data storage volumes.
2. Description Of The Related Art
A common practice in computer data processing environments (from very small home computers to very large enterprise computers) is to store data sets (e.g., data and program files) onto removable, reusable serial media, such as a tape cartridge (volume). Usually, these data sets are copied or moved to the serial media from a direct access storage device (DASD) such as a disk drive. The process of managing space in primary storage (usually DASD) and secondary or tertiary storage (usually tape) is known as space management. In the MVS (multiple virtual storage) operating system for mainframe computers (MVS is a trademark of International Business Machines), space management for primary storage consists of partial release, expiration, and migration (in that order). Space management for secondary storage consists of invalidation and recycle. Partial release is the act of releasing unused tracks at the end of a data set. Because MVS requires that all DASD data sets be pre-allocated, a data set is often allocated more space than is necessary to hold the data. Partial release returns the unused portions of the data set to the system. Expiration is the act of removing old and unreferenced data sets from the system. When a data set expires, it is deleted from where it resides, either primary storage or secondary storage. Migration is the act of moving infrequently used data sets from primary storage to secondary storage. The space management system controls when a data set gets migrated to secondary storage. Invalidation is the act of marking data so that it will not be copied during a subsequent recycle. Recycle is the act of copying all remaining valid data from a serial recording medium to some other storage medium.
The goal of space management is to make sure that storage is being used as effectively as possible. This is accomplished by establishing a storage hierarchy and moving data sets in the hierarchy according to their management classes. The procedures implemented by space management systems are similar to, but distinct from, the procedures known as backup and archival. Backing up is the act of periodically copying data sets, or portions thereof, from primary storage to backup storage in order to create one or more backup versions of the data sets which can be recovered following a disaster event. Archival is the act of saving a specific version of a data set (e.g., for record retention purposes) for an extended period of time. The data set is placed in archive storage pursuant to a command by the system administrator or system user. It does not occur automatically as is usually the case in a backup operation or a migration operation.
When data sets are written from primary storage to secondary storage, each secondary storage volume is written to capacity, removed and replaced with a new volume so that write operations can continue. To access an individual data set, the secondary storage volume is remounted and a high-speed positioning operation is performed to the correct location.
A secondary storage volume can contain thousands of individual data sets. Over time, data sets become invalid because the data sets are either deleted by their owners, or automatically expired by the system based on age or usage criteria. However, when the secondary storage volume is stored on serial media, which does not have pre-defined, fixed length tracks or sectors as DASD does, the space occupied by the invalid data cannot be reclaimed by overwriting that space with new valid data. Because of variable data set size, compression and imperfections in the media itself, writing over the space freed by the deleted data with new data would run the risk of over-writing adjacent, still valid, data sets. A serial storage volume will thus decline in percentage of valid data, potentially starting out being totally filled with valid data (100% valid), declining to partially valid (1-99% valid) and ultimately becoming entirely invalid (0% valid). Because partially valid volumes are nonetheless "full" in the sense of not being able to accept more data, their storage capacity (of valid data sets) is diminished and it becomes necessary to reclaim the use of the space occupied by the no longer valid data sets.
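The decline in percentage of valid data described above can be illustrated with a minimal sketch. The DataSet class, the percent_valid function, and the sizes and capacity used here are hypothetical and serve only to illustrate the arithmetic; they are not part of any particular space management system.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    # Hypothetical record for one data set written to a serial volume.
    name: str
    size: int           # space occupied on the volume, in arbitrary units
    valid: bool = True  # cleared when the data set is deleted or expires


def percent_valid(data_sets, capacity):
    """Percentage of the volume's capacity still occupied by valid data sets."""
    valid_units = sum(ds.size for ds in data_sets if ds.valid)
    return 100.0 * valid_units / capacity


# A volume written to capacity with three data sets (assumed sizes).
volume = [DataSet("A", 40), DataSet("B", 35), DataSet("C", 25)]
when_filled = percent_valid(volume, capacity=100)

# Invalidating data set B does not free its space for new writes;
# it only lowers the percentage of the volume that holds valid data.
volume[1].valid = False
after_invalidation = percent_valid(volume, capacity=100)

print(when_filled, after_invalidation)
```

In this sketch the volume remains "full" after the invalidation even though only 65 of its 100 units hold valid data, which is the condition that makes recycling necessary.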
To regain the use of serial storage media for storing new data, a reclamation process has evolved that transfers valid data from many full, no longer 100% valid volumes, to create a set of newer, fewer, 100% valid volumes. This is the recycle process referred to above. The recycle process moves all valid data from a number of full, partially-valid volumes and rewrites the data onto newer, fewer volumes, thus freeing up a number of volumes for reuse at the expense of using volumes containing only valid data. For example, if there existed ten volumes each with only 10% of the capacity valid, all of this valid data could be sequentially written onto a single volume, thus producing ten empty volumes and a single full volume. Recycle operations may be single tasking or multitasking. In a single tasking recycle operation, two tape drives are allocated. Data is read from a single partially valid input volume mounted on one drive and emptied into a single output volume mounted on the other drive. When the input volume is emptied of all valid data, another input volume is mounted and its valid data is written onto the single output volume. In a multitasking recycle operation, multiple input-to-output drive pairs are concurrently active. With allowance for potential inefficiencies, a multitasking recycle operation yields an enhancement in recycle performance in roughly the proportion of the number of tasks assigned to perform the recycle.
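The ten-volume example above can be checked with a short calculation. This is a minimal sketch that assumes every volume has the same capacity and that valid data may be packed onto the output volumes without loss; the function name recycle_plan and the units are hypothetical.

```python
import math


def recycle_plan(valid_units_per_volume, capacity):
    """Estimate the outcome of recycling a group of partially valid volumes.

    Returns (output volumes needed, input volumes emptied, net volumes gained).
    Assumes equal-capacity volumes and ideal packing of the valid data.
    """
    total_valid = sum(valid_units_per_volume)
    outputs_needed = math.ceil(total_valid / capacity)
    inputs_emptied = len(valid_units_per_volume)
    return outputs_needed, inputs_emptied, inputs_emptied - outputs_needed


# Ten input volumes, each 10% valid, with a capacity of 100 units per volume.
plan = recycle_plan([10] * 10, capacity=100)
print(plan)
```

Under these assumptions the ten inputs are emptied onto one output volume, for a net gain of nine reusable volumes.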
Another aspect of multitasking recycle operations is the concept of a "connected set." A connected set is one or more serially recorded volumes connected by valid spanned data sets. As a volume is being written to (i.e., filled) it sometimes occurs that a data set is partially written to one volume and completed on another volume. This is caused when the first volume is filled and insufficient space exists to complete the writing of the data set to the first volume. The data set is completed on a second volume and the volume pair represents a connected set of volumes. The connected set size (in number of volumes) could grow to any number of volumes if the second, third, etc. volumes are each forced to complete the writing of a data set on the next volume. The data set that is partially written on two or more volumes is called a spanned data set. Whenever a spanned data set is marked invalid, the connected chain is broken and two new connected sets are potentially created, i.e., those volumes still connected by valid spanned data sets to the volume where the spanned data set started are one connected set, and those volumes still connected by valid spanned data sets to the volume where the spanned data set terminated are another connected set. A typical recycle operation treats a connected set as an entity and determines percent valid properties as a function of the total connected set capacity.
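The grouping of volumes into connected sets, and the way an invalidated spanned data set breaks the chain, can be sketched as follows. This is an illustrative model only: the volume names, the representation of a spanned data set as a list of volumes plus a validity flag, and the function names are all assumptions, and the grouping uses a generic union-find technique rather than any method stated in the text.

```python
def connected_sets(volumes, spans):
    """Group volumes into connected sets.

    volumes: list of volume names.
    spans:   list of (volume_list, valid) pairs, one per spanned data set,
             where volume_list names the volumes it occupies in write order.
    Only valid spanned data sets connect volumes.
    """
    parent = {v: v for v in volumes}

    def find(v):
        # Follow parent links to the representative, shortening the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for vols, valid in spans:
        if not valid:
            continue
        for a, b in zip(vols, vols[1:]):
            parent[find(a)] = find(b)

    groups = {}
    for v in volumes:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


def set_percent_valid(connected_set, valid_units, capacity):
    """Percent valid computed over the total capacity of the connected set."""
    total_valid = sum(valid_units[v] for v in connected_set)
    return 100.0 * total_valid / (capacity * len(connected_set))


volumes = ["V1", "V2", "V3", "V4"]
spans = [(["V1", "V2"], True), (["V2", "V3"], True), (["V3", "V4"], True)]
before = connected_sets(volumes, spans)

# Marking the data set that spans V2 and V3 invalid breaks the chain in two.
spans[1] = (["V2", "V3"], False)
after = connected_sets(volumes, spans)

valid_units = {"V1": 50, "V2": 20, "V3": 10, "V4": 40}  # assumed values
first_set_pct = set_percent_valid(after[0], valid_units, capacity=100)
print(before, after, first_set_pct)
```

With all three spanned data sets valid, the four volumes form a single connected set; once the middle one is invalidated, V1 and V2 form one set and V3 and V4 form another, each evaluated against its own total capacity.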
One of the drawbacks of conventional recycle operations is that each task owns a dedicated input and output drive. This results in idle, unused drive capacity. For example, if thirty volumes are input and the data is consolidated over to ten output volumes, the output drive would be idle during the times that the input drive is mounting volumes, positioning to the next valid data set, and de-mounting the volumes. Likewise, the input drive would be idle during the times that the output drive must rewind, de-mount and then mount the next output volume. This delays the recycle process and renders the idle drives unavailable for reassignment to other secondary storage operations.
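The idle time described in the thirty-to-ten example can be approximated with a simple model. This is a rough sketch under stated assumptions: the per-volume overhead figures are hypothetical values chosen for illustration, and the model counts only the time each drive waits while its dedicated partner is handling volumes.

```python
def drive_idle_time(n_input, n_output, input_overhead, output_overhead):
    """Idle seconds for each drive in one dedicated input/output pair.

    input_overhead:  assumed seconds per input volume spent mounting,
                     positioning and de-mounting (the output drive waits).
    output_overhead: assumed seconds per output volume spent rewinding,
                     de-mounting and mounting the next volume (the input
                     drive waits).
    Returns (input drive idle seconds, output drive idle seconds).
    """
    output_idle = n_input * input_overhead
    input_idle = n_output * output_overhead
    return input_idle, output_idle


# Thirty input volumes consolidated onto ten output volumes, with
# hypothetical overheads of 120 s per input and 180 s per output volume.
input_idle, output_idle = drive_idle_time(30, 10, 120, 180)
print(input_idle, output_idle)
```

With these assumed figures the output drive waits 3600 seconds and the input drive 1800 seconds, time during which neither drive can be reassigned to other secondary storage operations.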
Accordingly, one cannot rely on conventional recycle processes when the efficient transfer of secondary storage data is required. What is needed is an efficient method for improving recycle throughput so that recycle operations are performed more rapidly and the drives used therein are released for other operations as soon as possible.