The present invention relates generally to a method, system, and computer program for tape drive data reclamation. More particularly, the present invention relates to a method, system, and computer program for tape drive data reclamation using only a single tape drive.
Hierarchical storage management (HSM) is a known technology that realizes efficient use of a limited storage capacity. HSM is a scheme that places frequently referenced data in a high-speed, high-cost primary storage unit, such as RAID or SSD, and places less frequently referenced data in a low-speed, low-cost secondary storage unit. HSM may be implemented, for example, in IBM® products such as the IBM System Storage Virtualization Engine TS7700 and IBM Spectrum Archive™ Enterprise Edition.
A state in which a certain piece of data is stored only in the primary storage unit is called the “resident” state, a state in which the piece of data is stored in both the primary storage unit and a secondary storage unit is called the “pre-migrated” state, and a state in which the piece of data is stored only in the secondary storage unit is called the “migrated” state. For example, all pieces of TS7700 data are first stored in the primary storage unit and thus placed in the resident state. After several minutes, the pieces of data are copied to the secondary storage unit and thus placed in the pre-migrated state. The pieces of data are then fully moved to the secondary storage unit when the system has very little disk space remaining, thus placing the pieces of data in the migrated state.
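The three states can be expressed as a function of where copies of a piece of data currently reside. The following is a minimal sketch for illustration only; the names `HsmState` and `hsm_state` are hypothetical and do not correspond to any interface of the products named above.

```python
from enum import Enum


class HsmState(Enum):
    """Hypothetical labels for the three HSM states described above."""
    RESIDENT = "resident"          # stored only in the primary storage unit
    PRE_MIGRATED = "pre-migrated"  # stored in both storage units
    MIGRATED = "migrated"          # stored only in the secondary storage unit


def hsm_state(on_primary: bool, on_secondary: bool) -> HsmState:
    """Classify a piece of data by where its copies currently reside."""
    if on_primary and on_secondary:
        return HsmState.PRE_MIGRATED
    if on_primary:
        return HsmState.RESIDENT
    if on_secondary:
        return HsmState.MIGRATED
    # The text defines no state for data held in neither storage unit.
    raise ValueError("data is stored in neither storage unit")
```

In this sketch, newly written data corresponds to `hsm_state(True, False)`, the copy to tape changes the result to the pre-migrated state, and removal from disk changes it to the migrated state.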
Storage products such as the IBM System Storage Virtualization Engine TS7700 and IBM Spectrum Archive™ Enterprise Edition employ a magnetic tape as the secondary storage unit. A magnetic tape is a sequential-access medium. When a certain piece of data is written to the tape and the same piece of data is subsequently updated, the updated piece of data is appended to the end of the tape while the previous data is handled as an invalid area. When updates to the data occur frequently, the proportion of the invalid area increases, causing a relative decrease in the usable capacity of the tape.
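The growth of the invalid area under append-only updates can be illustrated with a toy model. This is a simplified sketch under stated assumptions: the class `Tape` and its methods are hypothetical, records are identified by name only, and real tape formats track validity through their own index or metadata structures.

```python
class Tape:
    """Toy model of a sequential-access tape (hypothetical, for illustration)."""

    def __init__(self):
        self.records = []  # (name, size) tuples in the order they were written
        self.latest = {}   # name -> position of the newest record for that name

    def write(self, name, size):
        # An update is never applied in place: the new version is appended to
        # the end, and the older record with the same name becomes invalid.
        self.records.append((name, size))
        self.latest[name] = len(self.records) - 1

    def invalid_fraction(self):
        """Fraction of the written capacity occupied by superseded records."""
        total = sum(size for _, size in self.records)
        if total == 0:
            return 0.0
        valid = sum(self.records[pos][1] for pos in self.latest.values())
        return (total - valid) / total


tape = Tape()
tape.write("a", 100)
tape.write("b", 100)
tape.write("a", 100)  # update of "a": the first record becomes an invalid area
fraction = tape.invalid_fraction()  # one third of the written data is invalid
```

Each further update of an existing record raises the invalid fraction, which corresponds to the relative decrease in capacity described above.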
A technique called reclamation is known as a scheme for solving this problem. Reclamation reads only the valid data from a tape that includes an invalid area and writes the valid data that has been read to another tape. Reclamation requires two tape drives, because the source tape from which the target data is read and the destination tape to which that data is written must be accessed simultaneously. In recent years, as magnetic tape capacity has increased, reclamation processing has come to take longer and to occupy the two tape drives for longer, which is now recognized as a drawback of the technique. For example, the data transfer rate to a tape compatible with an IBM® TS1150 tape drive is up to 360 megabytes per second. When data is read from a 10-terabyte tape cartridge using two TS1150 tape drives to carry out reclamation, the two drives may be occupied for about eight hours (=10 (TB)/360 (MB/sec)).
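The reclamation step and the occupancy estimate above can be sketched as follows. This is an illustrative sketch only: the function names `reclaim` and `occupied_hours` are hypothetical, a tape is modeled as a plain list of records, and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) are assumed for the arithmetic.

```python
def reclaim(source):
    """Return the records a destination tape would hold after reclamation.

    `source` is a list of (name, payload) records in write order; a later
    record with the same name supersedes (invalidates) an earlier one.
    """
    newest = {}
    for position, (name, _payload) in enumerate(source):
        newest[name] = position  # remember only the latest position per name
    # Copy only the valid records, preserving their order on the source tape.
    return [source[position] for position in sorted(newest.values())]


def occupied_hours(capacity_tb, rate_mb_per_s):
    """Hours needed to stream a full cartridge at a constant transfer rate."""
    capacity_bytes = capacity_tb * 10**12      # assumes decimal terabytes
    rate_bytes_per_s = rate_mb_per_s * 10**6   # assumes decimal megabytes
    return capacity_bytes / rate_bytes_per_s / 3600


destination = reclaim([("a", "v1"), ("b", "v1"), ("a", "v2")])
hours = occupied_hours(10, 360)  # roughly 7.7, i.e. about eight hours
```

The estimate assumes that the maximum transfer rate is sustained for the whole cartridge, so it is a lower bound on the time for which both drives are occupied.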
Further, when data is moved between tape drives, it is possible to transfer the data directly, without going through the primary storage of a host server, by using a tape drive mechanism called Extended Copy. When the two tape drives and the tapes are in good condition and the data can be transferred at the same transfer speed, the data can be transferred most efficiently. However, because reclamation is aimed at resolving invalid regions of the transfer source tape, data is read into a memory of the host server in units of files using a “cp” command, i.e. a “copy” command, or the like, and is written to the transfer destination tape without going through the primary storage, i.e. without using Extended Copy. Because metadata of each file, such as the file name and the file location on the tape, must therefore be recreated for the transfer destination tape, the process takes longer than it would with Extended Copy. Also, reclamation is usually performed when the number of fragments (fragmented data) of the transfer source data becomes large, when a tape is worn out as a result of frequent use, or when data is moved to a newer-generation tape. Therefore, the readout speed and transfer speed of the transfer source tape are often inferior to the write speed and transfer speed of the transfer destination tape. This results in a problem in which the write destination tape drive transfers data at a lower speed than it is capable of in order to synchronize its transfer speed with that of the source tape drive, and the readout speed of the tape drive on the readout side becomes even slower due to the difference in transfer speed between the drives. Further, if reclamation is performed for the purpose of evacuating data from a worn tape, it takes time because data that is difficult to read must be read repeatedly, and in the worst case the data cannot be read out and is lost in the course of reclamation.
Thus, a solution is needed to the problems of prolonged and inefficient data reclamation associated with using multiple tape drives.