Data storage is a central part of many industries that operate in archival and compliance application environments, such as banks, government facilities/contractors, securities brokerages, and so on. In many of these environments, it is necessary to store data (e.g., electronic-mail messages, financial documents, transaction records, etc.) for long periods of time. Typically, data backup operations are performed to ensure the protection and restoration of such data in the event of a storage server failure.
One form of long-term archival storage is the storage of data on physical tape media. Noted disadvantages of physical tape media are the slow data access rate and the added requirements for managing a large number of physical tapes. In response to these disadvantages, some storage system vendors provide virtual tape library (VTL) systems that emulate physical tape storage devices. When a storage server writes data (e.g., a backup file) to a virtual tape of a VTL system, the VTL system typically stores the data on a designated region of one or more disks that corresponds to the virtual tape specified in the write command received from the storage server. Typically, space is allocated on the disks in large contiguous sections referred to as block extents, where each block extent includes multiple contiguous data blocks. As each block extent of the virtual tape is filled, the VTL system selects a disk region to which to assign the next block extent. In conventional VTL systems, each block extent is fully and exclusively owned by the virtual tape to which it is allocated.
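The extent-based allocation described above can be illustrated with a minimal Python sketch. It is not drawn from any particular VTL product: the names `ExtentAllocator`, `VirtualTape`, and `write_block`, the extent size of four blocks, and the first-free selection policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

BLOCKS_PER_EXTENT = 4  # assumed extent size, in data blocks (illustrative only)


@dataclass
class VirtualTape:
    name: str
    extents: list = field(default_factory=list)  # extent indices owned by this tape, in order
    blocks_written: int = 0


class ExtentAllocator:
    """Hands out whole extents; each extent has exactly one owning tape."""

    def __init__(self, total_extents):
        self.free = list(range(total_extents))
        self.owner = {}  # extent index -> name of the owning tape

    def allocate(self, tape):
        if not self.free:
            raise RuntimeError("no free extents")
        extent = self.free.pop(0)  # assumed policy: take the first free extent
        self.owner[extent] = tape.name
        tape.extents.append(extent)
        return extent


def write_block(allocator, tape):
    """Append one block to the tape and return its (extent, offset) location."""
    offset = tape.blocks_written % BLOCKS_PER_EXTENT
    if offset == 0:  # first write, or the current extent is full
        allocator.allocate(tape)
    tape.blocks_written += 1
    return tape.extents[-1], offset


allocator = ExtentAllocator(total_extents=8)
tape1 = VirtualTape("TAPE 1")
tape2 = VirtualTape("TAPE 2")
locations = [write_block(allocator, tape1) for _ in range(5)]
first_on_tape2 = write_block(allocator, tape2)
```

In this sketch the fifth write to the first tape triggers allocation of a second extent, and the first write to the second tape receives an extent of its own, so no extent is ever shared between tapes.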
Objects in the VTL system called data maps are used to track the block extents that make up each virtual tape. FIG. 1A illustrates an example of a data map of a conventional VTL system comprising four virtual tapes. In the illustrated embodiment, the data map 100a includes data blocks 101-Z. To facilitate description, empty data blocks represent unused storage space, while data blocks having a fill pattern represent used storage space. The pattern of each data block represents the contents of the data block. Data blocks having the same pattern are duplicate data blocks (e.g., data blocks 104, 125, 136, 145, 156, 157, and 176 are duplicate data blocks). Each of the used data blocks has been written to a designated region on disk that corresponds to one of the virtual tapes. To facilitate description, each used data block is labeled to identify the virtual tape to which the data block was written. For example, data blocks 101-106, 142-144, 156, and 182-185 have been written to regions on disk that have been allocated to virtual tape 1 (e.g., “TAPE 1”).
In typical VTL environments, a storage server performs a backup operation of the storage server's file system (or another data store) to the VTL system. These backups often result in the storage of duplicate data, thereby causing inefficient consumption of storage space on the VTL system. Data map 100a illustrates such inefficient consumption of storage space on a conventional VTL system (i.e., there are 48 used data blocks yet only 8 unique fill patterns).
A technique commonly referred to as “deduplication” may be used to reduce the amount of duplicate data written to disk regions allocated to a virtual tape. Conventional deduplication techniques involve detecting duplicate data blocks by computing a hash value (“fingerprint”) of each new data block that is written to a virtual tape, and then comparing the computed fingerprint to fingerprints of data blocks previously stored on the same virtual tape. When a fingerprint is identical to that of a previously stored data block, the deduplication process determines that there is a high degree of probability that the new data block is identical to the previously stored data block. To verify that the data blocks are identical, the contents of the data blocks with identical fingerprints are compared. If the contents of the data blocks are identical, the new duplicate data block is replaced with a reference to the previously stored data block, thereby reducing storage consumption of the virtual tape on the VTL system.
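The fingerprint-then-verify procedure described above can be sketched for a single virtual tape as follows. This is a minimal illustration under stated assumptions, not the implementation of any conventional VTL system: the class `TapeStore` and its fields are hypothetical, and SHA-256 from Python's standard `hashlib` module stands in for whatever hash function a real system would use.

```python
import hashlib


class TapeStore:
    """Hypothetical per-tape store that deduplicates blocks as they are written."""

    def __init__(self):
        self.blocks = []          # physically stored block contents
        self.by_fingerprint = {}  # fingerprint -> indices into self.blocks
        self.layout = []          # logical tape order: one index into self.blocks per write

    def write(self, data: bytes) -> bool:
        """Write one block; return True if it was replaced by a reference."""
        fingerprint = hashlib.sha256(data).digest()
        for index in self.by_fingerprint.get(fingerprint, []):
            # A matching fingerprint only makes a duplicate highly probable,
            # so the contents are compared before the block is discarded.
            if self.blocks[index] == data:
                self.layout.append(index)  # reference the previously stored block
                return True
        self.blocks.append(data)
        index = len(self.blocks) - 1
        self.by_fingerprint.setdefault(fingerprint, []).append(index)
        self.layout.append(index)
        return False

    def read(self, position: int) -> bytes:
        """Return the block at a logical tape position."""
        return self.blocks[self.layout[position]]


tape = TapeStore()
results = [tape.write(b"A"), tape.write(b"B"), tape.write(b"A")]
```

Here the third write is recognized as a duplicate of the first, so only two blocks are physically stored while the logical tape still holds three.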
Importantly, because each data block is fully and exclusively owned by the virtual tape to which it was written, deduplication in a conventional VTL system can be performed only on a per-tape basis. That is, only duplicate data blocks written to the disk regions corresponding to the same virtual tape can be deduplicated. As a result, deduplication does not eliminate duplicate data blocks written to disk regions allocated to different virtual tapes.
FIG. 1B illustrates a data map 100b of the conventional VTL system, corresponding to the data map 100a illustrated in FIG. 1A, after a deduplication technique is performed. As illustrated in FIG. 1B, although the duplicate data blocks written to disk regions allocated to the same virtual tape have been consolidated into a single data block, deduplication does not eliminate duplicate data blocks written to disk regions allocated to different virtual tapes (i.e., there are 8 unique fill patterns, yet 30 used data blocks remain). As a result, despite deduplication techniques, conventional VTL systems inefficiently consume storage space.
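The gap between per-tape deduplication and the number of unique blocks can be made concrete with a small Python sketch. The write sequence below is a toy example and does not reproduce the data of FIG. 1A or FIG. 1B. The function names are hypothetical, and equality of block contents stands in for the fingerprint comparison and verification step.

```python
def blocks_after_per_tape_dedup(writes):
    """Count blocks that remain stored when duplicates are removed within each tape only."""
    seen_by_tape = {}
    for tape_name, content in writes:
        seen_by_tape.setdefault(tape_name, set()).add(content)
    return sum(len(contents) for contents in seen_by_tape.values())


def unique_blocks(writes):
    """Count distinct block contents across all tapes."""
    return len({content for _, content in writes})


# Toy write sequence: (virtual tape, block contents). Illustrative data only.
writes = [
    ("TAPE 1", "x"),
    ("TAPE 1", "x"),
    ("TAPE 2", "x"),
    ("TAPE 2", "y"),
    ("TAPE 3", "x"),
]

written = len(writes)
remaining = blocks_after_per_tape_dedup(writes)
distinct = unique_blocks(writes)
```

In this toy sequence, five blocks are written and per-tape deduplication reduces them to four, although only two distinct contents exist. The two surplus blocks are the copies of `"x"` held by the second and third tapes, which per-tape deduplication cannot consolidate.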