Data storage is a central part of many industries that operate in archival and compliance application environments, such as banks, government facilities/contractors and securities brokerages. In many of these environments, one or more storage systems are used to store selected sets of data, e.g., electronic-mail messages, financial documents and/or transaction records, in an immutable manner, possibly for long periods of time. Typically, data backup operations are performed on the storage system to ensure the protection and restoration of such data sets in the event of a failure. However, backup operations often result in the duplication of data on backup storage resources, such as disks, causing inefficient consumption of the storage space on the resources.
One form of long-term archival storage is the storage of data on electronic tape media. Noted disadvantages of physical tape media include a slow data access rate and the added requirements for managing a large number of physical tapes. In response to these disadvantages, several storage system vendors provide virtual tape library (VTL) systems that emulate tape storage devices using a plurality of disk drives. In typical VTL environments, the storage system serving as primary storage performs a complete backup operation of the storage system's data sets (e.g., in the form of backup data streams of a file system or other data store) to the VTL system. Multiple complete backup operations may occur over time, thereby resulting in an inefficient consumption of storage space on the VTL system. It is thus desirable to reduce and/or eliminate duplicate data on the storage resources, such as disks associated with a VTL system, and ensure the storage of only single instances of data to thereby achieve storage compression.
One technique to eliminate duplicate data (data de-duplication) is described in U.S. Pat. No. 8,165,221, entitled SYSTEM AND METHOD FOR SAMPLING BASED ELIMINATION OF DUPLICATE DATA, by Ling Zheng, et al., the contents of which are hereby incorporated by reference. In such a data de-duplication system, the data may be replaced with a descriptor list or other set of partially ordered data, such as, e.g., a plurality of records, each of which describes a segment of the data. For example, if the data to be stored is ABCDA, the data may be replaced with a descriptor list such as {L(A), L(B), L(C), L(D), L(A)}, where L(X) signifies the location of data segment X within a data store utilized by the system. Although the exemplary descriptor list references the location of data segment A twice, only one copy of segment A is actually stored within the data store, thereby resulting in a savings of storage space.
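The ABCDA example may be illustrated by the following minimal Python sketch. It is not the method of the incorporated patent; it assumes, purely for illustration, that the data has already been divided into segments, that exact byte equality identifies duplicates, and that a location L(X) is simply an index into an in-memory list. The names `deduplicate`, `restore`, `data_store` and `descriptor_list` are hypothetical.

```python
from typing import Dict, List, Tuple


def deduplicate(segments: List[bytes]) -> Tuple[List[bytes], List[int]]:
    """Store each distinct segment once and describe the input by locations.

    Assumption: a "location" is the index of the segment in data_store.
    """
    data_store: List[bytes] = []        # single instances of each segment
    location_of: Dict[bytes, int] = {}  # segment contents -> location L(X)
    descriptor_list: List[int] = []     # one location per input segment

    for segment in segments:
        if segment not in location_of:
            # First occurrence: store the segment and remember its location.
            location_of[segment] = len(data_store)
            data_store.append(segment)
        # Every occurrence, duplicate or not, adds a descriptor entry.
        descriptor_list.append(location_of[segment])

    return data_store, descriptor_list


def restore(data_store: List[bytes], descriptor_list: List[int]) -> bytes:
    """Rebuild the original data by following the descriptor list in order."""
    return b"".join(data_store[location] for location in descriptor_list)


# The data ABCDA, already split into one-byte segments for illustration.
data_store, descriptor_list = deduplicate([b"A", b"B", b"C", b"D", b"A"])
```

In this sketch the descriptor list holds five entries, the first and last of which refer to the same location, while the data store holds only four segments. The descriptor list itself grows with the number of segments in the data rather than with the number of distinct segments.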
In a typical VTL environment, the data set may be measured in large quantities, e.g., gigabytes and/or terabytes. One disadvantage of using partially ordered data sets, such as descriptor lists, in such an arrangement is that the descriptor lists may grow to the order of tens of megabytes. Depending on how often a backup operation is performed to the VTL system, the descriptor lists may consume a substantial amount of storage space. Furthermore, input/output operations required to read/save the descriptor lists may have a detrimental effect on the VTL system. Compression of the descriptor lists using conventional compression techniques, such as LZW, GZIP, etc., often has a minimal effect, as these compression algorithms are designed to work on text files.