A storage system typically comprises one or more storage devices into which information may be entered, and from which information may be obtained, as desired. The storage system includes a storage operating system that functionally organizes the system by, inter alia, invoking storage operations in support of a storage service implemented by the system. The storage system may be implemented in accordance with a variety of storage architectures including, but not limited to, a network-attached storage environment, a storage area network and a disk assembly directly attached to a client or host computer. The storage devices are typically disk drives organized as a disk array, wherein the term “disk” commonly describes a self-contained rotating magnetic media storage device. The term disk in this context is synonymous with hard disk drive (HDD) or direct access storage device (DASD).
Data storage is a central part of many industries that operate in archival and compliance application environments, such as banks, government facilities/contractors and securities brokerages. In many of these environments, it is necessary to store selected data, e.g., electronic-mail messages, financial documents and/or transaction records, in an immutable manner, possibly for long periods of time. Typically, data backup operations are performed to ensure the protection and restoration of such data in the event of a failure. However, backup operations often result in the duplication of data on backup storage resources, such as disks, causing inefficient consumption of the storage space on the resources. One form of long-term archival storage is the storage of data on electronic tape media. A noted disadvantage of physical tape media is the slow data access rate and the added requirements for managing a large number of physical tapes. In response to these noted disadvantages, several storage system vendors provide virtual tape library (VTL) systems that emulate tape storage devices using a plurality of disks. In typical VTL environments, the storage system serving as the primary storage performs a complete backup operation of the storage system's file system (or other data store) to the VTL system. Multiple complete backups may occur over time, thereby resulting in an inefficient consumption of storage space on the VTL system. It is thus desirable to eliminate duplicate data on the storage resources, such as disks associated with a VTL system, and ensure the storage of only a single instance of data to thereby achieve storage compression.
One technique for achieving a reduction in data duplication (de-duplication) is described in U.S. Pat. No. 5,990,810, entitled METHOD FOR PARTITIONING A BLOCK OF DATA INTO SUBBLOCKS AND FOR STORING AND COMMUNICATING SUCH SUBBLOCKS, by Ross Williams, issued Nov. 23, 1999 (hereafter “the '810 patent”). The method described in the '810 patent first utilizes a rolling hash function to generate a plurality of sub-blocks of data. The rolling hash is computed over a fixed size window of data, and its value determines where a boundary is placed between two sub-blocks. Once a block of data has been partitioned into sub-blocks, the hash value of each sub-block is calculated to form a table of hash values. The hash table is then used to determine if a new sub-block is identical to any sub-block whose hash value has previously been stored in the hash table. To perform this determination, the new sub-block's hash value is calculated and compared with the values contained in the hash table. If the new sub-block's hash value has been previously stored within the hash table, then the sub-block identified with the stored hash value is considered identical to the new sub-block. In such a case, the new sub-block is replaced with a pointer to the previously stored sub-block, thereby reducing the amount of storage space required for the sub-block. A noted disadvantage of the technique described in the '810 patent is that it requires performance of an extensive number of computationally intensive hashing calculations, which may affect the overall performance of a storage/VTL system implementing such a method. Another noted disadvantage is that the hash table will become larger as the size of a data set increases and may not scale to large data sets such as terabytes or petabytes of data.
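The general approach described above (boundaries chosen by a rolling hash, followed by a hash table lookup per sub-block) can be illustrated with a minimal Python sketch. It is not the '810 patent's actual implementation: the window size, the boundary rule, the polynomial rolling hash, the use of SHA-256 for sub-block hashes, and all names (WINDOW, DIVISOR, BASE, MOD, split_into_subblocks, deduplicate, restore) are assumptions chosen only for illustration.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the '810 patent.
WINDOW = 16           # size in bytes of the fixed rolling-hash window
DIVISOR = 64          # a boundary is expected roughly once every DIVISOR bytes
BASE = 257            # base of the polynomial rolling hash
MOD = (1 << 31) - 1   # modulus of the polynomial rolling hash


def split_into_subblocks(data):
    """Cut data wherever the hash of the last WINDOW bytes matches a fixed pattern."""
    subblocks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # remove the oldest byte
        h = (h * BASE + byte) % MOD                 # append the newest byte
        if i + 1 >= WINDOW and h % DIVISOR == DIVISOR - 1:
            subblocks.append(data[start:i + 1])     # boundary after byte i
            start = i + 1
    if start < len(data):
        subblocks.append(data[start:])              # trailing sub-block
    return subblocks


def deduplicate(data, table):
    """Store unseen sub-blocks in table and return one pointer (digest) per sub-block."""
    pointers = []
    for sub in split_into_subblocks(data):
        digest = hashlib.sha256(sub).digest()
        if digest not in table:
            table[digest] = sub   # first occurrence: keep the data itself
        pointers.append(digest)   # every occurrence: keep only the pointer
    return pointers


def restore(pointers, table):
    """Rebuild the original data by following the pointers."""
    return b"".join(table[p] for p in pointers)
```

In this sketch, backing up the same data a second time adds no new entries to the table and only produces a second list of pointers. It also makes both noted disadvantages visible: every sub-block incurs a hash computation, and the table gains one entry for every distinct sub-block in the data set.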
Another technique, as mentioned in DELTA STORAGE FOR ARBITRARY NONTEXT FILES by Chris Reichenberger, in Proceedings of the 3rd International Workshop on Software Configuration Management, Trondheim, Norway, 12-14 June 1991 (June 1991), ACM, pp. 144-152, is to use the hashes of data to find the longest common data sequence. However, this method computes and compares the hashes of all possible substrings in a data set and does not employ any sampling technique. Therefore, it is not scalable for large data sets.
A noted disadvantage of the above-described techniques is that de-duplication is performed as a two-step process. First, the un-de-duplicated data is stored, thereby requiring sufficient storage space. Second, the stored data is de-duplicated using one of the above-mentioned de-duplication techniques. The use of a two-step de-duplication procedure introduces additional complexity into de-duplication systems. Furthermore, two-step de-duplication increases the overall cost of ownership by requiring system installations to have sufficient space to initially store the un-de-duplicated data. In addition, the de-duplication procedure typically significantly increases the overall processing load on a main processor, such as a central processing unit (CPU), of the system performing the de-duplication. As a result, other processes may be impaired while such de-duplication is ongoing.