Storing redundant data can be inefficient. Some forms of data storage redundancy are useful, such as RAID (redundant arrays of inexpensive disks), in which the redundancy promotes reliability. Other forms are wasteful and use storage resources inefficiently. For example, in some computer systems, multiple hosts or processes frequently access the same data in the same storage system. Absent any measures to the contrary, each host or process causes the storage system to store the data in a location (e.g., an area on a disk) independently of any other host or process that may cause the storage system to store the same data in another location (e.g., another area on the disk or another disk).
Data de-duplication is a term that is commonly used to describe methods for reducing undesirable data storage redundancy. Data de-duplication can be employed in various computing system environments, and is especially useful in an environment in which data is backed up to a secondary storage system, as backed-up data typically comprises a large amount of redundant data, i.e., data that is duplicative of data that has been previously backed up. Networked e-mail is another environment in which data de-duplication may be useful, as multiple users commonly have access to copies or duplicates of the same e-mail message.
Data de-duplication can be performed either in real-time, as the data is received for storage (i.e., “in-line”), or after the data has been stored (i.e., “post-processing”). Data de-duplication can be performed at the source, i.e., the host or filesystem that requires access to the data, or at the destination, i.e., the data storage system. Data de-duplication can be performed on a per-file basis or on blocks into which the data has been partitioned. In block-level de-duplication, the blocks can be of fixed size or variable size. Each of these data de-duplication parameters has advantages and disadvantages.
Data de-duplication methods fall into one of two main categories: hash-based or byte-level delta. Hash-based data de-duplication involves partitioning the data into blocks or segments and applying a cryptographic algorithm (colloquially referred to as a “hash” algorithm) to each data segment to produce a hash code or identifier that identifies the segment. Multiple references to this hash code can be stored to accommodate the multiple instances in which various hosts or processes reference the data identified by the hash code, but only a single copy of the data segment itself is stored. Efficiency is achieved because less storage area is required to store the hash codes and multiple references thereto than to store multiple copies of the data itself. Hash-based data de-duplication is commonly performed in-line, i.e., as data is received for storage. As each segment is received, it can be determined whether it is duplicative of data already in storage by applying the hash algorithm and comparing the hash code to those that have been stored. A strong hash algorithm minimizes the likelihood of collision, i.e., that two different data segments will yield the same hash code. However, a strong hash algorithm can inefficiently consume computation (i.e., central processing unit or CPU) resources. Also, providing a unique hash code for every unique data segment requires storage and retrieval of a large number of hash codes and references thereto, thereby inefficiently consuming storage resources. Each hash code itself must be large (i.e., many bytes long) to uniquely identify each unique data segment.
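The hash-based approach described above can be illustrated with a minimal Python sketch. It is not drawn from any particular product: the class name DedupStore, the fixed segment size, and the choice of SHA-256 as the hash algorithm are all assumptions made for illustration. Each incoming segment is hashed in-line; the segment itself is stored only if its hash code has not been seen before, and otherwise only a reference is recorded.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Hypothetical in-memory store illustrating hash-based de-duplication."""

    def __init__(self, segment_size: int = 4096) -> None:
        self.segment_size = segment_size        # assumed fixed segment size
        self.segments: Dict[str, bytes] = {}    # hash code -> single stored copy
        self.refcounts: Dict[str, int] = {}     # hash code -> number of references

    def write(self, data: bytes) -> List[str]:
        """Store data and return the list of hash codes that reference it."""
        recipe = []
        for start in range(0, len(data), self.segment_size):
            segment = data[start:start + self.segment_size]
            code = hashlib.sha256(segment).hexdigest()
            if code not in self.segments:
                # First time this content is seen: keep the one and only copy.
                self.segments[code] = segment
            # Duplicate or not, record one more reference to the hash code.
            self.refcounts[code] = self.refcounts.get(code, 0) + 1
            recipe.append(code)
        return recipe

    def read(self, recipe: List[str]) -> bytes:
        """Rebuild the original data from its list of hash codes."""
        return b"".join(self.segments[code] for code in recipe)
```

In this sketch, writing the same segment from two different hosts adds a second reference but no second copy. The sketch also makes the costs noted above visible: every segment is hashed, and each stored hash code is 32 bytes (64 hexadecimal characters) regardless of how small the segment is.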
Byte-level delta data de-duplication involves comparing multiple versions of data over time and storing only the byte-level differences (i.e., delta) that occur between versions. Byte-level delta data de-duplication is commonly performed as post-processing, i.e., after the data has been stored on disk. While byte-level delta data de-duplication does not generally tax computation or storage resources, it can be very slow if large amounts of data must be compared.
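As a rough illustration of the byte-level delta approach, the following Python sketch compares two versions of a byte string and keeps only the bytes that differ, together with instructions for copying unchanged ranges from the older version. The function names make_delta and apply_delta and the tuple-based delta format are hypothetical, and the standard-library difflib.SequenceMatcher stands in for whatever comparison method a real system would use.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus literal bytes that differ."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged range: store only the offsets into the old version.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Changed or added bytes: these are the only bytes stored.
            delta.append(("data", new[j1:j2]))
        # A "delete" contributes nothing to the new version, so it is skipped.
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

The comparison in this sketch examines the two versions in full, which is consistent with the observation above that byte-level comparison becomes slow as the amount of data grows.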