Data de-duplication (often called “intelligent compression” or “single-instance storage”) is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy. For example, a typical email system might contain 100 instances of the same 1 megabyte (MB) file attachment. If the email platform is backed up or archived without de-duplication, all 100 instances are saved, requiring 100 MB of storage space. With data de-duplication, only one instance of the attachment is actually stored; each subsequent instance is simply a reference back to the one saved copy. In this example, a 100 MB storage demand is reduced to only 1 MB.
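As a rough illustration of the email example above, the following Python sketch stores each unique payload once and records every later copy as a reference. The class name SingleInstanceStore, its methods, the mailbox names, and the use of SHA-256 to recognize identical content are assumptions made for this illustration; they are not taken from any particular product.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical file-level store: one saved copy per unique payload."""

    def __init__(self):
        self.blobs = {}         # content digest -> the single stored copy
        self.pointers = []      # (name, digest) reference for every saved instance
        self.logical_bytes = 0  # bytes the callers asked us to save

    def put(self, name, data):
        # Assumption: identical content is recognized by its SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # keep only the first copy
        self.pointers.append((name, digest))  # later copies are just pointers
        self.logical_bytes += len(data)
        return digest

    def stored_bytes(self):
        # Bytes actually held on the (simulated) storage media.
        return sum(len(blob) for blob in self.blobs.values())


def demo():
    store = SingleInstanceStore()
    attachment = b"\x00" * 1_000_000  # a 1 MB attachment (decimal megabyte)
    for i in range(100):
        store.put(f"mailbox-{i}/attachment.bin", attachment)
    return store.logical_bytes, store.stored_bytes()
```

Under these assumptions, demo() returns (100000000, 1000000): 100 MB of logical data occupies 1 MB of unique storage, plus the comparatively small list of pointers.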
Data de-duplication offers other benefits. Lower storage space requirements save money on disk expenditures. The more efficient use of disk space also allows for longer disk retention periods, which supports better recovery time objectives (RTOs) over a longer period and reduces the need for tape backups. Data de-duplication also reduces the amount of data that must be sent across a WAN for remote backups, replication, and disaster recovery.
Data de-duplication can generally operate at the file, block, or even bit level. File de-duplication eliminates duplicate files (as in the example above), but this is not a very efficient means of de-duplication. Block and bit de-duplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates an effectively unique number for each piece, which is then stored in an index. If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved; the changes do not constitute an entirely new file. This behavior makes block and bit de-duplication far more efficient. However, block and bit de-duplication take more processing power and use a much larger index to track the individual pieces.
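The block-level behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any specific system: the BlockStore class, the fixed 4096-byte block size, and the in-memory dictionary used as the index are all hypothetical. SHA-1 is used only because it is one of the hash algorithms named above.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class BlockStore:
    """Hypothetical block-level de-duplication store with a hash index."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.index = {}  # SHA-1 digest -> stored block

    def write(self, data):
        """Store data; return (list of block digests, bytes newly stored)."""
        recipe = []
        new_bytes = 0
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha1(block).digest()
            if digest not in self.index:  # index lookup decides whether to store
                self.index[digest] = block
                new_bytes += len(block)
            recipe.append(digest)
        return recipe, new_bytes

    def read(self, recipe):
        """Rebuild the original data from its list of block digests."""
        return b"".join(self.index[digest] for digest in recipe)


def demo():
    store = BlockStore()
    # Ten distinct blocks, each filled with a different byte value.
    original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
    recipe_v1, new_v1 = store.write(original)

    # Overwrite a few bytes inside the fourth block only.
    edited = bytearray(original)
    edited[3 * BLOCK_SIZE + 10:3 * BLOCK_SIZE + 14] = b"EDIT"
    recipe_v2, new_v2 = store.write(bytes(edited))

    assert store.read(recipe_v1) == original
    assert store.read(recipe_v2) == bytes(edited)
    return new_v1, new_v2
```

In this sketch, demo() returns (40960, 4096): the first version stores all ten blocks, while the edited version adds only the one block that changed. Note that the sketch handles an in-place overwrite; with fixed-size blocks, inserting or deleting bytes shifts every later block boundary, so many more blocks would appear changed.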
Conventional de-duplication schemes also suffer significant performance problems as the amount of data being stored continues to grow. As the amount of data stored within a given data storage system expands, the mechanisms supporting conventional de-duplication schemes (e.g., the indexes) cannot adequately scale to meet the additional needs. In particular, when the indexes are too large to be cached entirely in memory (RAM), each index lookup incurs one or more disk accesses, which significantly slows both write and read access to the de-duplication store. The indexes used by conventional de-duplication schemes are therefore proving to be a severe limit on the ability of such schemes to handle large data sets, and conventional schemes cannot scale to handle the many petabytes of data being generated by today's enterprises.
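A back-of-the-envelope calculation gives a sense of why such an index outgrows memory. The sketch below is illustrative only; every figure in it is an assumption rather than a measurement: an 8 KiB block size, a 20-byte digest per entry (the size of a SHA-1 digest), an 8-byte block location per entry, and the worst case in which every stored block is unique.

```python
def index_size_bytes(stored_bytes, block_size, digest_bytes=20, location_bytes=8):
    """Estimate index size, assuming one entry per unique stored block.

    All parameters are illustrative assumptions: digest_bytes=20 matches a
    SHA-1 digest, and location_bytes=8 stands in for an on-disk block address.
    """
    entries = -(-stored_bytes // block_size)  # ceiling division
    return entries * (digest_bytes + location_bytes)


PIB = 2 ** 50  # one pebibyte
TIB = 2 ** 40  # one tebibyte

# One pebibyte of unique data split into 8 KiB blocks.
size = index_size_bytes(PIB, 8 * 1024)
size_in_tib = size / TIB
```

With these assumed figures, one pebibyte of unique data yields 2^37 (about 137 billion) index entries and an index of 3.5 TiB, which is far more than is practical to hold in RAM on a single server, so lookups fall through to disk.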