In computing, data de-duplication is a specialized data compression technique for eliminating redundant data in a storage system. The technique improves storage utilization and can also be applied to network data transfers to reduce the number of bytes sent across a link. In the de-duplication process, unique data objects or chunks are identified and stored during analysis. As the analysis continues, other objects are compared to the stored copies, and whenever a match occurs, the redundant object is replaced with a small reference that points to the stored copy. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends in part on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
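One method for de-duplicating data relies on cryptographic hash functions to identify duplicate segments of data. The system computes a hash value for each segment and compares it with the hash values of the segments it has already stored. If the hash value is already known, the system treats the segment as a duplicate: instead of storing the data again, it replaces the segment with a reference to the stored copy. This approach assumes that identical hash values imply identical data. If two different data sets generate the same hash value, this is known as a collision, and a collision would cause distinct data to be wrongly treated as a duplicate. The probability of a collision depends upon the hash function used.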
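The hash-based approach can be illustrated with a minimal sketch. It is not taken from any particular product: the class name `DedupStore`, the fixed-size chunking, and the choice of SHA-256 are all assumptions made for illustration. Each unique chunk is stored once under its digest, and the original data is represented as an ordered list of digests that act as references.

```python
import hashlib


class DedupStore:
    """Toy store (hypothetical): each unique chunk is kept once, keyed by its digest."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumption: fixed-size chunks
        self.chunks = {}              # hex digest -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data and return its recipe: the ordered list of chunk digests."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Equal digests are treated as equal data, so a chunk whose digest
            # is already known is not stored a second time.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe) -> bytes:
        """Rebuild the original data by following the references in order."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self) -> int:
        """Total size of the unique chunks actually kept."""
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
data = b"ABCDABCDABCDXYZ"   # three identical 4-byte chunks plus a 3-byte tail
recipe = store.put(data)
assert store.get(recipe) == data
print(len(data), store.stored_bytes())  # prints: 15 7
```

In this sketch the store trusts the digest alone. A system that cannot tolerate even a very unlikely collision could also compare the chunk bytes before discarding the new copy, at the cost of an extra read.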
Accordingly, storage-based data de-duplication inspects large volumes of data to identify large sections that are identical, such as entire files or large sections of files, in order to store only one copy of each. For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data de-duplication, only one instance of the attachment is actually stored; each subsequent instance is replaced with a reference to the saved copy, resulting in a compression ratio of roughly 100 to 1.
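The arithmetic of the email example can be checked with a short sketch. It assumes whole-file de-duplication keyed by a SHA-256 digest, takes 1 MB as 10^6 bytes, and uses a block of zero bytes as a stand-in for the attachment. The variable names are illustrative only.

```python
import hashlib

MB = 1_000_000  # assumption: 1 MB taken as 10**6 bytes

attachment = bytes(MB)                        # stand-in for the 1 MB attachment
mailboxes = [attachment for _ in range(100)]  # 100 instances of the same file

stored = {}      # digest -> file contents, kept once
references = []  # one digest per instance, pointing at the stored copy
for blob in mailboxes:
    digest = hashlib.sha256(blob).hexdigest()
    stored.setdefault(digest, blob)  # only the first instance is stored
    references.append(digest)

logical_bytes = sum(len(b) for b in mailboxes)         # size without de-duplication
physical_bytes = sum(len(b) for b in stored.values())  # size with de-duplication
ratio = logical_bytes / physical_bytes
print(logical_bytes // MB, physical_bytes // MB, ratio)  # prints: 100 1 100.0
```

The ratio comes out at exactly 100 here because the space taken by the 100 references is not counted. Including that overhead is what makes the real figure "roughly" 100 to 1.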