The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for performing de-duplication as part of other routinely performed processes.
In computing, data de-duplication is a specialized data compression technique for eliminating redundant data in a storage system. The technique is used to improve storage utilization and may also be applied to network data transfers to reduce the number of bytes sent across a link. In the de-duplication process, data objects or chunks are identified and stored during a process of analysis. As the analysis continues, other objects are compared to the stored copies and, whenever a match occurs, the redundant object is replaced with a reference that points to the already stored object or chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is a function of the data characteristics and file size), the amount of data that must be stored or transferred may be greatly reduced.
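The chunk-matching process described above can be illustrated with a minimal sketch. It is not drawn from any particular product and rests on several assumptions made only for illustration: fixed-size chunks, SHA-256 digests used in place of a direct byte-for-byte comparison against the stored copies, and the hypothetical function names `deduplicate` and `reconstruct`.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep one copy of each unique chunk.

    Returns (store, references): store maps a digest to the single stored
    chunk, and references lists one digest per chunk in original order.
    """
    store = {}
    references = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Assumption: equal SHA-256 digests are treated as a match.
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk      # first occurrence: store the chunk
        references.append(digest)      # every occurrence: keep a reference
    return store, references


def reconstruct(store, references) -> bytes:
    """Rebuild the original data by following each reference."""
    return b"".join(store[digest] for digest in references)


data = b"ABCD" * 5 + b"WXYZ" + b"ABCD" * 2
store, references = deduplicate(data)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(len(references), "chunks,", len(store), "stored,", stored_bytes, "bytes kept")
```

In this sketch the 32 input bytes form eight chunks, of which only two distinct chunks are stored; the remaining occurrences are kept as references only. A practical system would also account for the space taken by the references themselves.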
Accordingly, storage-based data de-duplication inspects large volumes of data to identify large sections, such as entire files or large sections of files, that are identical, in order to store only one copy of each. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. Without de-duplication, each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data de-duplication, only one instance of the attachment is actually stored. Each subsequent instance is replaced with a reference back to the saved copy, resulting in a compression ratio of roughly 100 to 1.
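The arithmetic behind the email example can be checked with a short sketch. The attachment contents, the decimal definition of a megabyte, and the use of SHA-256 to detect identical files are assumptions made only for illustration; the small cost of storing the references is ignored, which is why the ratio is only approximately 100 to 1 in practice.

```python
import hashlib

MB = 1_000_000  # assumption: 1 MB taken as 10**6 bytes

# Hypothetical mailbox: 100 instances of the same 1 MB attachment.
attachment = bytes(MB)
mailbox = [attachment for _ in range(100)]

# Without de-duplication, every instance is saved.
raw_bytes = sum(len(item) for item in mailbox)

# With file-level de-duplication, one copy is kept per distinct digest.
unique = {}
for item in mailbox:
    unique.setdefault(hashlib.sha256(item).digest(), item)
stored_bytes = sum(len(item) for item in unique.values())

ratio = raw_bytes / stored_bytes
print(f"raw={raw_bytes // MB} MB stored={stored_bytes // MB} MB ratio={ratio:.0f}:1")
```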