Data de-duplication reduces the overall amount of data storage required to represent and retain data by identifying duplicate portions of the data and replacing those duplicate portions with pointers to existing copies of that data.
Much of the information stored, communicated, and manipulated by modern computer systems is duplicated within the same or a related computer system. It is commonplace, for example, for computers to store many slightly differing versions of the same document. It is also commonplace for the data transmitted during a backup operation to be almost identical to the data transmitted during the previous backup operation. Computer networks must also repeatedly carry the same or similar data in accordance with the requirements of their users.
One method for reducing the redundancy of communicated and stored data is known as data de-duplication. Data de-duplication attempts to identify identical portions of data within a group of blocks of data and to use this identification to increase the efficiency of systems that store and communicate data.
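The idea can be illustrated with a minimal sketch. It assumes fixed-size blocks and a SHA-256 digest as the block identifier; the block size and the names deduplicate, restore, and store are hypothetical and chosen only for illustration, not taken from any particular de-duplication engine.

```python
import hashlib

BLOCK_SIZE = 4  # assumed, deliberately tiny so the example is easy to follow


def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep each unique block once.

    Returns the list of block identifiers (pointers) that represent data.
    """
    pointers = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        block_id = hashlib.sha256(block).hexdigest()
        # Only the first copy of a block is kept; later copies become pointers.
        store.setdefault(block_id, block)
        pointers.append(block_id)
    return pointers


def restore(pointers: list, store: dict) -> bytes:
    """Rebuild the original data by following the pointers into the store."""
    return b"".join(store[p] for p in pointers)


store = {}
original = b"ABCDABCDABCDWXYZ"
pointers = deduplicate(original, store)
```

In this sketch the 16-byte input is represented by four pointers but only two stored blocks, because the block ABCD occurs three times and is stored once.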
However, present de-duplication engines may be susceptible to an attacker who modifies the identification information in an attempt to corrupt the backup archives or to insert malicious data into them.
For example, if an attacker knew a certain executable was going to be backed up, they could construct their own version of the executable with a malicious payload. If they caused their version to have the same identifying characteristics as the original version, then once the malicious version was stored, any later legitimate version would not be stored. Instead, the de-duplication engine would throw away the valid data and provide a reference to the malicious data.
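The scenario above can be sketched as follows. The identifier used here is a one-byte additive checksum, an assumption made purely so that a matching identifier is trivial to forge; the text does not specify how identifying characteristics are computed, and forging a match against a cryptographic hash would be far harder. The names weak_id, forge, and DedupStore are hypothetical.

```python
def weak_id(data: bytes) -> int:
    """Deliberately weak identifier (assumed): sum of all bytes modulo 256."""
    return sum(data) % 256


def forge(payload: bytes, target_id: int) -> bytes:
    """Append one padding byte so that payload's identifier equals target_id."""
    pad = (target_id - weak_id(payload)) % 256
    return payload + bytes([pad])


class DedupStore:
    """Toy store that trusts the identifier and never compares contents."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> int:
        block_id = weak_id(data)
        # If the identifier is already present, the new data is discarded
        # and the caller simply receives a reference to the existing copy.
        if block_id not in self.blocks:
            self.blocks[block_id] = data
        return block_id

    def get(self, block_id: int) -> bytes:
        return self.blocks[block_id]


legitimate = b"legitimate executable contents"
malicious = forge(b"malicious payload", weak_id(legitimate))

store = DedupStore()
store.put(malicious)             # the attacker's version is stored first
ref = store.put(legitimate)      # the later legitimate backup is discarded
restored = store.get(ref)        # restoring returns the attacker's data
```

Because the store treats a matching identifier as proof of identical content, the reference returned for the legitimate executable resolves to the attacker's version.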
The drawings referred to in this brief description should be understood as not being drawn to scale unless specifically noted.