Data deduplication reduces storage requirements of a system by removing redundant data, while preserving the appearance and the presentation of the original data. For example, two or more identical copies of the same document may appear in storage in a computer and may be identified by unrelated names. Normally, storage is required for each document. Through data deduplication, the redundant data in storage is identified and removed, freeing storage space for other data. Where multiple copies of the same data are stored, the reduction of used storage may become significant. Portions of documents or files that are identical to portions of other documents or files may also be deduplicated, resulting in additional storage reduction.
To implement data deduplication, in one example, data blocks are hashed, resulting in hash values that are smaller than the original blocks of data and that, with very high probability, uniquely represent the respective data blocks. A 20-byte SHA-1 hash or a 16-byte MD5 hash may be used, for example. Blocks with the same hash value are identified and only one copy of that data block is stored. Pointers to all the locations of the blocks with the same data are stored in a table, in association with the hash value of the blocks.
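The hashing scheme described above can be illustrated with a minimal sketch. It assumes fixed-size 4096-byte blocks and an in-memory table; the block size and the names `split_blocks`, `deduplicate`, `store`, and `table` are hypothetical choices made for illustration and are not taken from any particular product.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed fixed block size; the text does not specify one


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield (offset, block) pairs for consecutive fixed-size blocks."""
    for offset in range(0, len(data), block_size):
        yield offset, data[offset:offset + block_size]


def deduplicate(files: dict[str, bytes], block_size: int = BLOCK_SIZE):
    """Return (store, table) for the given files.

    store maps a SHA-1 hex digest to the single stored copy of a block.
    table maps the same digest to every (file name, offset) location
    at which a block with that content appears.
    """
    store: dict[str, bytes] = {}
    table: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for name, data in files.items():
        for offset, block in split_blocks(data, block_size):
            digest = hashlib.sha1(block).hexdigest()
            store.setdefault(digest, block)       # keep only the first copy
            table[digest].append((name, offset))  # remember every location
    return store, dict(table)


files = {"a.txt": b"A" * 8192 + b"B" * 4096, "b.txt": b"A" * 4096}
store, table = deduplicate(files)
```

In this example four blocks are presented but only two are stored, and the table entry for the repeated block lists its three locations.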
A remote deduplication appliance may be provided to perform deduplication for other machines, such as client machines, that store data to be deduplicated. The deduplication appliance may provide a standard network file interface, such as Network File System (“NFS”) or Common Internet File System (“CIFS”), to the other machines. Data input to the appliance by the machines is analyzed for data block redundancy. Storage space on or associated with the deduplication appliance is then allocated by the deduplication appliance only to the unique data blocks that are not already stored on or by the appliance. Redundant data blocks (those having a hash value, for example, that is the same as that of a data block that is already stored) are discarded. A pointer may be provided in a stub file to associate the stored data block with the location or locations of the discarded data block or blocks. No deduplication takes place until a client sends data to be deduplicated.
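The appliance behavior described above may be modeled, under simplifying assumptions, as follows. The sketch keeps everything in memory, uses fixed-size blocks, and represents each stub file as an ordered list of block digests; the class `DedupAppliance` and its methods are hypothetical and do not correspond to any actual appliance interface, and the network file interface is omitted.

```python
import hashlib


class DedupAppliance:
    """Toy in-memory model of a deduplication appliance (hypothetical)."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks: dict[bytes, bytes] = {}      # digest -> unique stored block
        self.stubs: dict[str, list[bytes]] = {}   # path -> ordered digests ("stub file")

    def write(self, path: str, data: bytes) -> int:
        """Store data under path; return the number of newly allocated bytes."""
        new_bytes = 0
        stub = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha1(block).digest()
            if digest not in self.blocks:
                # Unique block: allocate storage for it.
                self.blocks[digest] = block
                new_bytes += len(block)
            # Redundant or not, the stub records a pointer to the stored block.
            stub.append(digest)
        self.stubs[path] = stub
        return new_bytes

    def read(self, path: str) -> bytes:
        """Reassemble the original data by following the stub's pointers."""
        return b"".join(self.blocks[d] for d in self.stubs[path])


appliance = DedupAppliance(block_size=4)
first = appliance.write("/share/a", b"abcdabcdxy")
second = appliance.write("/share/b", b"abcdabcd")
```

Here the second write allocates no new storage, since every block it contains is already held by the appliance, yet both paths read back their original contents.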
This process can be dynamic, where the process is conducted while the data is arriving at the deduplication appliance, or delayed, where the arriving data is temporarily stored and then analyzed by the deduplication appliance. In either case, the client machine storing the data to be deduplicated must transmit the data set to the deduplication appliance before the redundancy can be removed. The deduplication process is transparent to the client machines that are putting the data into the storage system. The users of the client machines do not, therefore, require special or specific knowledge of the workings of the deduplication appliance. The client machine may mount network shared storage (“network share”) of the deduplication appliance to transmit the data. Data is transmitted to the deduplication appliance via NFS, CIFS, or another protocol providing the transport and interface.
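The difference between the dynamic and delayed modes may be sketched as follows. The functions `ingest_inline` and `ingest_delayed` are hypothetical names, and the second return value (bytes held in temporary storage) is an illustrative measure introduced here, not one taken from the text. In both modes the blocks have already been transmitted to the appliance before they are examined.

```python
import hashlib
from typing import Iterable


def _dedup_block(store: dict[str, bytes], block: bytes) -> None:
    """Keep the block only if no block with the same SHA-1 digest is stored."""
    store.setdefault(hashlib.sha1(block).hexdigest(), block)


def ingest_inline(blocks: Iterable[bytes]) -> tuple[dict[str, bytes], int]:
    """Dynamic mode: deduplicate each block as it arrives; nothing is staged."""
    store: dict[str, bytes] = {}
    for block in blocks:
        _dedup_block(store, block)
    return store, 0


def ingest_delayed(blocks: Iterable[bytes]) -> tuple[dict[str, bytes], int]:
    """Delayed mode: stage all arriving blocks, then deduplicate the staged data."""
    staging = list(blocks)                       # temporary storage
    staged_bytes = sum(len(b) for b in staging)
    store: dict[str, bytes] = {}
    for block in staging:
        _dedup_block(store, block)
    return store, staged_bytes


arriving = [b"aaaa", b"bbbb", b"aaaa"]
inline_store, inline_staged = ingest_inline(arriving)
delayed_store, delayed_staged = ingest_delayed(arriving)
```

Both modes end with the same set of unique blocks; the delayed mode simply requires temporary space for the full data set before analysis begins.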
When a user on a client machine accesses a document or other data from the client machine, the data is looked up in the deduplication appliance according to index information and returned to the user transparently, via NFS, CIFS, or other network protocols. If a user decides to copy a document from a first location to a second location, for data management operations, for example, the entire data set must be retrieved from the deduplication appliance and sent back to the client machine. If the destination location happens to be the deduplication appliance, the copy of the data in the second location will be deduplicated again, as new data to be backed up. This is cumbersome, and may consume considerable network bandwidth as well as CPU resources on both the client and the deduplication appliance.