Deduplication has become the common term for any technique that attempts to remove duplicate data from a system, whether to save disk space or network bandwidth. A deduplicating file system, for example, stores only one copy of a file, even if the file exists under multiple distinct paths in the file system tree. A number of different techniques for accomplishing this deduplication in file systems have been developed over the years.
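The single-copy behavior described above can be illustrated with a minimal sketch. The class name DedupStore, its methods, the example paths, and the choice of SHA-256 as the content key are assumptions made for illustration only; they are not taken from any particular file system.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): each distinct content is kept once."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> file content
        self.paths = {}  # path -> digest of the content stored under that path

    def write(self, path: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        # Store the content only if no identical content is already present.
        self.blobs.setdefault(digest, data)
        self.paths[path] = digest

    def read(self, path: str) -> bytes:
        return self.blobs[self.paths[path]]


store = DedupStore()
store.write("/home/alice/report.txt", b"quarterly numbers")
store.write("/backup/report.txt", b"quarterly numbers")
print(len(store.paths), "paths,", len(store.blobs), "stored copy")
```

Both paths resolve to the same stored content, so the second write consumes no additional space for the data itself. A real file system would also need reference counting and copy-on-write semantics, which this sketch omits.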
Network deduplication, in contrast, refers to eliminating transfers of data between two parties if those transfers contain content that has already been transferred in the past. The first system to deduplicate data over a network was Muthitacharoen's Low-Bandwidth File System (LBFS), described in “A Low-Bandwidth Network File System,” in Proceedings of ACM SOSP, 2001. LBFS is a client-server protocol in which both sides keep an index of the SHA-1 hashes of all of the file system blocks of which they are aware. To download a file from the server, a client first asks the server for only the SHA-1 hashes of the blocks of the file in question. The client then requests the data for only those blocks whose content it does not already know (determined by checking the client's index). Likewise, before uploading new data to the server, the client sends only the SHA-1 hashes of the relevant blocks, and the server responds with a list of the blocks whose content it does not already know. The client then uploads only the content of these unknown blocks.
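The hash exchange described above can be sketched as follows. This is an in-process illustration under stated assumptions, not the LBFS wire protocol: the Server and Client classes, their method names, and the example block contents are all hypothetical, and method calls stand in for network messages.

```python
import hashlib


def sha1(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()


class Server:
    def __init__(self):
        self.index = {}  # SHA-1 hex digest -> block content
        self.files = {}  # file name -> ordered list of block digests

    def block_hashes(self, name):
        return list(self.files[name])

    def get_block(self, digest):
        return self.index[digest]

    def missing(self, digests):
        # Digests whose block content the server does not yet hold.
        return [d for d in digests if d not in self.index]

    def commit(self, name, digests, new_blocks):
        for block in new_blocks:
            self.index[sha1(block)] = block
        self.files[name] = list(digests)


class Client:
    def __init__(self):
        self.index = {}  # SHA-1 hex digest -> block content

    def download(self, server, name):
        digests = server.block_hashes(name)  # hashes only, no data yet
        fetched = 0
        for d in digests:
            if d not in self.index:  # request data only for unknown blocks
                self.index[d] = server.get_block(d)
                fetched += 1
        return b"".join(self.index[d] for d in digests), fetched

    def upload(self, server, name, blocks):
        digests = [sha1(b) for b in blocks]
        self.index.update(zip(digests, blocks))
        wanted = set(server.missing(digests))  # server lists what it lacks
        to_send = {d: b for d, b in zip(digests, blocks) if d in wanted}
        server.commit(name, digests, list(to_send.values()))
        return len(to_send)


blocks_a = [b"alpha", b"beta", b"gamma"]
server = Server()
server.commit("a.txt", [sha1(b) for b in blocks_a], blocks_a)

client = Client()
content, fetched_first = client.download(server, "a.txt")
_, fetched_again = client.download(server, "a.txt")
sent = client.upload(server, "b.txt", [b"alpha", b"delta", b"gamma"])
print(fetched_first, fetched_again, sent)
```

In this sketch the repeated download transfers no block data, and the upload of the second file sends only the one block the server has not seen before.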
A network proxy is a machine that intercepts network packets from one machine and possibly transforms them before forwarding them to their intended recipient. Such transformation may include modifying existing packets, dropping packets, or fabricating completely new packets. A Hypertext Transfer Protocol (HTTP) proxy is one example of a network proxy. A proxy can be either explicit or transparent, the distinction being whether or not one or both communication endpoints are explicitly configured to use the proxy. Network proxies may also be paired, with one proxy on either end of a connection. A virtual private network (VPN) can be implemented using two such proxies, with one proxy encrypting traffic from the local network before transmitting it into the public network, and the other proxy decrypting traffic from the public network and transmitting it on the remote network.
A deduplicating network proxy is one that, paired with another deduplicating proxy on the other end of a connection, attempts to reduce the transfer of duplicate data across the network between them. For example, assume Alice and Bob are separated by a pair of deduplicating proxies. Alice transmits a file from her computer to that of a friend, Bob. Bob changes one byte of the file and sends it back to Alice. For the second transfer, the deduplicating proxy closest to Bob will (ideally) send to its peer proxy only a notification that a transfer should take place, together with the value and offset of the byte Bob actually changed. The proxy closest to Alice will then replay the entire transfer to Alice, including the changed byte. For a large file, this differential transfer can conserve a great deal of network bandwidth between the two proxies.
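The Alice and Bob exchange above can be reduced to a short sketch. It assumes, for illustration only, that both proxies have cached the first transfer and that the edited file has the same length as the cached one; the function names encode_delta and apply_delta and the sample file content are hypothetical.

```python
def encode_delta(cached: bytes, new: bytes):
    """Return (offset, value) pairs for every byte of `new` that differs from `cached`.

    Simplifying assumption: both payloads have the same length.
    """
    if len(cached) != len(new):
        raise ValueError("this sketch only handles equal-length payloads")
    return [(i, b) for i, (a, b) in enumerate(zip(cached, new)) if a != b]


def apply_delta(cached: bytes, delta) -> bytes:
    """Rebuild the new payload from the cached copy plus the changed bytes."""
    buf = bytearray(cached)
    for offset, value in delta:
        buf[offset] = value
    return bytes(buf)


original = b"hello world, this is Alice's file"   # cached by both proxies
edited = original[:6] + b"W" + original[7:]       # Bob changes one byte

delta = encode_delta(original, edited)            # computed by the proxy near Bob
replayed = apply_delta(original, delta)           # rebuilt by the proxy near Alice
print(len(delta), "changed byte sent instead of", len(edited), "bytes")
```

Only the single (offset, value) pair crosses the link between the proxies, while Alice still receives the complete edited file. Insertions and deletions, which change the file length, are not handled by this byte-for-byte comparison.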
Rabin fingerprinting is a technique for incrementally generating hashes of n-byte substrings of a large file, described by Rabin in “Fingerprinting by Random Polynomials,” Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981. In a naive implementation of LBFS that used fixed-size blocks, the insertion of a single byte at the beginning of a file would change the contents of all subsequent blocks (shifting them over one place), and thus change all of their SHA-1 hashes. As such, if a user were to download a file, insert a byte at the beginning, and upload the result, this naive version of LBFS would be unable to deduplicate the transfer. LBFS uses Rabin fingerprinting to identify similar substrings of network traffic in a way that is not subject to this offset problem. However, such techniques have not been very effective for optimizing WAN traffic.
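The offset problem and its content-defined remedy can be sketched as follows. As a loudly stated assumption, the sketch substitutes a simple polynomial rolling hash (in the style of Rabin-Karp) for a true Rabin fingerprint over GF(2); the constants WINDOW, BASE, MOD, and MASK and the function name chunk are illustrative choices, not the parameters used by LBFS.

```python
import random

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 61) - 1    # large prime modulus (assumed value)
MASK = (1 << 6) - 1    # boundary when the low 6 bits are zero, about 64-byte chunks


def chunk(data: bytes):
    """Split data wherever the rolling hash of the last WINDOW bytes has its low bits zero."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4000))

original = chunk(data)
shifted = chunk(b"X" + data)  # one byte inserted at the beginning

shared = set(original) & set(shifted)
print(len(original), "chunks,", len(shared), "distinct chunks unchanged after the insertion")
```

Because each boundary depends only on the WINDOW bytes that precede it, the boundaries after the inserted byte fall at the same content positions as before, so every chunk except the first is reproduced exactly and its hash would already be known to the peer. Fixed-size blocks offer no such resynchronization.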