Technical Field
This application relates generally to data communication over a network.
Brief Description of the Related Art
Distributed computer systems are well-known in the prior art. One such distributed computer system is a “content delivery network” or “CDN” that typically is operated and managed by a service provider. The service provider typically provides the content delivery service on behalf of third parties (customers) who use the service provider's shared infrastructure. A distributed system of this type is sometimes referred to as an “overlay network” and typically refers to a collection of autonomous computers linked by a network or networks, together with the software, systems, protocols and techniques designed to facilitate various services, such as content delivery, application acceleration, or other support of outsourced origin site infrastructure. A CDN service provider typically provides service delivery through digital properties (such as a website), which are provisioned in a customer portal and then deployed to the network.
Data differencing is a known technology and method to leverage shared prior instances of a resource, also known as versions of data within a shared dictionary in compression terminology, between a server and a client; the process works by only sending the differences or changes that have occurred since those prior instance(s). Data differencing is related to compression, but it is a slightly distinct concept. In particular, intuitively, a difference (“diff”) is a form of compression. As long as the receiver has the same original file as a sender, that sender can give the receiver a diff instead of the entire new file. The diff in effect explains how to create the new file from the old. It is usually much smaller than the whole new file and thus is a form of compression. The diff between a first version of a document and a second version of that same document is the data difference; the data difference is the result of compression of the second version of a document using the first version of the document as a preset dictionary.
Stream-based data deduplication (“dedupe”) systems are also known in the prior art. In general, stream-based data deduplication systems work by examining the data that flows through a sending peer of a connection and replacing blocks of that data with references that point into a shared dictionary that each peer has synchronized around the given blocks. The reference itself is much smaller than the data and often is a hash or fingerprint of it. When a receiving peer receives the modified stream, it replaces the reference with the original data to make the stream whole again. For example, consider a system where the fingerprint is a unique hash that is represented with a single letter variable. The sending peer's dictionary then might look as shown in FIG. 3. The receiving peer's dictionary might look as shown in FIG. 4. Then, for example, if the sending peer is supposed to send a string such as “Hello, how are you? Akamai is Awesome!” the deduplication system would instead process the data and send the following message: “He[X]re you? [T][M] ome!” The receiving peer decodes the message using its dictionary. Note that, in this example, the sending peer does not replace “ome!” with the reference [0]. This is because, although the sending peer has a fingerprint and block stored it its cache, that peer knows (through a mechanism) that the receiving peer does not. Therefore, the sending peer does not insert the reference in the message before sending it. A system of this type typically populates the dictionaries, which are symmetric, in one of several, known manners. In one approach, dictionary data is populated in fixed length blocks (e.g., every block is 15 characters in length) as a stream of data flows through the data processor. The first time the data passes through both the sending and receiving peers, and assuming they both construct dictionaries in the same way, both peers end up having a dictionary that contains the same entries. This approach, however, is non-optimal, as it is subject to a problem known as the “shift” problem, which can adversely affect the generated fingerprints and undermining the entire scheme.
An alternative approach uses variable-length blocks using hashes computed in a rolling manner. In a well-known solution based on a technique known as Rabin fingerprinting, the system slides a window of a certain size (e.g., 48 bytes) across a stream of data during the fingerprinting process. An implementation of the technique is described in a paper titled “A Low-Bandwidth Network File System” (LBFS), by Muthitacharoen et al, and the result achieves variable size shift-resistant blocks.
Current vendors supplying stream-based data deduplication products and services address the problem of dictionary discovery (knowing what information is in a peer's dictionary) by pairing devices. Thus, for example, appliance/box vendors rely on a pair of devices or processes on each end to communicate with each other to maintain tables that let each side know what references exist in the paired peer. This type of solution, however, only works when dealing with individual boxes and units that represent “in path” pairs.
In path-paired solutions, however, are not practical in the context of an overlay network such as a CDN, where the distribution of nodes more closely resembles a tree. Thus, for example, in a representative implementation, and with respect to a particular origin server (or, more generally, a “tenant” located at a “root”), the overlay may have parent tier servers closer to the root, and client edge servers closer to the leaf nodes. In other words, instead of a box needing to be aware of a small set of one or more peer boxes (such as in known box vendor solutions), a parent tier server may need to be in contact with tens, hundreds or even thousands of edge regions, each containing potentially many servers. In this context, per machine tables cannot scale.
Thus, there remains a need to provide enhanced techniques for data deduplication in the context of an overlay network.