1. Technical Field
This application relates generally to data communication over a network.
2. Brief Description of the Related Art
Distributed computer systems are well-known in the prior art. One such distributed computer system is a “content delivery network” or “CDN” that typically is operated and managed by a service provider. The service provider typically provides the content delivery service on behalf of third parties (customers) who use the service provider's shared infrastructure. A distributed system of this type, sometimes referred to as an “overlay network,” typically comprises a collection of autonomous computers linked by a network or networks, together with the software, systems, protocols and techniques designed to facilitate various services, such as content delivery, application acceleration, or other support of outsourced origin site infrastructure. A CDN service provider typically provides service delivery through digital properties (such as a website), which are provisioned in a customer portal and then deployed to the network.
Data differencing is a known technique that leverages prior instances of a resource shared between a server and a client (in compression terminology, versions of data within a shared dictionary); the process works by sending only the differences or changes that have occurred since those prior instance(s). Data differencing is related to compression, but is a slightly distinct concept. Intuitively, a difference (“diff”) is a form of compression: as long as the receiver has the same original file as the sender, the sender can give the receiver a diff instead of the entire new file. The diff in effect explains how to create the new file from the old, and it is usually much smaller than the whole new file. Stated another way, the data difference between a first version of a document and a second version of that same document is the result of compressing the second version using the first version as a preset dictionary.
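The preset-dictionary view of a diff can be illustrated with a minimal Python sketch. It is an assumption of this illustration, not something the description above specifies, that zlib's preset-dictionary feature stands in for the differencing algorithm; the helper names `make_diff`, `apply_diff` and `render`, and the sample page, are likewise hypothetical.

```python
import zlib

def make_diff(old: bytes, new: bytes) -> bytes:
    """Compress `new` using `old` as a preset dictionary (the 'diff')."""
    compressor = zlib.compressobj(level=9, zdict=old)
    return compressor.compress(new) + compressor.flush()

def apply_diff(old: bytes, diff: bytes) -> bytes:
    """Rebuild the new version from the old version plus the diff."""
    decompressor = zlib.decompressobj(zdict=old)
    return decompressor.decompress(diff) + decompressor.flush()

def render(prices):
    """Hypothetical page: one table row per ticker symbol."""
    rows = "".join(
        f"<tr><td>TCK{i:03d}</td><td>{p:.2f}</td></tr>"
        for i, p in enumerate(prices)
    )
    return f"<html><body><table>{rows}</table></body></html>".encode()

prices = [100 + i * 1.37 for i in range(200)]
old_version = render(prices)
prices[17] += 0.25                      # a single quote changes
new_version = render(prices)

diff = make_diff(old_version, new_version)
standalone = zlib.compress(new_version, 9)

# Sizes in bytes: full page, page compressed alone, page compressed against old.
print(len(new_version), len(standalone), len(diff))
```

The receiver can only decode the diff if it holds exactly the same old version as the sender, which is why both sides must agree on which prior instance is being used.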
Many HTTP (Hypertext Transfer Protocol) requests cause the retrieval of only slightly-modified instances of resources for which the requesting client already has a cache entry. For example, an origin server may publish a page of stock quotes for every company listed in the S&P 500. As time goes on and the quotes change, the overall page remains very similar. The names of the companies and their ticker symbols, CSS, images, and general HTML formatting probably remain unchanged from version to version. When the client requests an updated page, however, it will end up downloading the content in its entirety, including those items discussed above that do not differ from the data the client has already downloaded in prior versions. Because such modifying updates may be frequent and the modifications are often much smaller than the actual entity, the concept of “delta encoding” (by which the sending entity would transfer a minimal description of the changes, rather than an entire new instance of the resource) was proposed for HTTP. This concept, which is a way to make more efficient use of network bandwidth, was described in Internet Request for Comments (RFC) 3229.
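The request and response exchange that RFC 3229 defines can be sketched as a small in-memory simulation. The header names (`If-None-Match`, `A-IM`, `IM`, `Delta-Base`) and the 226 (IM Used) status code come from the RFC; everything else is an assumption of this sketch: the `Origin` class, the `refresh` helper, and in particular the `x-zlib-dict` token, which is a hypothetical stand-in for a registered delta format such as vcdiff.

```python
import zlib

# Hypothetical instance-manipulation token for this sketch only; RFC 3229
# names formats such as "vcdiff", which the Python standard library lacks.
IM_TOKEN = "x-zlib-dict"

class Origin:
    """Toy origin server that remembers prior instances by entity tag."""

    def __init__(self):
        self.instances = {}        # entity tag -> instance body
        self.current_etag = None

    def publish(self, etag, body):
        self.instances[etag] = body
        self.current_etag = etag

    def get(self, request_headers):
        current = self.instances[self.current_etag]
        base_etag = request_headers.get("If-None-Match")
        accepted = [t.strip() for t in request_headers.get("A-IM", "").split(",")]
        if base_etag == self.current_etag:
            return 304, {"ETag": self.current_etag}, b""
        if base_etag in self.instances and IM_TOKEN in accepted:
            # The origin does the differencing against the client's base instance.
            c = zlib.compressobj(zdict=self.instances[base_etag])
            delta = c.compress(current) + c.flush()
            headers = {"ETag": self.current_etag, "IM": IM_TOKEN,
                       "Delta-Base": base_etag}
            return 226, headers, delta
        return 200, {"ETag": self.current_etag}, current

def refresh(origin, cache):
    """Client side: `cache` is {} or {'etag': ..., 'body': ...}."""
    headers = {}
    if cache:
        headers = {"If-None-Match": cache["etag"], "A-IM": IM_TOKEN}
    status, resp_headers, body = origin.get(headers)
    if status == 226:              # delta: rebuild from the cached base instance
        d = zlib.decompressobj(zdict=cache["body"])
        body = d.decompress(body) + d.flush()
    elif status == 304:            # not modified: reuse the cached body
        body = cache["body"]
    return status, {"etag": resp_headers["ETag"], "body": body}

page_v1 = b"<html><body><p>ACME 10.00</p><p>INIT 20.00</p></body></html>"
page_v2 = b"<html><body><p>ACME 10.25</p><p>INIT 20.00</p></body></html>"

origin = Origin()
origin.publish('"v1"', page_v1)
first_status, cache = refresh(origin, {})       # full response
origin.publish('"v2"', page_v2)
second_status, cache = refresh(origin, cache)   # delta response
third_status, cache = refresh(origin, cache)    # not modified
print(first_status, second_status, third_status)
```

Note that the origin in this sketch must retain each prior instance it may be asked to diff against and must run the differencing itself.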
Delta encoding per RFC 3229 does not address all issues that arise in the context of a distributed overlay network, such as a content delivery network. The largest concern is that the approach is based on the origin server doing the differencing. In the case of an overlay network, however, the service provider desires to provide services for customers so they do not have to add new software to their origin servers. Indeed, many customers will have vendor solutions that prohibit them from adding software or otherwise make it difficult. Therefore, an overlay provider will most likely have to do differencing in another server that sits in front of the origin server, primarily because the provider does not have all of the new version data present on disk or in memory against which a data difference might need to be calculated. Instead, the overlay network provider, in this context, receives data over the wire and has to wait orders of magnitude longer than a disk read or memory fetch to get all of it. In an RFC 3229-compliant solution, there is no way to start the differencing process on chunks and then send those down to the client while simultaneously reading new source chunks from the origin. Additionally, RFC 3229 relies upon entity tags (ETags) and “last modified time” to reference a prior version document.
Another approach to this problem is provided by a technology from Google called Shared Dictionary Compression over HTTP (SDCH), which is another HTTP data difference mechanism. The main difference between it and RFC 3229 is that SDCH allows a dictionary to be something other than a previous version of the content. It also allows sharing of that dictionary between multiple resources. For example, if there are three HTML files that each contain a set of common phrases, the SDCH approach enables the creation of a single dictionary that can then be referenced by each HTML file. The user agent downloads that dictionary (D) separately; whenever it needs one of the HTML files, it then instructs the server to “give me HTML file X compressed with dictionary D.” The server then sends the compressed file and the client decompresses it using the shared dictionary. While this approach is efficient, there is no easy way to compute the shared dictionary.
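The three-file example can be sketched as follows. This is an illustration under stated assumptions rather than the SDCH wire format: zlib's preset dictionary stands in for the dictionary-based encoding, the dictionary is assembled by hand from phrases known to be common, and the file names, page contents and the `encode`/`decode` helpers are hypothetical.

```python
import zlib

# Phrases common to every page; in this sketch the shared dictionary D is
# hand-built from them, which sidesteps the hard part of computing it.
HEADER = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
          b'<link rel="stylesheet" href="/site.css"></head><body>'
          b'<nav>Home | Products | Support | Contact</nav>')
FOOTER = b'<footer>Copyright Example Corp. All rights reserved.</footer></body></html>'
shared_dictionary = HEADER + FOOTER

# Three hypothetical HTML files that each contain the common phrases.
pages = {
    "a.html": HEADER + b"<h1>Quarterly results</h1>" + FOOTER,
    "b.html": HEADER + b"<h1>Press releases</h1>" + FOOTER,
    "c.html": HEADER + b"<h1>Investor contacts</h1>" + FOOTER,
}

def encode(body: bytes, dictionary: bytes) -> bytes:
    """Server side: 'file X compressed with dictionary D'."""
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(body) + c.flush()

def decode(payload: bytes, dictionary: bytes) -> bytes:
    """Client side: decompress using the separately downloaded dictionary."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(payload) + d.flush()

for name, body in pages.items():
    payload = encode(body, shared_dictionary)
    assert decode(payload, shared_dictionary) == body
    # Sizes in bytes: original, compressed alone, compressed with D.
    print(name, len(body), len(zlib.compress(body, 9)), len(payload))
```

The dictionary here is trivial to write because the common phrases were chosen up front; deriving a good dictionary automatically from arbitrary site content is the difficulty noted above.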
While these known differencing approaches provide useful advantages, there remains a need to provide enhanced techniques for data differencing in the context of an overlay network.