Technical Field
This patent document relates generally to distributed data processing systems, to the delivery of content over computer networks, and to systems and methods for accelerating such delivery using prefetching techniques.
Brief Description of the Related Art
It is known in the art to use an intermediary device, such as a forward or reverse proxy server, to facilitate the delivery of content from servers to requesting client devices. It is also known for intermediary and other servers to prefetch content in anticipation of a client's request, so as to improve the speed with which the intermediary can respond to the request.
One example, not meant to be limiting, of a content delivery platform that can use prefetching is a “content delivery network” or “CDN” that is operated and managed by a service provider. The service provider typically provides the content delivery service on behalf of third parties. A “distributed system” of this type typically refers to a collection of autonomous computers linked by a network or networks, together with the software, systems, protocols and techniques designed to facilitate various services, such as content delivery and/or the support of outsourced site infrastructure. This infrastructure is shared by multiple tenants, typically the content providers. The infrastructure is generally used for the storage, caching, or transmission of content on behalf of such content providers or other tenants.
In a known system such as that shown in FIG. 1, a distributed computer system 100 is configured as a content delivery network (CDN) and has a set of servers 102 distributed around the Internet. Preferably, many of the servers are located near the edge of the Internet, i.e., at or adjacent end user access networks. A network operations command center (NOCC) 104 may be used to administer and manage operations of the various machines in the system. Third party sites affiliated with content providers, such as web site 106, offload delivery of content (e.g., HTML or other markup language files, embedded page objects, streaming media, software downloads, and the like) to the distributed computer system 100 and, in particular, to the CDN servers (also referred to as the content servers). Such content servers may be grouped together into a point of presence (POP) 107 at a particular geographic location.
As noted, the CDN servers are typically located at nodes that are publicly-routable on the Internet, within or adjacent nodes that are located in mobile networks, in or adjacent enterprise-based private networks, or in any combination thereof.
Typically, content providers offload their content delivery by aliasing (e.g., by a DNS CNAME) given content provider domains or sub-domains to domains that are managed by the service provider's authoritative domain name service. The service provider's domain name service directs end user client machines 122 that desire content to the distributed computer system (or more particularly, to one of the CDN servers) to obtain the content. A given CDN server typically operates as a proxy server, responding to the client requests, for example, by fetching requested content from a local cache, from another CDN server, from the origin server 106 associated with the content provider, or other source, and sending it to the requesting client.
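By way of illustration only, the proxy behavior described above can be sketched as a lookup across several content sources. The sketch assumes the sources are consulted in a fixed order (local cache, then another CDN server, then origin); the text above lists the sources but does not prescribe an order. The names `Source` and `fetch_via_sources` are hypothetical and do not come from any referenced system.

```python
from typing import Callable, Iterable, Optional

# A source takes a URL and returns the object body, or None if it cannot
# supply the object. (Hypothetical interface, for illustration only.)
Source = Callable[[str], Optional[bytes]]


def fetch_via_sources(url: str, sources: Iterable[Source]) -> Optional[bytes]:
    """Return the body from the first source that has the object."""
    for source in sources:
        body = source(url)
        if body is not None:
            return body
    return None  # no source could supply the object


# Example wiring, assuming a cache-first ordering.
local_cache = {"/logo.png": b"cached-logo"}
origin_store = {"/logo.png": b"origin-logo", "/index.html": b"<html></html>"}
sources = [
    local_cache.get,      # local cache
    lambda url: None,     # stand-in for a peer CDN server with nothing cached
    origin_store.get,     # origin server
]
```

With this wiring, a request for `/logo.png` is satisfied from the local cache, while `/index.html` falls through to the origin.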
For cacheable content, CDN servers typically employ a caching model that relies on setting a time-to-live (TTL) for each cacheable object. After it is fetched, the object may be stored locally at a given CDN server until the TTL expires, at which time it is typically re-validated or refreshed from the origin server 106. For non-cacheable objects (sometimes referred to as ‘dynamic’ content), the CDN server typically returns to the origin server 106 each time the object is requested by a client. The CDN may operate a cache hierarchy to provide intermediate caching of customer content in various CDN servers closer to the CDN server handling a client request than the origin server 106; one such cache hierarchy subsystem is described in U.S. Pat. No. 7,376,716, the disclosure of which is incorporated herein by reference.
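The TTL-based caching model described above can be illustrated with a minimal sketch. It assumes a simplified design in which an expired object is refreshed with a full fetch (conditional re-validation is omitted) and in which a TTL of `None` marks an object as non-cacheable. The class and parameter names are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheEntry:
    body: bytes
    expires_at: float  # clock value at which the TTL lapses


class TtlCache:
    """Hypothetical TTL cache; names are illustrative, not from the source."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_from_origin
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, url: str, ttl_seconds: Optional[float]) -> bytes:
        # Non-cacheable ("dynamic") objects go to origin on every request.
        if ttl_seconds is None:
            return self._fetch(url)
        entry = self._entries.get(url)
        if entry is not None and self._clock() < entry.expires_at:
            return entry.body  # still fresh: serve from the local cache
        # Missing or expired: refresh from origin and restart the TTL.
        body = self._fetch(url)
        self._entries[url] = CacheEntry(body, self._clock() + ttl_seconds)
        return body
```

The injectable `clock` is a design convenience for testing expiry without waiting; a production cache would also bound its size and handle origin failures, which this sketch does not attempt.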
Although not shown in detail in FIG. 1, the distributed computer system may also include other infrastructure, such as a distributed data collection system 108 that collects usage and other data from the CDN servers, aggregates that data across a region or set of regions, and passes that data to other back-end systems 110, 112, 114 and 116 to facilitate monitoring, logging, alerts, billing, management and other functions. Distributed network agents 118 monitor the network as well as the server loads and provide network, traffic and load data to a DNS query handling mechanism 115. A distributed data transport mechanism 120 may be used to distribute control information (e.g., metadata to manage content, to facilitate load balancing, and the like) to the CDN servers. The CDN may include a network storage subsystem which may be located in a network datacenter accessible to the CDN servers and which may act as a source of content, such as described in U.S. Pat. No. 7,472,178, the disclosure of which is incorporated herein by reference.
As illustrated in FIG. 2, a given machine 200 in the CDN comprises commodity hardware (e.g., a microprocessor) 202 running an operating system kernel (such as Linux® or variant) 204 that supports one or more applications 206. To facilitate content delivery services, for example, given machines typically run a set of applications, such as an HTTP proxy 207, a name service 208, a local monitoring process 210, a distributed data collection process 212, and the like. The HTTP proxy 207 typically includes a manager process for managing a cache and delivery of content from the machine. For streaming media, the machine may include one or more media servers.
A given CDN server shown in FIG. 1 may be configured to provide one or more extended content delivery features, preferably on a domain-specific, content-provider-specific basis, preferably using configuration files that are distributed to the CDN servers using a configuration system. A given configuration file is preferably XML-based and includes a set of content handling rules and directives that facilitate one or more advanced content handling features. The configuration file may be delivered to the CDN server via the data transport mechanism. U.S. Pat. No. 7,240,100, the contents of which are hereby incorporated by reference, describes a useful infrastructure for delivering and managing CDN server content control information, and this and other control information (sometimes referred to as “metadata”) can be provisioned by the CDN service provider itself or (via an extranet or the like) by the content provider customer. U.S. Pat. No. 7,111,057, incorporated herein by reference, describes an architecture for purging content from the CDN. More information about a CDN platform can be found in U.S. Pat. Nos. 6,108,703 and 7,596,619, and as pertains to delivery of streaming media, U.S. Pat. No. 7,296,082, and U.S. Publication Nos. 2011/0173345 and 2012/0265853, the teachings of all of which are hereby incorporated by reference in their entirety.
In a typical operation, a content provider identifies a content provider domain or sub-domain that it desires to have served by the CDN. When a DNS query to the content provider domain or sub-domain is received at the content provider's domain name servers, those servers respond by returning the CDN hostname (e.g., via a canonical name, or CNAME, or other aliasing technique). That network hostname points to the CDN, and that hostname is then resolved through the CDN name service. To that end, the CDN name service returns one or more IP addresses. The requesting client application (e.g., browser) then makes a content request (e.g., via HTTP or HTTPS) to a CDN server machine associated with the IP address. The request includes a host header that includes the original content provider domain or sub-domain. Upon receipt of the request with the host header, the CDN server checks its configuration file to determine whether the content domain or sub-domain requested is actually being handled by the CDN. If so, the CDN server applies its content handling rules and directives for that domain or sub-domain as specified in the configuration. As aforementioned, these content handling rules and directives may be located within an XML-based “metadata” configuration file.
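The host header check described above can be sketched as follows. This is an illustration under stated assumptions: the in-memory `CONFIG` mapping is a hypothetical stand-in for the XML-based configuration file (whose schema is not given here), the rule strings are invented placeholders, and only hostname-style Host values (not IPv6 literals) are handled.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the per-domain configuration. The rule strings
# are placeholders, not a real metadata format.
CONFIG: Dict[str, List[str]] = {
    "www.example.com": ["cache-ttl=300", "prefetch=on"],
}


def rules_for_request(headers: Dict[str, str]) -> Optional[List[str]]:
    """Return the content handling rules for the request's Host, or None
    if the domain is not being handled."""
    host = ""
    for name, value in headers.items():
        if name.lower() == "host":  # header names are case-insensitive
            host = value
            break
    # Normalize: trim, lowercase, and drop an optional ":port" suffix.
    host = host.strip().lower().split(":", 1)[0]
    return CONFIG.get(host)
```

A `None` result corresponds to the case in which the requested domain or sub-domain is not actually being handled by the CDN.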
The CDN platform may be considered an overlay across the Internet which improves communication efficiency. As such, the CDN platform may be used to facilitate wide area network (WAN) acceleration services between enterprise data centers and/or between branch-headquarter offices (which may be privately managed), as well as to/from third party software-as-a-service (SaaS) providers or other cloud providers used by the enterprise users.
To accomplish these use cases, CDN software may execute on machines (potentially in virtual machines running on customer hardware) hosted in one or more customer data centers, and on machines hosted in remote “branch offices.” This type of solution provides an enterprise with the opportunity to take advantage of CDN technologies with respect to their company's intranet, providing a wide-area-network optimization solution. It extends acceleration for the enterprise to applications served anywhere on the Internet. By bridging an enterprise's CDN-based private overlay network with the existing CDN public internet overlay network, an end user at a remote branch office obtains an accelerated application end-to-end. FIG. 3 illustrates a general architecture for a WAN optimized, “behind-the-firewall” service offering such as that described above. Other information about a behind-the-firewall service offering can be found in U.S. Pat. No. 7,600,025, the teachings of which are hereby incorporated by reference.
As mentioned above, a CDN server can be programmed to prefetch content. For example, when a CDN server receives a request for the HTML (hypertext markup language) for a web page, the CDN server can fetch and deliver the HTML document to the client, and also parse the HTML to discover the embedded resources on that particular page. The CDN server can retrieve those embedded resources before receiving a request for them from the client. Prefetching in a CDN and site acceleration with content prefetching enabled through customer-specific configurations are described in U.S. Pat. No. 8,447,837, the teachings of which are hereby incorporated by reference.
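As a non-limiting illustration of the parsing step described above, the following sketch uses the Python standard library to discover embedded resources in an HTML document. Which tags count as embedded resources (here `img`, `script`, and stylesheet `link` elements) is an assumption made for the example, and the class and function names are hypothetical. The actual retrieval of the discovered URLs is omitted.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class EmbeddedResourceFinder(HTMLParser):
    """Collects absolute URLs of embedded resources (illustrative subset)."""

    # Tag -> attribute holding the resource reference (assumed subset).
    _SOURCES = {"img": "src", "script": "src"}

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: List[str] = []

    def handle_starttag(self, tag: str,
                        attrs: List[Tuple[str, Optional[str]]]) -> None:
        attr_map = dict(attrs)
        ref = None
        if tag in self._SOURCES:
            ref = attr_map.get(self._SOURCES[tag])
        elif tag == "link":
            rel_tokens = (attr_map.get("rel") or "").lower().split()
            if "stylesheet" in rel_tokens:
                ref = attr_map.get("href")
        if ref:
            url = urljoin(self.base_url, ref)  # resolve relative references
            if url not in self.urls:           # skip duplicates, keep order
                self.urls.append(url)


def find_embedded_resources(html: str, base_url: str) -> List[str]:
    finder = EmbeddedResourceFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.urls
```

A server using such a routine could issue requests for the returned URLs while the HTML is still being delivered to the client, so that the objects are already on hand when the client asks for them.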
Further, it is known in the prior art for a server in a CDN platform to prefetch objects, including further HTML documents, as it delivers an initial HTML document. This technique is based on “hints” or instructions given by an origin server in the HTML that it sends to a CDN server. For example, a content provider may designate certain uniform resource locators (URLs) in the HTML as prefetching candidates by inserting into certain tags (typically an <a/> tag or <link/> tag) an attribute to indicate that the URL should be prefetched. An example is to set the HTML ‘rel’ attribute to ‘prefetch’, which is supported by HTML specifications. Upon seeing these tags in the HTML returned from an origin server, the CDN servers prefetch the designated objects. This approach may be used for both cacheable and non-cacheable content. In this case, the origin server operator (the customer of the CDN) typically looks at analytics to decide what the next logical page is that an end-user might request, and updates their HTML with the tags to effect the feature. The foregoing prior art represents work of others.
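The hint-based approach described above can be illustrated with a short sketch that scans HTML for `a` and `link` tags whose `rel` attribute contains the token `prefetch`. The class and function names are hypothetical, and the sketch only extracts the candidate URLs; it does not fetch them.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PrefetchHintFinder(HTMLParser):
    """Collects href values from <a> and <link> tags marked rel=prefetch."""

    def __init__(self) -> None:
        super().__init__()
        self.hints: List[str] = []

    def handle_starttag(self, tag: str,
                        attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # 'rel' is a space-separated token list; match it case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "prefetch" in rel_tokens and href:
            self.hints.append(href)


def find_prefetch_hints(html: str) -> List[str]:
    finder = PrefetchHintFinder()
    finder.feed(html)
    finder.close()
    return finder.hints
```

For instance, given an HTML fragment containing `<link rel="prefetch" href="/next.html">`, the routine returns a list containing `/next.html`, while tags without the hint are ignored.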
Note that prefetching applies not only in CDNs and not only in intermediary devices (whether in a CDN or not), but elsewhere as well. Generally speaking, a server communicating with a requesting client may prefetch content from local disk to memory, from memory to a CPU memory cache, and/or from a remote storage device to local buffer, and the like.
Prefetching techniques can make a significant difference in the speed with which content is delivered to a client. However, determining what should be prefetched is not necessarily straightforward. This is particularly true given the complexity of modern websites and web pages, which are typically composed not only of embedded static content, but also rely on the client's execution of scripts and logic for generating, manipulating, and updating page elements, including that commonly referred to as AJAX. Improved prefetching techniques are therefore desirable. Further, applying prefetching techniques to a wide scope of content with minimal administrative cost and setup effort is desirable.
The teachings herein address these needs. They also provide other benefits and improvements to computer operation and content delivery that will become apparent in view of this disclosure.