A network is typically used for data transport among devices at network nodes distributed over the network. A node is defined as a connection in the network. Devices can be connected to network nodes by wires or wirelessly. Networks can be local area networks, which are physically limited in range, such as wired or wireless data networks in a campus or an office building, or wide-area networks employing public infrastructure such as the public switched telephone network or cellular data networks.
Data transport is often organized into transactions, wherein a device at one network node initiates a request for data from another device at another network node and the first device receives the data in a response from the other device. By convention, the initiator of a transaction is referred to herein as the client and the responder to the request from the client is referred to herein as the server.
In a client-server structured network operation, clients send requests to servers and the servers return data objects that correspond to those requests. A transaction might begin with a client at one node making a request for file data directed to a server at another node, followed by a delivery of a response containing the requested file data. In the Web environment, Web clients effect transactions with Web servers using the Hypertext Transfer Protocol (HTTP), which enables clients to access files (e.g., text, graphics, sound, images, video, etc.) using a standard page description language. The predominant markup language for Web pages is the Hypertext Markup Language (HTML). Markup language data streams typically include numerous references to embedded objects, which can be image, sound, or video files, or Web pages and components of those Web pages. Data objects might be identified by their uniform resource locator (URL). Generally, a URL is a character string identifying both the location of the site and a page of information at that site. For example, “http://www.riverbed.com” is a URL. Each Web site stores at least one page, and often substantially more. Pages, in this context, refer to content accessed via a URL.
A Web browser is a software application which enables a user to display and interact with text, images, videos, music and other information typically located on a Web page at a website on the World Wide Web or on a public or private local area network. Web browsers are the most commonly used type of HTTP user agent. Web browsers communicate with Web servers primarily using HTTP to fetch Web pages. The combination of HTTP content type and URL protocol specification allows Web page designers to embed objects such as images, videos, music and streaming media into a Web page. In practice, it is useful to distinguish between base pages and embedded objects. A user or program action (e.g., an HTTP request sent from an HTTP client) to fetch a particular URL from an HTTP server typically identifies only the base page, and that base page then typically contains some number of other links to embedded objects. Typical examples of such embedded objects are images, scripts, cascading style sheets, and the like. Logically, the request for the base page implicitly also requests the embedded objects. In implementation, the base page is fetched and that page contains the information required to fetch the embedded objects. The program processing the initial base page request (for example, a Web browser acting as an HTTP client) then uses the information in the base page to fetch the embedded objects. As these fetches are mostly performed serially over a few connections, they result in additional round trips to the server(s) providing the objects. Particularly in cases where the round-trip time (RTT) is high, these additional fetches ultimately lead to a poor end-user experience in which pages are displayed slowly or in a fragmented way.
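By way of illustration only, the relationship between a base page and its embedded objects, and the cost of fetching those objects serially, might be sketched as follows. The class and function names, the small set of tags examined, the example.com URL, and the one-round-trip-per-object delay model are all assumptions made for this sketch; an actual browser considers many more cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedObjectParser(HTMLParser):
    """Collects URLs of objects a client would fetch after the base page."""

    # tag -> attribute holding the embedded object's URL (illustrative subset)
    URL_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embedded = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return  # e.g. <a href=...> is a navigation link, not an embedded object
        attrs = dict(attrs)
        # Only stylesheet <link> elements are treated as embedded objects here.
        if tag == "link" and "stylesheet" not in (attrs.get("rel") or "").lower():
            return
        value = attrs.get(wanted)
        if value:
            # Resolve relative references against the base page's URL.
            self.embedded.append(urljoin(self.base_url, value))


def embedded_objects(base_url, html):
    """Return absolute URLs of the embedded objects referenced by a base page."""
    parser = EmbeddedObjectParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.embedded


def serial_fetch_delay(num_objects, connections, rtt_seconds):
    """Rough estimate: one round trip per object, spread over the connections."""
    rounds = -(-num_objects // connections)  # ceiling division
    return rounds * rtt_seconds
```

Under this simplified model, ten embedded objects fetched over two connections with a 100 ms RTT add roughly half a second after the base page arrives, which is the kind of delay the preceding paragraph describes.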
One possible approach to enhance user experience is to fetch the embedded objects at the same time as the base page fetch. For example, a proxy is placed between clients and servers and selectively preloads data for the clients. The proxy can watch and record patterns of interaction. When a client's fetches start to match a previously-seen pattern, the proxy can then play out the rest of the recorded pattern as a speculative effort to anticipate the client's future behavior. This might be implemented, for example, using the teachings of McCanne V in the context of web pages and HTTP.
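A minimal sketch of the record-and-replay idea described above is given below. It is not the method taught in McCanne V; the class name, the exact-prefix matching rule, and the example URLs are assumptions chosen only to make the idea concrete.

```python
class PatternReplayer:
    """Records request sequences and replays the remainder on a prefix match."""

    def __init__(self, min_match=1):
        self.patterns = []          # previously seen request sequences
        self.min_match = min_match  # requests that must match before speculating

    def record(self, sequence):
        """Remember a completed sequence of requested URLs."""
        seq = tuple(sequence)
        if seq and seq not in self.patterns:
            self.patterns.append(seq)

    def speculate(self, observed):
        """Return the URLs expected next, given the requests seen so far."""
        observed = tuple(observed)
        if len(observed) < self.min_match:
            return []
        for pattern in self.patterns:
            longer = len(pattern) > len(observed)
            if longer and pattern[:len(observed)] == observed:
                # Play out the rest of the recorded pattern.
                return list(pattern[len(observed):])
        return []
```

For instance, after recording the sequence `/index.html`, `/a.css`, `/b.js`, a later request for `/index.html` alone would cause this sketch to propose prefetching `/a.css` and `/b.js`.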
In some applications, the proxies function as performance-enhancing intermediaries between the clients and the servers. The proxies may transparently intercept, forward, modify, or otherwise transform the transactions as they flow from the client to the server and vice versa. Such proxies address the network throughput issue at the transport or application level, as described in McCanne III and McCanne IV. Such a solution should be compatible with acceleration for secure transports such as SSL, for example as described in Day.
There are other considerations, however. In order to determine which embedded objects should be fetched along with a base page, a proxy would need accurate knowledge of the association between the base page and its embedded objects. In an environment where a network/HTTP proxy receives a variety of HTTP traffic from different clients and servers, the proxy cannot easily establish an association between the embedded objects and their base pages. One reason is that base pages may contain many embedded objects, and some embedded objects are themselves Web pages that may further contain embedded objects (e.g., a directory listing). It can take substantial time to parse (analyze) and classify all of them.
Another reason is that HTTP is stateless, so logically each client/server interaction is distinct. When considering two HTTP requests from the same client, those two requests may be addressed to the same server or to different servers. They may be sequential (no intervening requests) or they may be separated (other intervening requests), and they may be related or unrelated. There is no reliable connection between these attributes, in that neither the rank order nor the identity of servers can be relied upon to determine which of these interactions are grouped together. Without some reliable form of grouping, it is difficult, if not impossible, to learn associations among requests and reuse those associations for subsequent prefetching.
There have been some attempts to solve such problems, such as through the use of caching, page parsing, Markov models, or other approaches.
With a caching approach, the content (page or object) associated with a particular URL is retained in storage (called a cache) near the client. The stored (cached) content is served from the cache when a matching URL is requested, rather than forwarding the request on to the server. While this works well when the matching URL refers to matching content, caching performs poorly when URLs refer to dynamic content. If the content associated with a URL changes, a cache may serve an old, incorrect version. This kind of error is sometimes referred to as a freshness or consistency problem.
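The consistency problem can be seen in a deliberately simple sketch of a cache keyed by URL. The class name, the `fetch_from_server` callable, and the `/news` URL are hypothetical and stand in for whatever mechanism forwards a request to the server.

```python
class UrlCache:
    """Serves content from local storage whenever the URL has been seen before."""

    def __init__(self, fetch_from_server):
        self._fetch = fetch_from_server  # callable: url -> content
        self._store = {}                 # url -> cached content

    def get(self, url):
        if url not in self._store:
            # Cache miss: forward the request to the server and keep the result.
            self._store[url] = self._fetch(url)
        # Cache hit: served locally, even if the server's copy has changed.
        return self._store[url]


# The server's content for a URL changes after it has been cached.
origin = {"/news": "version 1"}
cache = UrlCache(lambda url: origin[url])
first = cache.get("/news")
origin["/news"] = "version 2"
second = cache.get("/news")  # still the old copy held by the cache
```

In this sketch the second request is answered with the originally cached content rather than the server's updated content, which is the stale response described above.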
Various approaches to fix this problem attempt to set freshness intervals or explicit invalidations when content changes, but these have problems of their own. It is difficult to select good values for freshness timers, and any choice still forces a tradeoff between consistency and overhead. Explicit invalidation requires the resolution of difficult issues about control, autonomy, and scale, because a change at a server causes the discarding of many cached copies. In the limiting case, caching is simply useless for content where every fetch of a given URL yields a different value—such as a URL for a real-time clock. Nonetheless, it is important to be able to accelerate a complex page that includes one or more such embedded dynamic URLs.
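The tradeoff between consistency and overhead might be illustrated with a cache that applies a freshness interval. The sketch below is an assumption-laden toy: the class, the injected clock, the `simulate` helper, and its parameters (a page that changes every 10 seconds and is requested once per second for 60 seconds) are all hypothetical.

```python
class TtlCache:
    """URL cache whose entries are considered fresh for ttl_seconds."""

    def __init__(self, fetch_from_server, ttl_seconds, clock):
        self._fetch = fetch_from_server
        self._ttl = ttl_seconds
        self._clock = clock        # callable returning the current time in seconds
        self._store = {}           # url -> (content, time fetched)
        self.server_fetches = 0    # overhead: requests forwarded to the server

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still within the freshness interval
        content = self._fetch(url)
        self.server_fetches += 1
        self._store[url] = (content, now)
        return content

    def invalidate(self, url):
        """Explicit invalidation: discard the cached copy, if any."""
        self._store.pop(url, None)


def simulate(ttl_seconds, change_every=10, duration=60):
    """Return (server fetches, stale responses) for one request per second."""
    now = [0]

    def current_version(url=None):
        return now[0] // change_every  # server content changes periodically

    cache = TtlCache(current_version, ttl_seconds, clock=lambda: now[0])
    stale = 0
    for t in range(duration):
        now[0] = t
        if cache.get("/page") != current_version():
            stale += 1
    return cache.server_fetches, stale
```

With these assumed parameters, a 1-second interval forwards all 60 requests to the server and serves no stale content, while a 30-second interval forwards only 2 requests but serves 40 stale responses.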
With a page parsing approach, a proxy examines a base page as it is passing from the server to the client and simply follows links. In its simplest form, the proxy simply fetches all URLs found on the page. Some common refinements include parameters to control the depth or breadth of such prefetching, or the use of heuristics to focus additional prefetching effort on certain kinds of links while ignoring others.
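A page parsing prefetcher with depth and breadth parameters might look like the following sketch. The function and class names, the `fetch` callable (which returns page text or `None`), and the default limits are assumptions for illustration; the sketch follows every `href` and `src` attribute and applies no heuristics.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects every href and src value found on a page, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def prefetch(base_url, fetch, max_depth=1, max_per_page=10):
    """Fetch a base page and follow its links, bounded in depth and breadth.

    Returns a dict mapping each fetched URL to its content.
    """
    fetched = {}
    seen = {base_url}
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        body = fetch(url)
        if body is None:
            continue  # unreachable or missing object
        fetched[url] = body
        if depth >= max_depth:
            continue  # depth limit: do not parse pages this far from the base page
        collector = LinkCollector()
        collector.feed(body)
        # Breadth limit: follow at most max_per_page links from any one page.
        for link in collector.links[:max_per_page]:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return fetched
```

Raising `max_depth` or `max_per_page` in this sketch quickly increases the number of objects fetched, whether or not the client will ever request them.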
Simple page parsing systems are often worse than avoiding prefetching entirely, as they can prefetch vast quantities of irrelevant information, consuming network and server resources for little benefit. More sophisticated page parsing systems are complex collections of heuristics, and suffer from the usual problems of adaptation and maintainability for such systems. That is, at a certain level of complexity with multiple interacting heuristics, it becomes difficult to determine whether a new heuristic is actually improving performance. The complexity of the parsing process is also increasing over time, as HTML base pages increasingly use embedded objects such as cascading style sheets to control which parts of the page are presented and thus which other embedded objects need to be fetched.
With a Markov model approach and similar learning approaches, there is an assumption of repeating patterns of access, and the proxy may build statistical models over time to determine when the start of a previously-seen sequence is likely to match other previously-fetched URLs. However, because of the previously-mentioned statelessness of HTTP and the difficulty of grouping URLs at a proxy, many sequences of URLs seen at the proxy may represent meaningless differences in interleaving of repeating sequences. To successfully learn the sequences despite the changes in interleaving, a Markov model may require a very large state space and a correspondingly long learning time. In general, this brute-force approach is intractable, since the complexity of the learning increases exponentially with increases in the length of sequences and the number of interleaved sequences.
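The effect of interleaving on such a model can be illustrated with a first-order sketch. The class name, the symbolic URLs, and the choice of a first-order model are assumptions; the `interleavings` helper simply counts the order-preserving ways in which independent sequences of distinct requests can be merged.

```python
from collections import Counter, defaultdict
from math import factorial


class FirstOrderMarkov:
    """Counts which URL follows which, and predicts the most frequent follower."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # url -> Counter of next urls

    def observe(self, sequence):
        for current, following in zip(sequence, sequence[1:]):
            self.transitions[current][following] += 1

    def predict(self, url):
        followers = self.transitions.get(url)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


def interleavings(lengths):
    """Number of ways to merge sequences of the given lengths, keeping each in order."""
    total = factorial(sum(lengths))
    for n in lengths:
        total //= factorial(n)
    return total


# Two page loads observed separately: the model learns A -> A1.
separate = FirstOrderMarkov()
separate.observe(["A", "A1", "A2"])
separate.observe(["B", "B1", "B2"])

# The same two page loads observed interleaved at a proxy: the model learns A -> B.
mixed = FirstOrderMarkov()
mixed.observe(["A", "B", "A1", "B1", "A2", "B2"])
```

In this sketch, two sequences of five requests each can be merged in 252 distinct orders and three such sequences in 756,756, which suggests why learning every interleaving by brute force becomes impractical.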
In view of the above, what is needed is an improved approach for associating embedded objects with base pages that is usable in a proxy and more effective than prior approaches.