The present invention relates generally to networking technology. More specifically, the present invention relates to the caching of data objects to accelerate access to, for example, the World Wide Web. Still more specifically, the present invention provides methods and apparatus by which a network cache may be populated when initially deployed.
Generally speaking, when a client platform communicates with some remote server, whether via the Internet or an intranet, it crafts a data packet which defines a TCP connection between the two hosts, i.e., the client platform and the destination server. More specifically, the data packet has headers which include the destination IP address, the destination port, the source IP address, the source port, and the protocol type. The destination IP address might be the address of a well known World Wide Web (WWW) search engine such as, for example, Yahoo, in which case the protocol would be TCP and the destination port would be port 80, a well known port for HTTP and the WWW. The source IP address would, of course, be the IP address for the client platform and the source port would be one of the TCP ports selected by the client. These five pieces of information define the TCP connection.
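By way of illustration only, the five header fields described above may be modeled as a single immutable record that serves as the key identifying a connection. The following is a minimal sketch; the name FiveTuple, its field names, and the addresses (taken from ranges reserved for documentation) are hypothetical and are not part of any described system.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """The five header fields that together identify one TCP connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


# Hypothetical values: a client talking to a web server on port 80.
conn = FiveTuple("192.0.2.10", 49152, "203.0.113.5", 80, "TCP")
same = FiveTuple("192.0.2.10", 49152, "203.0.113.5", 80, "TCP")
# Changing only the source port yields a different connection.
other = FiveTuple("192.0.2.10", 49153, "203.0.113.5", 80, "TCP")

# Because the tuple is hashable, it can key a table of open connections.
connections = {conn: "open"}
```

Two packets carrying identical values in all five fields belong to the same connection; a difference in any one field, such as the source port, denotes a separate connection.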
Given the increase of traffic on the World Wide Web and the growing bandwidth demands of ever more sophisticated multimedia content, there has been constant pressure to find more efficient ways to service data requests than opening direct TCP connections between a requesting client and the primary repository for the desired data. Interestingly, one technique for increasing the efficiency with which data requests are serviced came about as the result of the development of network firewalls in response to security concerns. In the early development of such security measures, proxy servers were employed as firewalls to protect networks and their client machines from corruption by undesirable content and unauthorized access from the outside world. Proxy servers were originally based on Unix machines because that was the prevalent technology at the time. This model was generalized with the advent of SOCKS which was essentially a daemon on a Unix machine. Software on a client platform on the network protected by the firewall was specially configured to communicate with the resident daemon which then made the connection to a destination platform at the client's request. The daemon then passed information back and forth between the client and destination platforms acting as an intermediary or "proxy".
Not only did this model provide the desired protection for the client's network, it gave the entire network the IP address of the proxy server, therefore simplifying the problem of addressing of data packets to an increasing number of users. Moreover, because of the storage capability of the proxy server, information retrieved from remote servers could be stored rather than simply passed through to the requesting platform. This storage capability was quickly recognized as a means by which access to the World Wide Web could be accelerated. That is, by storing frequently requested data, subsequent requests for the same data could be serviced without having to retrieve the requested data from its original remote source. Currently, most Internet service providers (ISPs) accelerate access to their web sites using proxy servers.
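The store-and-reuse behavior described above may be sketched as follows. This is an illustrative simplification rather than the implementation of any particular proxy server: the class CachingProxy, the function fake_origin, and the URL are hypothetical, and the sketch ignores expiration, storage limits, and objects that must not be cached.

```python
class CachingProxy:
    """Returns a stored copy of an object when one exists."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable that contacts the remote server
        self._store = {}                 # url -> previously retrieved content
        self.origin_fetches = 0          # number of trips to the remote server

    def get(self, url):
        if url not in self._store:
            # First request: retrieve from the original source and keep a copy.
            self._store[url] = self._fetch(url)
            self.origin_fetches += 1
        # Subsequent requests are serviced from the stored copy.
        return self._store[url]


def fake_origin(url):
    """Stand-in for a remote server (hypothetical)."""
    return f"content of {url}"


proxy = CachingProxy(fake_origin)
first = proxy.get("http://example.com/")
second = proxy.get("http://example.com/")
```

In this sketch the second request returns the same content as the first without a second retrieval from the remote source.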
A similar idea led to the development of network caching systems. Network caches are employed near the router of a network to accelerate access to the Internet for the client machines on the network. An example of such a system is described in commonly assigned, copending U.S. Pat. application Ser. No. 08/946,867 for METHOD AND APPARATUS FOR FACILITATING NETWORK DATA TRANSMISSIONS filed on Oct. 8, 1997, the entire specification of which is incorporated herein by reference for all purposes. Such a cache typically stores the data objects which are most frequently requested by the network users and which do not change too often. Network caches can provide a significant improvement in the time required to download objects to the individual machines, especially where the user group is relatively homogeneous with regard to the type of content being requested. The efficiency of a particular caching system is represented by a metric called the "hit ratio" which is a ratio of the number of requests for content satisfied by the cache to the total number of requests for content made by the users of the various client machines on the network. The hit ratio of a caching system is high if its "working set", i.e., the set of objects stored in the cache, closely resembles the content currently being requested by the user group.
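The hit ratio defined above may be computed as in the following minimal sketch. The function name hit_ratio and the sample request paths are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def hit_ratio(requests, working_set):
    """Requests satisfied by the cache divided by total requests."""
    if not requests:
        return 0.0  # no requests yet, so report zero rather than divide by zero
    hits = sum(1 for url in requests if url in working_set)
    return hits / len(requests)


# Hypothetical traffic: five requests, four of which are for cached objects.
requests = ["/a", "/b", "/a", "/c", "/a"]
working_set = {"/a", "/b"}

print(hit_ratio(requests, working_set))  # 4 hits out of 5 requests -> 0.8
```

The closer the working set matches what the user group is requesting, the closer this value approaches 1.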
Unfortunately, with currently available caching systems, the performance improvement promised by providers of such systems is not immediate due to the fact that when a cache is initially connected to a router it is unpopulated, i.e., empty. Given the size of the typical cache, e.g., greater than 20 gigabytes, and depending upon the frequency of Internet access of a given user group, it can take several days for a cache to be populated to a level at which an improvement in access time becomes apparent. In fact, while the cache is being populated additional latency is introduced due to the detour through the cache.
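The cold-start effect described above may be illustrated with a small simulation that measures the hit ratio over successive windows of requests, beginning with an empty cache. This is a sketch under simplifying assumptions: the function windowed_hit_ratios and the request sequence are hypothetical, the cache is treated as unbounded, and every missed object is stored on first retrieval.

```python
def windowed_hit_ratios(requests, window):
    """Hit ratio per window of requests, starting from an empty cache."""
    cache = set()  # unpopulated at deployment
    ratios = []
    for start in range(0, len(requests), window):
        chunk = requests[start:start + window]
        hits = 0
        for url in chunk:
            if url in cache:
                hits += 1
            else:
                cache.add(url)  # miss: retrieve from the origin and store
        ratios.append(hits / len(chunk))
    return ratios


# Hypothetical traffic in which the same few objects recur over time.
requests = ["/a", "/b", "/c", "/a", "/b", "/d", "/a", "/b", "/c"]
ratios = windowed_hit_ratios(requests, 3)
```

In this sketch the first window produces no hits at all because nothing has yet been stored, and the ratio rises only as the cache fills from live traffic.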
From the customer's perspective, this apparent lack of results in the first few days after installing a caching system can be frustrating and often leads to the assumption that the technology is not operating correctly. To address this problem, providers of caching systems have attempted to populate the cache before bringing the system on line by using previous caching logs, i.e., "squid" logs, to develop the working set for the system. However, this presents the classic "chicken and egg" conundrum in that the first time a caching system is deployed for a particular network there are no previous caching logs for that network.
Another method of populating a caching system employs a web scavenging robot which polls the client machines on the network to determine what content has been previously requested. Unfortunately, this can be a relatively slow process which consumes network resources to an undesirable degree. This process also requires a good knowledge of what type of content the users of interest typically browse.
It is therefore apparent that there is a need for techniques by which caching systems may be quickly and transparently populated when they are initially deployed.