The present invention relates generally to networking technology. More specifically, the present invention relates to the caching of data objects to accelerate access to, for example, the World Wide Web. Still more specifically, the present invention provides methods and apparatus by which a network cache may be populated when initially deployed.
Generally speaking, when a client platform communicates with some remote server, whether via the Internet or an intranet, it crafts a data packet which defines a TCP connection between the two hosts, i.e., the client platform and the destination server. More specifically, the data packet has headers which include the destination IP address, the destination port, the source IP address, the source port, and the protocol type. The destination IP address might be the address of a well known World Wide Web (WWW) search engine such as, for example, Yahoo, in which case, the protocol would be TCP and the destination port would be port 80, a well known port for http and the WWW. The source IP address would, of course, be the IP address for the client platform and the source port would be one of the TCP ports selected by the client. These five pieces of information define the TCP connection.
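The five header fields described above can be modeled as a simple record. The following is a minimal sketch for illustration only; the name `FlowKey`, its field names, and the sample addresses (taken from reserved documentation ranges) are assumptions and do not appear in the specification.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical record of the five values that define a TCP connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


# A client requesting a web page: the destination port is 80 (HTTP),
# and the source port is one selected by the client.
request = FlowKey("192.0.2.10", 49152, "198.51.100.7", 80, "TCP")

# Two packets belong to the same connection only if all five fields match.
same = request == FlowKey("192.0.2.10", 49152, "198.51.100.7", 80, "TCP")
other = request == FlowKey("192.0.2.10", 49153, "198.51.100.7", 80, "TCP")
```

Here `same` is true and `other` is false, because a different source port identifies a different connection even between the same two hosts.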
Given the increase of traffic on the World Wide Web and the growing bandwidth demands of ever more sophisticated multimedia content, there has been constant pressure to find more efficient ways to service data requests than opening direct TCP connections between a requesting client and the primary repository for the desired data. Interestingly, one technique for increasing the efficiency with which data requests are serviced came about as the result of the development of network firewalls in response to security concerns. In the early development of such security measures, proxy servers were employed as firewalls to protect networks and their client machines from corruption by undesirable content and unauthorized access from the outside world. Proxy servers were originally based on Unix machines because that was the prevalent technology at the time. This model was generalized with the advent of SOCKS, which was essentially a daemon on a Unix machine. Software on a client platform on the network protected by the firewall was specially configured to communicate with the resident daemon, which then made the connection to a destination platform at the client's request. The daemon then passed information back and forth between the client and destination platforms, acting as an intermediary or “proxy”.
Not only did this model provide the desired protection for the client's network, it gave the entire network the IP address of the proxy server, thereby simplifying the problem of addressing data packets to an increasing number of users. Moreover, because of the storage capability of the proxy server, information retrieved from remote servers could be stored rather than simply passed through to the requesting platform. This storage capability was quickly recognized as a means by which access to the World Wide Web could be accelerated. That is, by storing frequently requested data, subsequent requests for the same data could be serviced without having to retrieve the requested data from its original remote source. Currently, most Internet service providers (ISPs) accelerate access to their web sites using proxy servers.
A similar idea led to the development of network caching systems. Network caches are employed near the router of a network to accelerate access to the Internet for the client machines on the network. An example of such a system is described in commonly assigned, copending U.S. patent application Ser. No. 08/946,867 for METHOD AND APPARATUS FOR FACILITATING NETWORK DATA TRANSMISSIONS filed on Oct. 8, 1997, the entire specification of which is incorporated herein by reference for all purposes. Such a cache typically stores the data objects which are most frequently requested by the network users and which do not change too often. Network caches can provide a significant improvement in the time required to download objects to the individual machines, especially where the user group is relatively homogeneous with regard to the type of content being requested. The efficiency of a particular caching system is represented by a metric called the “hit ratio”, which is the ratio of the number of requests for content satisfied by the cache to the total number of requests for content made by the users of the various client machines on the network. The hit ratio of a caching system is high if its “working set”, i.e., the set of objects stored in the cache, closely resembles the content currently being requested by the user group.
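The hit ratio defined above is a simple quotient. The following is a minimal sketch of the calculation; the function name `hit_ratio`, the handling of the zero-request case, and the sample figures are assumptions made for illustration.

```python
def hit_ratio(cache_hits: int, total_requests: int) -> float:
    """Fraction of content requests that were satisfied by the cache."""
    if total_requests == 0:
        # Assumption: report 0.0 rather than raising when nothing was requested.
        return 0.0
    return cache_hits / total_requests


# Hypothetical figures: 250 of 1000 requests were served from the cache.
ratio = hit_ratio(250, 1000)

# A newly connected, unpopulated cache satisfies no requests at all.
empty_cache_ratio = hit_ratio(0, 1000)
```

With these sample figures, `ratio` is 0.25 and `empty_cache_ratio` is 0.0.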
Unfortunately, with currently available caching systems, the performance improvement promised by providers of such systems is not immediate because a cache is unpopulated, i.e., empty, when it is initially connected to a router. Given the size of the typical cache, e.g., greater than 20 gigabytes, and depending upon the frequency of Internet access of a given user group, it can take several days for a cache to be populated to a level at which an improvement in access time becomes apparent. In fact, while the cache is being populated, additional latency is introduced due to the detour through the cache.
From the customer's perspective, this apparent lack of results in the first few days after installing a caching system can be frustrating and often leads to the assumption that the technology is not operating correctly. To address this problem, providers of caching systems have attempted to populate the cache before bringing the system on line by using previous caching logs, i.e., “squid” logs, to develop the working set for the system. However, this presents the classic “chicken and egg” conundrum in that the first time a caching system is deployed for a particular network there are no previous caching logs for that network.
Another method of populating a caching system employs a web scavenging robot which polls the client machines on the network to determine what content has been previously requested. Unfortunately, this can be a relatively slow process which consumes network resources to an undesirable degree. This process also requires a good knowledge of what type of content the users of interest typically browse.
It is therefore apparent that there is a need for techniques by which caching systems may be quickly and transparently populated when they are initially deployed.
According to the present invention, methods and apparatus are provided by which a caching system may be populated quickly before its deployment. The techniques described herein employ a capability inherent in most routers to develop a working set of data objects which are then retrieved to populate the cache. The router to which the caching system is to be connected is configured to log information regarding the destinations from which network users are requesting information, i.e., net flow statistics. According to a specific embodiment, this information is then parsed to get a list of destinations corresponding to a specific port, e.g., port 80, or a group of IP addresses. These destinations are then sorted according to the frequency with which they are requested. The top N destinations are then selected for populating the cache. Cacheable objects from those destinations are then retrieved and stored in the cache. The process of retrieving and storing this data takes only a few hours. Moreover, a system administrator can configure the network router to collect the necessary traffic flow data in advance of purchasing the caching system so that, once the system is delivered, it can be populated and deployed immediately.
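The parsing, sorting, and selection steps described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: the comma-separated log layout, the column names, the function name `top_destinations`, and the sample addresses are all hypothetical and do not represent any actual router's flow export format. Retrieval of the cacheable objects from the selected destinations is omitted.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical flow log; a real router's export format will differ.
SAMPLE_LOG = """src_ip,dst_ip,dst_port,protocol
10.0.0.1,198.51.100.7,80,TCP
10.0.0.2,198.51.100.7,80,TCP
10.0.0.3,203.0.113.5,80,TCP
10.0.0.1,198.51.100.7,80,TCP
10.0.0.2,203.0.113.5,80,TCP
10.0.0.3,192.0.2.99,80,TCP
10.0.0.1,192.0.2.50,25,TCP
"""


def top_destinations(flow_log, port=80, n=2):
    """Return the n destinations most often requested on the given port."""
    counts = Counter()
    for row in csv.DictReader(flow_log):
        # Keep only flows to the port of interest, e.g., port 80.
        if int(row["dst_port"]) == port:
            counts[row["dst_ip"]] += 1
    # most_common sorts destinations by descending request frequency.
    return [dst for dst, _ in counts.most_common(n)]


# The top N destinations would then be used to populate the cache.
selected = top_destinations(StringIO(SAMPLE_LOG), port=80, n=2)
```

For this sample log, `selected` contains 198.51.100.7 (three requests) followed by 203.0.113.5 (two requests); the flow to port 25 is ignored.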
According to another embodiment, before beginning operation as a cache, the caching system automatically configures the router to log the traffic flow data after which it analyzes the data and retrieves the appropriate data objects. Once populated it enables itself to perform the caching function.
Thus, the present invention provides methods and apparatus for populating a network cache. A router associated with the cache is enabled to compile flow data relating to object traffic. The flow data are analyzed to determine a first plurality of frequently requested objects. The network cache is populated with the first plurality of frequently requested objects. Subsequent to populating the network cache, the network cache is operated in conjunction with the router to cache a second plurality of requested objects.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.