1. Technical Field
The invention relates to the exchange of information over an electronic network. More particularly, the invention relates to a proxy server caching mechanism that provides a file directory structure and a mapping mechanism within the file directory structure in an electronic network.
2. Description of the Prior Art
Modern information networks, e.g. the Internet, use servers to store documents. In the World Wide Web (Web), these documents are addressed by uniform resource locators (URLs). URLs specify the protocol by a prefix in the URL, such as http:// for HyperText Transfer Protocol, the host in the Internet where the document is stored, and the address of the document within that host. The Web is thus not a single protocol, but a combination of several protocols united by a common addressing scheme, i.e. the URL.
The tremendous continuing growth of the Web makes it necessary to have intermediate servers which perform caching, i.e. which store documents locally, such that the documents may be quickly accessed from the local file system instead of being retransferred from the original server. Such servers (see, for example, A. Luotonen, K. Altis, World-Wide Web Proxies, Proceedings of First International World-Wide Web Conference, Geneva 1994) are referred to as caching proxy servers, or proxies for short (see also A. Chakhuntod, P. Danzig, C. Neerdaels, M. Schwartz, K. Worrell, A Hierarchical Internet Object Cache, USENIX 1996 ANNUAL TECHNICAL CONFERENCE, http://usenix.org/publications/library/proceedings/sd96/danzig.html). Proxies reduce network load and shorten response times to the user.
FIG. 1 is a block schematic diagram of a proxy server 14. When a client 12 requests a new document from the proxy server 14, the proxy server copies the document from the origin server 16 to its local file system in addition to sending the document to the client 12. When another request arrives for the same document, the proxy server returns the document from the cache 15 if the cached copy is still up to date. If the proxy server determines that the document may be out of date, it performs an up-to-date check with the remote origin server and refreshes the document, if necessary, before sending it to the client 12.
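The request flow described above may be illustrated by the following minimal sketch. It is not taken from any particular proxy server. The names CachingProxy, fetch_from_origin, is_current_at_origin, and max_age_seconds are hypothetical, an in-memory dictionary stands in for the local file system, and the age-based rule for deciding that a copy "may be out of date" is an assumption made only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CacheEntry:
    body: bytes
    checked_at: float  # time the copy was stored or last confirmed current


class CachingProxy:
    """Illustrative only: a dict stands in for the proxy's local file system."""

    def __init__(
        self,
        fetch_from_origin: Callable[[str], bytes],           # hypothetical origin fetch
        is_current_at_origin: Callable[[str, bytes], bool],  # hypothetical up-to-date check
        max_age_seconds: float = 60.0,                        # assumed staleness rule
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch_from_origin
        self._is_current = is_current_at_origin
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache: Dict[str, CacheEntry] = {}

    def handle_request(self, url: str) -> bytes:
        now = self._clock()
        entry = self._cache.get(url)
        if entry is None:
            # New document: copy it from the origin server into the cache
            # in addition to returning it to the client.
            body = self._fetch(url)
            self._cache[url] = CacheEntry(body, now)
            return body
        if now - entry.checked_at <= self._max_age:
            # Cached copy is considered up to date: serve it locally.
            return entry.body
        # Copy may be out of date: check with the origin and refresh if needed.
        if not self._is_current(url, entry.body):
            entry.body = self._fetch(url)
        entry.checked_at = now
        return entry.body
```

In this sketch a repeated request within max_age_seconds is served without contacting the origin server, which is the source of the reduced network load and shorter response times noted above.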
Within a proxy server, an internal addressing mechanism is necessary to map the URLs to their location in the cache of the proxy server. Historically, the first caching proxy server, the "CERN httpd," mapped the URL directly to a UNIX file system path such that, for example:
http://home.netscape.com/some/file.html would become:
/cache-root/http/home.netscape.com/some/file.html.
This, however, was inefficient in several ways, including:
The path names for cached files could get excessively long, and thus the file itself was time-consuming to locate in the file system;
The directory holding the subdirectories corresponding to the second part of the URL (the host names) could get excessively large (i.e. thousands of entries). This is not very efficient in most commonly used operating systems, such as UNIX, because such operating systems must perform a time-consuming sequential search through the directory to actually locate the desired file; and
The maintenance of such a cache becomes excessively hard because the contents of the cache are location-dependent, and the material at a given location in the cache is not "average." Thus, deciding which files should be kept in the cache and which removed therefrom is difficult.
Another caching proxy server, the Harvest Cache Daemon (see A. Chakhuntod, P. Danzig, C. Neerdaels, M. Schwartz, K. Worrell, A Hierarchical Internet Object Cache, ibid.), does not use the URL as part of the mapping scheme, but simply assigns a random file name to each URL, and maintains a central file containing the mappings from URLs to file names. This approach also has limitations, such as, for example:
It is slow to start up the proxy server because it has to load the map file into memory. Such a process typically takes several minutes, thus exacerbating system latency;
It is wasteful of RAM (the main memory of the computer) because such a central map file can get very big; and
It is fragile because the entire cache becomes unusable if the map file is lost or damaged.
The ability to locate documents in the cache without the latency induced by long path names and large directories is very important. Another important aspect of cache design is making it easy to clean up old cached documents that are no longer necessary (i.e. garbage collection).
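The two prior-art mapping schemes discussed above may be contrasted in the following minimal sketch. It is not the code of the "CERN httpd" or of the Harvest Cache Daemon. The names cern_style_path and CentralMapCache are hypothetical, the random file name is produced with a UUID merely as one possible choice, and query strings and other URL details are ignored for brevity.

```python
import uuid
from typing import Dict
from urllib.parse import urlsplit


def cern_style_path(url: str, cache_root: str = "/cache-root") -> str:
    """Map a URL directly onto a file system path (direct-mapping scheme).

    The path grows with the URL, and every host name becomes an entry
    in one shared directory under the protocol name.
    """
    parts = urlsplit(url)
    return f"{cache_root}/{parts.scheme}/{parts.netloc}{parts.path}"


class CentralMapCache:
    """Assign a random file name per URL and keep one central map.

    The whole map must be held in memory, and the cached files cannot
    be located by URL if the map is lost.
    """

    def __init__(self, cache_root: str = "/cache-root") -> None:
        self.cache_root = cache_root
        self.url_to_file: Dict[str, str] = {}  # the central URL-to-file map

    def path_for(self, url: str) -> str:
        if url not in self.url_to_file:
            # Random name, unrelated to the URL (UUID is an assumption here).
            self.url_to_file[url] = uuid.uuid4().hex
        return f"{self.cache_root}/{self.url_to_file[url]}"


if __name__ == "__main__":
    url = "http://home.netscape.com/some/file.html"
    print(cern_style_path(url))
    print(CentralMapCache().path_for(url))
```

The first function derives the location entirely from the URL, so no map is needed but paths and directories can grow without bound. The second keeps paths short but makes every lookup depend on the central map.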
As described above, the "CERN httpd" proxy server cache has the undesirable property of being very location-specific, i.e. each directory could contain entirely different types of documents, and no given directory could be considered "average." For example, some directories might contain only GIF images, whereas others would contain only HyperText Markup Language (HTML) files or PostScript files. Thus, when the "garbage collector" is traversing the cache structure, it is impossible to know beforehand how much data and how many documents are in the cache because the structure of the cache is not known until the cache is entirely traversed. Therefore, it is hard to make effective decisions about which documents should be kept and which should be removed.
It would be advantageous to provide a proxy server cache structure that stores and accesses documents in an optimum manner in a storage hierarchy that is easily managed.