The present invention relates to transmission of data in a network environment. More specifically, the present invention relates to methods and apparatus for improving the efficiency with which data are transmitted over the Internet. Even more specifically, the invention relates to a file system designed with the special requirements of a cache memory in mind.
Currently, most Internet service providers (ISPs) employ various techniques, such as caching proxy servers, to accelerate access to the World Wide Web. Having evolved from early firewall models, caching proxy servers service data requests intended for the Web using data stored locally from previous similar requests. For historical reasons, the file system employed by such a proxy server is a Unix-based general purpose file system. Unfortunately, while Unix-based file systems are extremely versatile and useful in the applications for which they were designed, they are not optimized for the caching function and, in fact, represent the primary traffic bottleneck in caching proxy servers.
The basic requirements of a Unix-based file system are that its files are persistent and mutable. Persistence means that, once a file is created, the user can reasonably anticipate that the file will remain there forever, or at least until the user explicitly deletes it. Mutability means that a file, once created, may be changed at any time. By contrast, the basic requirements of a caching file system are just the opposite. Because there is a finite amount of memory in which the most recently and frequently requested objects must be stored, older files become obsolete. Thus, files in a caching file system should be transient. Moreover, objects stored in a caching file system are downloaded and then read a number of times (many times if the cache is operating efficiently) before being deleted or overwritten, but they are never modified. Thus, files in a caching file system should be immutable.
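The transient and immutable behavior described above can be illustrated with a minimal sketch. It is not a description of any particular file system: the class name TransientStore, the byte-count capacity, and the choice of least-recently-used eviction are all assumptions made for illustration. Objects are written whole, read many times, and dropped when space is needed; there is deliberately no operation for modifying an object in place.

```python
from collections import OrderedDict


class TransientStore:
    """Hypothetical in-memory store: objects are transient and immutable."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used = 0
        # Ordered from least recently used to most recently used.
        self._objects = OrderedDict()

    def put(self, key, data):
        """Store a whole object, evicting old objects if space is short."""
        data = bytes(data)  # bytes is immutable, so stored objects cannot change
        if len(data) > self.capacity_bytes:
            return False  # object can never fit; do not cache it
        # Overwriting replaces the whole object; nothing is edited in place.
        old = self._objects.pop(key, None)
        if old is not None:
            self.used -= len(old)
        # Transience: older objects give way to newer ones (LRU is an assumption).
        while self.used + len(data) > self.capacity_bytes:
            _, evicted = self._objects.popitem(last=False)
            self.used -= len(evicted)
        self._objects[key] = data
        self.used += len(data)
        return True

    def get(self, key):
        """Return the object, or None on a miss; a hit marks it as recent."""
        data = self._objects.get(key)
        if data is not None:
            self._objects.move_to_end(key)
        return data
```

In this sketch a caller that receives None from get would fetch the object from its origin and call put; an object that has been evicted is simply fetched again, which is the sense in which cached files need not persist.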
The Unix file system was designed with the expectation that a relatively small number of relatively large files would be stored within its multilevel hierarchy. This is in contrast to the millions of small files, e.g., multimedia objects, typically stored in a caching proxy server. The small number of files anticipated meant, in turn, that a correspondingly small number of "creates" were expected. Because only a small amount of time overhead was anticipated to be associated with the "create file" function, it was not optimized for speed. By contrast, caching proxy servers create files at a rate well beyond that anticipated for a Unix-based system, e.g., 1000 creates/second. Therefore, the create function must be fast.
The multilevel directory structure itself has a rather large overhead, consuming about 20% of available storage. This represents an inefficiency in a caching file system which, by its very nature, needs to make use of as much of its memory as possible for the storage of cache objects. In addition, because of the multilevel nature of the directory structure, significant time is required to traverse the directory levels when searching for files. This overhead becomes problematic given the number of transactions per unit time typical for a caching proxy server. The overhead is also problematic in that the Unix-based file system is scalable only to about 20-25 gigabytes, at which point it becomes unacceptably unwieldy and slow. This is unfortunate because current Internet traffic is already pushing this limit and it is obviously desirable to increase cache capacity as traffic increases.
Finally, an extremely important feature of a caching file system, i.e., garbage collection, is not typically provided in a Unix-based file system. Rather, in such a file system it is assumed that the user will delete files and optimize the disk as needed. Therefore, if a Unix-based file system is to be used as a caching file system, a separate process must be designed with the capability of repeatedly searching through the complex directory structure and identifying and flushing obsolete files. This garbage collection process must not begin flushing objects too soon or the disk space will be underutilized and the cache hit rate correspondingly lowered. On the other hand, the process must not wait too long to begin garbage collection because the performance of the file system bogs down as the disk becomes full. Thus, not only must a separate process be designed to perform the necessary garbage collection function, it must be an extremely sophisticated process, thereby significantly increasing the already large overhead associated with the file system.
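The kind of separate garbage collection process described above can be sketched as follows. This is a simplified illustration under stated assumptions, not an account of any existing proxy server: the function names, the use of last-access time to decide which files are obsolete, and the high and low threshold fractions (0.90 and 0.75) are all hypothetical. The two thresholds express the trade-off in the text: nothing is flushed until the disk is nearly full, and flushing then continues until enough space has been recovered.

```python
import os


def scan(root):
    """Walk every directory level; return (last_access, size, path) per file."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            entries.append((st.st_atime, st.st_size, path))
    return entries


def collect_garbage(root, capacity_bytes, high=0.90, low=0.75):
    """Flush the least recently accessed files once usage reaches `high`.

    `high` and `low` are assumed fractions of `capacity_bytes`. Returns the
    list of paths that were removed (empty if collection did not start).
    """
    entries = scan(root)  # the full traversal is repeated on every pass
    used = sum(size for _atime, size, _path in entries)
    if used < high * capacity_bytes:
        return []  # starting earlier would leave disk space underutilized
    removed = []
    for _atime, size, path in sorted(entries):  # oldest access time first
        if used <= low * capacity_bytes:
            break
        os.remove(path)
        used -= size
        removed.append(path)
    return removed
```

Even this simplified version must stat every file in the tree on each pass, which suggests why such a process adds significant overhead when the cache holds millions of small files.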
The Unix-based file system currently employed by ISP proxy servers incorporates file characteristics (persistence and mutability) which are opposed to the requirements of a caching file system. It is based on a multilevel directory structure which is problematic for the caching function in a number of respects. Finally, because it does not provide garbage collection, a separate, highly complex process must be designed. It is therefore apparent that a file system is desirable which is more in line with the requirements of the caching function.