This invention relates generally to computer data storage and retrieval, and more particularly to the indexing and retrieval of objects stored in a cache.
A cache is an amount of data storage space that is used to hold recently accessed data to allow quick access in subsequent references. The retrieval of data from the cache is typically significantly faster than accessing the data source from which the data in the cache were originally obtained. By storing recently accessed data in the cache, the data can be retrieved and made available quickly the next time they are requested. Data caching is one of the most fundamental concepts in computer science and has been widely applied in applications where it is desired to minimize the data access time. The effective implementation of a cache system is, however, not a simple matter, especially when the cache space and the number of cached data objects become very large.
For instance, in the context of accessing information available from the World Wide Web ("WWW") on the Internet, it is common for a proxy server of a private network to cache data objects downloaded from various Websites in response to requests by computers on the private network. When the proxy server receives a request from a computer on the private network for a particular data object, it checks the cache (often referred to as the "cache unit") to see whether the requested object is already in the cache. If the requested object is not in the cache, the proxy server forwards the request on to the Internet so the requested data object can be downloaded from a Website. On the other hand, if the requested data object is found in the cache (which is called a "cache hit"), it is retrieved from the cache and sent to the requesting computer. In this way, the need to keep the user waiting while the requested data object is being downloaded through the Internet is avoided. Since the speed of downloading data objects from Websites can be very slow, a properly implemented cache unit can significantly reduce the average amount of time for a user to receive a requested data object, thereby providing a significantly improved user experience.
The performance requirements on the cache unit, however, can be very high. For instance, a typical implementation of a cache unit may be expected to hold up to 50 million data objects. With such a large number of data objects, it can become very difficult to control the amount of resources required for implementing the various components of the cache system or to guarantee the adequate performance of the caching operation.
For instance, like many database systems, a cache system typically sets up an index for the cached objects to allow the identification and retrieval of the cached objects. A conventional indexing scheme typically has one index entry for each of the cached objects, and the entry contains a key that identifies the object and typically includes other data describing the object. For instance, in the context of caching objects downloaded from the Internet, the key for a downloaded object is the object's URL (Uniform Resource Locator), the size of which may vary from a few bytes to several hundred bytes or more. The average size of the URLs of the downloaded objects has been observed to be around 50 bytes at the present state of Internet usage, but is expected to grow larger as more and more content is put on the Internet.
When the number of objects is large and the data describing each object is relatively large, the index can take up a large amount of storage space. For example, if the cache is expected to hold 50 million downloaded objects and the average URL size is about 50 bytes, about 2.5 gigabytes will be required just for storing the keys of the cached objects. Moreover, semaphore objects (or alternative synchronization objects) are typically provided for controlling access to the cached objects. The use of semaphore objects further increases the memory space requirement of the conventional indexing scheme.
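The storage estimate above can be checked with a few lines of arithmetic. The following is only a worked illustration of the figures quoted in the text (50 million objects, about 50 bytes per URL); the variable names are chosen here for illustration, and the result is expressed in decimal gigabytes.

```python
# Worked arithmetic for the key-storage estimate quoted in the text.
# The variable names are illustrative and do not come from the source.
NUM_CACHED_OBJECTS = 50_000_000   # 50 million cached objects
AVG_URL_BYTES = 50                # observed average URL size in bytes

# Total bytes needed just to store one full key per cached object.
key_bytes = NUM_CACHED_OBJECTS * AVG_URL_BYTES
key_gigabytes = key_bytes / 10**9  # decimal gigabytes

print(f"{key_bytes} bytes = {key_gigabytes} GB for keys alone")
```

The figure covers only the keys; any per-entry metadata or synchronization objects would add to it.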
Conventional cache systems that support such a large scale of operation would have to store the index onto a hard disk or the like because of the large amount of storage space required for the index. Putting the index on a disk, however, has the significant disadvantage that extra disk I/O operations are required to access the index. For instance, when a request for a data object is received, a disk I/O operation will be performed to read the index, which returns either a found object (i.e., a cache hit) or an indication that the object is not found (i.e., a cache miss). For a cache system that aims at high performance, it is necessary to minimize the average number of I/O operations for each object search and retrieval, and adding an I/O operation for each index search may not be an acceptable option.
In view of the foregoing, the present invention provides a system and method for indexing and retrieving objects stored in a cache on a persistent medium (e.g., a disk) that introduces the concepts of probable hits and asynchronous retrieval of cached objects during a search. An index according to the invention is much smaller than a conventional index such that it can be stored in the computer memory to avoid any additional I/O operations. Rather than having a separate entry in the index for each cached object and storing a full key (e.g., a URL) in each entry, each index entry is used as a "bucket" for holding object references, including access information (e.g., pointers), for cached objects corresponding to that index entry. Specifically, each index entry has an index entry identification ("ID"), and any cached object corresponding to that entry has a key that when operated on by a predefined lossy compression mechanism results in the index entry ID. Because the compression mechanism is lossy, there may be multiple cached objects that correspond to a given index entry.
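A minimal sketch of such a bucket index is given below for illustration. It rests on several assumptions that are not taken from the text: the lossy compression mechanism is modeled as a CRC-32 checksum reduced modulo the number of buckets, each object reference is modeled as a disk offset and a length, and all names (`compress_key`, `ObjectRef`, `BucketIndex`) are hypothetical.

```python
import zlib
from dataclasses import dataclass


def compress_key(url, num_buckets):
    """Lossy compression of a key into an index entry ID.

    Assumption: CRC-32 modulo the bucket count stands in for the
    unspecified 'predefined lossy compression mechanism'.
    """
    return zlib.crc32(url.encode("utf-8")) % num_buckets


@dataclass
class ObjectRef:
    """Hypothetical access information for one cached object."""
    disk_offset: int
    length: int


class BucketIndex:
    """In-memory index whose entries are buckets of object references.

    No full key (URL) is stored in the index; only references are kept.
    """

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.buckets = {}  # index entry ID -> list of ObjectRef

    def add(self, url, ref):
        entry_id = compress_key(url, self.num_buckets)
        self.buckets.setdefault(entry_id, []).append(ref)

    def candidates(self, url):
        """Return references of cached objects that share this key's ID."""
        return self.buckets.get(compress_key(url, self.num_buckets), [])


# With a single bucket every key collides, which makes the lossy
# behavior easy to see: two different URLs land in the same entry.
index = BucketIndex(num_buckets=1)
index.add("http://example.com/a", ObjectRef(disk_offset=0, length=100))
index.add("http://example.com/b", ObjectRef(disk_offset=100, length=200))
print(len(index.candidates("http://example.com/a")))
```

Because the index holds only short references rather than full URLs, its memory footprint depends on the number of references and not on the key length.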
When the cache manager receives a request for a data object, it checks whether the requested object is already stored in the cache. To that end, the key of the requested object is compressed to get the ID of the relevant index entry, and that index entry is checked to see if there is any cached object in that "bucket". Finding an empty bucket indicates that the requested object is not in the cache (i.e., a cache miss). If the bucket contains object reference information for one or more cached objects, the cached objects are checked to see whether one of them is the requested object. If a cached object in the bucket is possibly the requested object, an asynchronous I/O operation is performed to retrieve that cached object to see whether its key matches the key of the requested object. Because the retrieval is performed asynchronously, the thread that makes the request to retrieve a cached object in the cache does not have to wait for the completion of the I/O read operation before it can turn to other tasks.
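The lookup flow described above might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the persistent medium is simulated by an in-memory dictionary, the asynchronous read is simulated with `asyncio`, the lossy compression is modeled as CRC-32 modulo a hypothetical bucket count, and every name in the block is invented for the example.

```python
import asyncio
import zlib

NUM_BUCKETS = 1024  # hypothetical bucket count

DISK = {}   # simulated persistent medium: offset -> (full key, payload)
INDEX = {}  # in-memory index: index entry ID -> list of offsets


def compress_key(url):
    # Assumption: CRC-32 modulo NUM_BUCKETS models the lossy compression.
    return zlib.crc32(url.encode("utf-8")) % NUM_BUCKETS


def store(url, payload):
    """Write an object to the simulated disk and reference it in the index."""
    offset = len(DISK)
    DISK[offset] = (url, payload)
    INDEX.setdefault(compress_key(url), []).append(offset)


async def read_from_disk(offset):
    """Stand-in for an asynchronous disk read; yields to other tasks."""
    await asyncio.sleep(0)
    return DISK[offset]


async def lookup(url):
    """Return the cached payload, or None on a cache miss."""
    offsets = INDEX.get(compress_key(url), [])
    if not offsets:
        return None  # empty bucket: definite miss, no disk I/O needed
    for offset in offsets:  # each reference is only a probable hit
        stored_key, payload = await read_from_disk(offset)
        if stored_key == url:  # full-key comparison confirms the hit
            return payload
    return None  # the probable hits all turned out to be other objects


store("http://example.com/a", b"object A")
hit = asyncio.run(lookup("http://example.com/a"))
miss = asyncio.run(lookup("http://example.com/not-cached"))
print(hit, miss)
```

Only a non-empty bucket triggers a read, so a definite miss costs no disk I/O, while a probable hit is confirmed by comparing the full key stored alongside the object.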
To further reduce the amount of memory space required for implementing the index, a lightweight synchronization scheme is used instead of conventional semaphore objects. This scheme uses a short state field (e.g., 2 bytes) in each index entry that is set to indicate whether the index entry is involved in read/write operations.
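One way such a state field could work is sketched below. The text specifies only a short state field (e.g., 2 bytes) that indicates whether the entry is involved in read/write operations; the bit layout used here (one writer bit plus a 15-bit reader count) and all function names are assumptions made for illustration. A single shared lock stands in for the atomic compare-and-swap that a native implementation would likely use, so that no per-entry semaphore object is needed.

```python
import threading

# Assumed layout of the 16-bit state field (not specified in the text):
WRITE_BIT = 0x8000    # set while a write operation owns the entry
READER_MASK = 0x7FFF  # low 15 bits count the active readers

# One shared lock for all entries; it stands in for an atomic
# compare-and-swap on the state field.
_guard = threading.Lock()


class IndexEntry:
    __slots__ = ("state",)

    def __init__(self):
        self.state = 0  # 0 means the entry is idle


def try_begin_read(entry):
    """Register a reader unless a writer is active or the count is full."""
    with _guard:
        if entry.state & WRITE_BIT or (entry.state & READER_MASK) == READER_MASK:
            return False
        entry.state += 1
        return True


def end_read(entry):
    with _guard:
        entry.state -= 1


def try_begin_write(entry):
    """Claim the entry for writing only if it is completely idle."""
    with _guard:
        if entry.state != 0:
            return False
        entry.state = WRITE_BIT
        return True


def end_write(entry):
    with _guard:
        entry.state = 0
```

In this sketch a caller that fails to acquire the entry simply gets `False` back and may retry or proceed to other work; how contention is actually handled is not stated in the text.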
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments, which proceeds with reference to the accompanying figures.