1. Field of the Invention
The present invention relates generally to Web content storage servers and more particularly to Web caching systems.
2. Description of the Prior Art
The Internet is rapidly becoming an important means of providing information and communicating with others, regardless of geographic location. One of the primary innovations responsible for the increase in use of the Internet is the World Wide Web. The World Wide Web (Web) is a set of protocols that enables users to access text, graphical data and other multimedia data from various geographic locations. This text, graphical data and other multimedia data, individually known as Web objects and collectively known as Web content, is typically organized as Web pages. A Web page may be implemented in any document markup language, such as Hypertext Markup Language (HTML) or Extensible Markup Language (XML). Document markup language commands for a Web page are stored in a file on a Web content server.
Other components associated with a Web page, such as images, graphs, charts, icons and other Web objects are typically each stored in a separate file. These components may be embedded in a Web page by including a reference to the embedded object in the document markup commands in a Web page file. An embedded object that is referenced in this way in a Web page is typically downloaded each time the Web page itself is downloaded. Alternatively, using document markup language commands in the Web page file, a Web object may be hyper-linked in a Web page, resulting in a hyper-link to the object being displayed rather than the object itself. To access the hyper-linked object, a user may select the hyper-link and the object is then downloaded. Although Web objects are stored in individual files, many or all of the files containing Web objects embedded in a particular Web page may be retrieved at nearly the same time as the file containing that Web page. Files containing Web objects hyper-linked in that Web page may be retrieved shortly thereafter.
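The distinction between embedded and hyper-linked objects can be illustrated with a minimal sketch that scans a Web page file and sorts its references into the two groups. The sketch assumes HTML markup and uses Python's standard `html.parser` module. The class name `ObjectReferenceParser`, the helper `classify_references`, the tag table, and the sample page are hypothetical and serve only as illustration.

```python
from html.parser import HTMLParser


class ObjectReferenceParser(HTMLParser):
    """Collects embedded and hyper-linked object references from HTML."""

    # Assumed (illustrative) table of tags whose attribute embeds an object.
    EMBEDDING_ATTRS = {"img": "src", "script": "src", "embed": "src"}

    def __init__(self):
        super().__init__()
        self.embedded = []     # typically downloaded together with the page
        self.hyperlinked = []  # downloaded only if the user selects the link

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in self.EMBEDDING_ATTRS:
            url = attr_map.get(self.EMBEDDING_ATTRS[tag])
            if url:
                self.embedded.append(url)
        elif tag == "a":
            url = attr_map.get("href")
            if url:
                self.hyperlinked.append(url)


def classify_references(markup):
    """Return (embedded, hyperlinked) reference lists for one page."""
    parser = ObjectReferenceParser()
    parser.feed(markup)
    parser.close()
    return parser.embedded, parser.hyperlinked


# Hypothetical sample page with two embedded images and one hyper-link.
PAGE = """
<html><body>
<img src="logo.gif">
<a href="report.html">Annual report</a>
<img src="chart.png">
</body></html>
"""

embedded, hyperlinked = classify_references(PAGE)
print("embedded:", embedded)
print("hyper-linked:", hyperlinked)
```

In this sketch, the files in the embedded list are the ones likely to be requested at nearly the same time as the page file, while those in the hyper-linked list may be requested shortly thereafter.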
To view a Web page, a computer user may launch a Web browser application on his computer. The Web browser allows a user to enter a Uniform Resource Locator (URL) that specifies the desired Web page. The Web browser then submits a request for the Web page over a communications network to an Internet Service Provider. The Internet Service Provider may satisfy the request in at least two ways.
The Internet Service Provider may submit the request across the Internet to a Web content origin server that stores the Web page data. The overall retrieval time for a request for a particular Web page includes the amount of time necessary to route the request to the Web content origin server, as well as the time to retrieve the Web page data from a storage device on the Web content origin server. Additionally, the overall retrieval time includes download time, which is the amount of time necessary to transfer the data from the Web content origin server to a client computer after the data has been retrieved from disk on the Web content origin server.
Alternatively, an Internet Service Provider may utilize a Web caching server (also commonly referred to as a Web caching proxy) to decrease the overall retrieval time for the request. This Web caching server may be located on the Internet Service Provider's premises, and it stores frequently accessed Web content. For accesses that “hit” in the cache, i.e., for which a current copy of the requested data is present in the cache, the request routing time and the download time are greatly reduced. A Web caching server also reduces the bandwidth used by the Internet Service Provider. However, the time necessary to retrieve Web page data from a disk on the Web caching server still remains a significant factor in overall retrieval time.
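The two ways of satisfying a request can be compared with a small arithmetic sketch. It assumes that the overall retrieval time is simply the sum of routing time, disk retrieval time, and download time, and that a cache miss costs the same as going directly to the origin server. All names (`PathTimes`, `expected_retrieval_ms`) and all millisecond figures are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class PathTimes:
    routing_ms: float   # time to route the request to the server
    disk_ms: float      # time to read the data from the server's disk
    download_ms: float  # time to transfer the data to the client

    def total(self):
        return self.routing_ms + self.disk_ms + self.download_ms


def expected_retrieval_ms(hit_ratio, cache, origin):
    """Average retrieval time when hit_ratio of requests hit the cache.

    Simplifying assumption: a miss costs exactly the origin path time.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache.total() + (1.0 - hit_ratio) * origin.total()


# Illustrative figures only.
origin = PathTimes(routing_ms=80.0, disk_ms=15.0, download_ms=200.0)
cache = PathTimes(routing_ms=5.0, disk_ms=15.0, download_ms=20.0)

print("origin path:", origin.total(), "ms")
print("cache hit path:", cache.total(), "ms")
print("average at 50% hits:", expected_retrieval_ms(0.5, cache, origin), "ms")
print("disk share of a cache hit:", cache.disk_ms / cache.total())
```

Under these assumed figures, the disk retrieval time is unchanged by caching and therefore accounts for a much larger share of a cache hit than of an origin retrieval.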
In some conventional Web caching servers, Web objects are stored by using file system commands for writing data to a magnetic disk. Most operating systems provide an electronic file system and directory structure in which system files, computer programs, and user generated files may be stored. In addition to providing a structure for electronic files, an operating system typically includes software routines and file system commands that may be used to store, modify and access files in the file system. However, the storage and retrieval routines provided with an operating system are generally not aware of the logical relationships between different files, and in particular they are not aware of embedding or hyper-linking relationships between files containing Web content. As a result of storing data in this manner, files and data that are logically related are often not co-located on a magnetic disk. Therefore, retrieval time for Web content stored on a magnetic disk is increased, as explained in the following paragraphs.
Magnetic disks are the most common type of storage device for Web content. One or more magnetic disks may be coupled to a Web content server or a Web caching server. A disk is a mechanical device, including one or more platters, a spindle on which the platters are mounted, and a disk arm having disk heads that read data from and write data to the disk. The disk operates in a continuous rotating motion at a fixed speed while the disk arm may be moved in and out to access portions of the disk. Each platter is divided into a number of annular disk tracks. The platters of the magnetic disk are arranged in a vertical stack, such that corresponding disk tracks on the platters may be accessed without requiring a movement by the disk arm. The corresponding disk tracks are collectively known as a cylinder. Each disk track is further divided into a number of disk sectors that are typically of a fixed size.
The amount of time necessary to retrieve data from a magnetic disk includes the time allotted to four operations. First, in order to retrieve data stored on a particular sector of the disk, the proper platter is selected in a process called head selection. Second, a seek is performed such that the disk arm is moved to place a disk head over the proper track. Third, a time period called rotational latency is required to allow the proper sector on the track to arrive beneath the disk head. Finally, the data is transferred from the sector of the disk. Of the four operations, the seek is the most time-consuming and dominates the other three. Furthermore, the seek time grows considerably with the length of the seek (the number of cylinders between the start and the end of the seek).
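The four operations can be combined into a rough timing model to show why co-located data is faster to read. The following is a sketch under stated assumptions: the seek cost is modeled as a fixed settle time plus a linear per-cylinder cost (real seek curves are not linear), the rotational latency is taken as half a rotation on average, and every constant and function name (`seek_ms`, `read_time_ms`) is hypothetical.

```python
def seek_ms(distance_cylinders, settle_ms=2.0, per_cylinder_ms=0.002):
    """Assumed seek cost: zero if the arm stays put, otherwise a settle
    time plus a cost that grows with the number of cylinders crossed."""
    if distance_cylinders == 0:
        return 0.0
    return settle_ms + per_cylinder_ms * distance_cylinders


def read_time_ms(cylinders, start_cylinder=0, rpm=7200,
                 head_select_ms=0.1, transfer_ms=0.1):
    """Total time to read one sector at each cylinder, in the given order."""
    rotation_ms = 60_000.0 / rpm
    total = 0.0
    position = start_cylinder
    for cylinder in cylinders:
        total += head_select_ms                       # head selection
        total += seek_ms(abs(cylinder - position))    # seek
        total += rotation_ms / 2.0                    # average rotational latency
        total += transfer_ms                          # data transfer
        position = cylinder
    return total


# Four sectors stored on neighboring cylinders versus spread across the disk.
co_located = [100, 100, 101, 100]
scattered = [100, 9000, 400, 7000]

print("co-located:", round(read_time_ms(co_located), 2), "ms")
print("scattered: ", round(read_time_ms(scattered), 2), "ms")
```

In this model the only term that differs between the two layouts is the seek time, so any difference in the totals is attributable to arm movement alone.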
Electronic files and data are typically stored on a magnetic disk as one or more disk blocks. Generally, a disk block is a fixed-size series of bytes (e.g., 512 bytes) that is allocated by a file system to store a portion of a file. An electronic file may be stored as several disk blocks located at different tracks or platters on a magnetic disk. Typically, each file includes an index to all of the disk blocks for that file. In a UNIX® operating system, each file has an inode that stores administrative information about the file, including an index to all of its disk blocks.
The concept of allocating disk blocks for a single file in co-located positions on the disk and thereby reducing disk access times is well known. McKusick et al. teach a UNIX® file system, the UNIX® Fast File System (FFS), in which allocation of disk blocks for each file is optimized for the purpose of reducing the number of seek operations necessary to read the file. See M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry, “A fast file system for UNIX®,” ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197 (August 1984). The FFS uses the concept of cylinder groups to facilitate file allocation. A cylinder group is a collection of neighboring cylinders. The UNIX® FFS attempts to allocate all inodes of files from the same file system directory in the same cylinder group on a magnetic disk. Additionally, the UNIX® FFS attempts to allocate all disk blocks of a particular file in the same cylinder group as its corresponding inode. Although the UNIX® FFS attempts to store files according to the previously mentioned algorithm, if disk blocks in the desired positions are not available, the UNIX® FFS will store the data in other locations on the disk. The system of McKusick et al. differs from the present invention in that the FFS merely attempts to allocate the disk blocks of a single file in co-located positions on the disk, but does not attempt to store related files in co-located positions. Additionally, the FFS does not provide a method of decreasing disk retrieval times for Web content by taking advantage of the embedding or hyper-linking relationships between files representing Web content.
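The allocation policy described above can be sketched as follows. This is a deliberately simplified illustration and not the FFS implementation: it tracks only a free-block count per cylinder group, assumes that a file's inode lives in its directory's preferred group, and places a new directory in the group with the most free blocks. The class `CylinderGroupAllocator` and its methods are hypothetical names.

```python
class CylinderGroupAllocator:
    """Toy model of directory-based cylinder group allocation."""

    def __init__(self, num_groups, blocks_per_group):
        self.free = [blocks_per_group] * num_groups  # free blocks per group
        self.dir_group = {}                          # directory -> preferred group

    def group_for_directory(self, directory):
        # Assumed policy: a new directory goes to the emptiest group.
        if directory not in self.dir_group:
            groups = range(len(self.free))
            self.dir_group[directory] = max(groups, key=self.free.__getitem__)
        return self.dir_group[directory]

    def _pick_group(self, preferred):
        # Try the preferred group first, then fall back to the nearest
        # group that still has a free block.
        count = len(self.free)
        for offset in range(count):
            for candidate in (preferred + offset, preferred - offset):
                if 0 <= candidate < count and self.free[candidate] > 0:
                    return candidate
        raise OSError("no free disk blocks")

    def allocate_file(self, directory, num_blocks):
        """Return the cylinder group chosen for each block of a new file."""
        preferred = self.group_for_directory(directory)
        placement = []
        for _ in range(num_blocks):
            group = self._pick_group(preferred)
            self.free[group] -= 1
            placement.append(group)
        return placement


allocator = CylinderGroupAllocator(num_groups=3, blocks_per_group=4)
page = allocator.allocate_file("/pages", 3)    # fits in the preferred group
page2 = allocator.allocate_file("/pages", 3)   # spills into a neighboring group
image = allocator.allocate_file("/images", 2)  # different directory
print(page, page2, image)
```

In this sketch, a file in `/images` lands in a different cylinder group from a page in `/pages` even if the page embeds that image, because the policy considers only directory membership and not embedding or hyper-linking relationships.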
In another reference, Rosenblum et al. teach a UNIX® file system, the log-structured file system (LFS), in which the allocation of disk space on a magnetic disk is optimized in order to improve the performance of write operations. See M. Rosenblum and J. K. Ousterhout, “The design and implementation of a log-structured file system,” ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52 (February 1992). All data for files, including inodes and data blocks, is written to a sequential log on the disk. As a result, write operations are fast because seek operations are avoided while writing. However, the performance of read operations is not improved in the system set forth in Rosenblum et al. Also, the system does not attempt to store related files in co-located positions on the disk, and does not provide a method for decreasing disk retrieval times for Web content by taking advantage of the embedding or hyper-linking relationships between files representing Web content.
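The log-structured approach can be illustrated with a toy in-memory model in which data blocks and inodes are appended to a single sequential log. This is a sketch under simplifying assumptions and not the LFS design itself: the log is a Python list, the inode map is keyed by file name rather than inode number, and segment cleaning is omitted. The class `LogStructuredStore` and its methods are hypothetical.

```python
class LogStructuredStore:
    """Toy append-only log holding data blocks and inodes."""

    def __init__(self):
        self.log = []        # sequential log of (file_name, kind, payload)
        self.inode_map = {}  # file name -> log position of its latest inode

    def write_file(self, name, blocks):
        # Every write is an append, so no seek is needed while writing.
        positions = []
        for data in blocks:
            positions.append(len(self.log))
            self.log.append((name, "data", data))
        self.inode_map[name] = len(self.log)
        self.log.append((name, "inode", tuple(positions)))

    def read_file(self, name):
        # Reading follows the inode to wherever the blocks landed in the log.
        _, _, positions = self.log[self.inode_map[name]]
        return [self.log[position][2] for position in positions]


store = LogStructuredStore()
store.write_file("page.html", [b"<html>", b"</html>"])
store.write_file("unrelated.dat", [b"x"])
store.write_file("logo.gif", [b"GIF"])

print(store.read_file("page.html"))
print({name: pos for name, pos in store.inode_map.items()})
```

In this model, the position of a file in the log depends only on when it was written, so a page and an object embedded in it end up adjacent only if they happen to be written one after the other.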
Therefore, it would be beneficial to provide a system for storing related files, and in particular Web objects with correlated retrieval times, such that the amount of seek time required to retrieve the files is reduced.