1. Field of the Invention
This invention relates generally to the caching of Internet pages.
2. Prior Art
Information on the Internet is represented by files stored in computers running programs called servers, and is accessed by users with computers running computer programs called clients. The Internet includes several different services, the most popular being the World Wide Web, which is often simply referred to as the Web. Information on the Web is provided by Web servers. A client called a Web browser is usually used for accessing the Web, but there are other tools that can be used. Many Web browsers can also access other services on the Internet, such as FTP (File Transfer Protocol) and Gopher.
Information on the Web is currently represented by specially formatted text files called Web pages, each of which is a text document written in HTML (Hypertext Markup Language) or another language, such as XML, HDML, or VRML. Each page has a Web address called a URL (Uniform Resource Locator). A typical page also includes hyperlinks, which are either underlined text or graphical images that point to the addresses of other pages, so that when one is clicked or selected, a request will be sent out for the associated page. For example, when the Products hyperlink in a merchant's home page is selected, a request is sent out by the client to the address specified in the hyperlink, and the associated Web page is downloaded and displayed, replacing the home page on the user's screen with a new page showing the merchant's products. A hyperlink may simply be referred to as a link.
Browsers or clients typically use a communication protocol called HTTP (Hypertext Transfer Protocol) to request pages from Web servers. HTTP is a request/response protocol. Through a connection established between a client and a server, a request is sent by the client to the server, and a response is provided by the server to the client. Due to the vastness and complexity of the Internet, there are usually intermediaries between the client and the origin server, which is the server that the request is intended for. Typically, a request generated by a client is relayed to the origin server by one or more other servers, such as proxy servers and gateways. A proxy server is an intermediary program which acts as a server for receiving requests from clients, and also as a client for forwarding the requests on to other servers. A gateway is also an intermediary, but it receives requests as if it were the origin server. Any server that only passes along a request is called a tunnel. Many servers temporarily store files that pass through them in a local store called a cache. Except for tunnels, a response to a request could be returned by any intermediary server if the file being requested is in its cache.
A frustrating aspect of using the Internet is the long delays associated with downloading pages from the Web or other services. Therefore, a cache is also typically implemented by a client. Whenever a page is received by the client, it is stored in the cache. Some pages are cacheable, i.e., they are identified as being allowed in the cache according to various parameters, whereas other pages are not cacheable. After prolonged Internet use, the cache would be filled with a very large number of pages. When a page is requested by the client, such as by clicking on a link, a hit test is first performed on the cache, i.e., the cache is checked for the presence of the page. If there is a hit and the cached page is valid (e.g., not expired), the page is fetched from the cache, so that it is displayed almost instantly, and the user is spared from downloading the page through the slow communication link. If there is a hit with an invalid response, if there is no hit, or if the client does not maintain a cache, a request is sent by the client to the origin server. When the request is received by an intermediary server, such as a proxy server, a hit test will be performed on its cache. If there is a hit with a valid response, the requested page is sent by the intermediary server to the client. The response is treated by the client as if it were sent by the origin server, so that the request is fulfilled. If there is a hit with an invalid response (e.g., the page has expired), if there is no hit, or if the intermediary server does not maintain a cache, the request is forwarded to another server. A cache has a finite capacity for storing pages, so that older pages are constantly being replaced by more recently received pages. Some pages show frequently updated information, such as stock quotes or weather, so a code is included in these pages to prevent them from being cached, i.e., they are not cacheable.
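By way of illustration only, the client-side hit test described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the mechanism of any particular browser: the names PageCache, request_page, and fetch are hypothetical, expiry is modeled as a simple numeric timestamp, and replacement of older pages is modeled as removal of the oldest stored entry.

```python
from collections import OrderedDict


class PageCache:
    """Hypothetical finite-capacity page cache; oldest entries are replaced first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # url -> (body, expires_at)

    def store(self, url, body, expires_at, cacheable=True):
        # Pages marked as not cacheable (e.g., stock quotes) are never stored.
        if not cacheable:
            return
        self._entries.pop(url, None)
        self._entries[url] = (body, expires_at)
        # Finite capacity: drop the oldest stored pages to make room.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def lookup(self, url, now):
        """Hit test: return the body on a valid hit, otherwise None."""
        entry = self._entries.get(url)
        if entry is None:
            return None  # no hit
        body, expires_at = entry
        if now >= expires_at:
            return None  # hit, but the cached page has expired
        return body


def request_page(url, cache, fetch, now):
    """Serve from the cache on a valid hit; otherwise send the request onward.

    `fetch` is an assumed callable returning (body, expires_at, cacheable).
    """
    body = cache.lookup(url, now)
    if body is not None:
        return body, "cache"
    body, expires_at, cacheable = fetch(url)
    cache.store(url, body, expires_at, cacheable)
    return body, "network"
```

In this sketch, a second request for the same unexpired page is answered from the cache without calling fetch, whereas an expired or non-cacheable page always results in a new request.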
Any server between the client and the origin server, other than a tunnel, that implements caching can respond to a request if its cache can provide a valid response. When the requested file is found and passed to the client, any server other than a tunnel may save a copy of it into its cache if the page is cacheable. When a requested page is received by the client, it will be displayed and also saved to the client's cache if it is cacheable. The client will wait for the user to select another link or enter another address before generating another request.
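The relaying of a request through intermediaries may likewise be sketched, purely as an illustration. The classes OriginServer, Tunnel, and CachingProxy below are hypothetical names introduced for this example, and expiry checking is omitted for brevity. The sketch shows only that a tunnel passes a request along without caching, while a caching intermediary may answer from its own cache and may save cacheable responses.

```python
class OriginServer:
    """Hypothetical origin server holding the authoritative pages."""

    def __init__(self, pages):
        self.pages = pages  # url -> (body, cacheable)
        self.requests_seen = 0

    def handle(self, url):
        self.requests_seen += 1
        return self.pages[url]


class Tunnel:
    """Only passes the request along; it never caches or answers itself."""

    def __init__(self, upstream):
        self.upstream = upstream

    def handle(self, url):
        return self.upstream.handle(url)


class CachingProxy:
    """Answers from its cache on a hit; otherwise forwards and saves the reply."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}  # url -> body (expiry omitted in this sketch)

    def handle(self, url):
        if url in self.cache:
            # Anything held in the cache was cacheable when it was stored.
            return self.cache[url], True
        body, cacheable = self.upstream.handle(url)
        if cacheable:
            self.cache[url] = body
        return body, cacheable


# Example chain: client -> caching proxy -> tunnel -> origin server.
origin = OriginServer({
    "/home": ("home page", True),
    "/quotes": ("stock quotes", False),
})
chain = CachingProxy(Tunnel(origin))
```

With this chain, repeated requests for "/home" reach the origin server only once, while every request for the non-cacheable "/quotes" page is relayed all the way to the origin server.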
Due to the vast number of pages available on the Web, a requested page is not likely to be cached by the client or any server along the connection, unless the page was recently visited by the user. Therefore, most of the time when a link is clicked and the associated page is requested, the user has to wait for the page to be downloaded. The downloading time can typically range from several seconds to over a minute. Much of the Web surfing experience thus consists of waiting for pages to be downloaded.
Most proxy servers implement a caching mechanism very similar to that employed by clients. Since a proxy server serves many clients, its cache is usually very large and its caching scheme is elaborate. However, the basic principle of a proxy's caching mechanism is the same, i.e., return a page to a client if there is a valid response from the proxy's cache; otherwise forward the request to another server, and when the response is received, save it in the cache and also forward it to the client.
When a user is reading a Web page with a client, the processor and communication modules are idle, simply waiting for the user to click on another link. Such wasted processing and communicating capabilities are put to use by some products, such as a browser plug-in sold under the trademark "NETACCELERATOR" by IMSI. When a user is reading a page and not clicking on any link, the addresses specified by the links on the page are automatically contacted and their associated pages downloaded into a cache by "NETACCELERATOR." Because these pages are downloaded while the user is occupied with reading the displayed page, their associated downloading times are transparent to the user. Theoretically, when a link on the displayed page is eventually clicked by the user, the associated page is already cached, so that it will be displayed almost instantly, and the downloading time is not experienced by the user. However, many Web pages contain a large number of links. Caching all the pages associated with all the links can take many minutes or even hours, which may be much longer than the time spent by the user on the original page. As a result, the pages for some links may not yet be cached, so that the user will still experience downloading time for such links. No information is disclosed by IMSI about the particular order, if any, in which the pages are downloaded. Therefore, its caching scheme may not be the most efficient.
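The general read-ahead idea described above, and its limitation when a page has many links, may be illustrated by the following sketch. It does not represent the actual operation of "NETACCELERATOR," which is not disclosed. The download order (document order), the function names extract_links and read_ahead, the fetch callable, and the notion of a fixed idle-time budget in seconds are all assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of anchor tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def read_ahead(html, fetch, cache, idle_seconds):
    """Prefetch linked pages in document order until the idle time runs out.

    `fetch` is an assumed callable returning (body, seconds_taken).
    `cache` is a plain dict mapping url -> body.
    """
    remaining = idle_seconds
    for url in extract_links(html):
        if url in cache:
            continue  # already cached, nothing to download
        body, seconds_taken = fetch(url)
        if seconds_taken > remaining:
            break  # the user moved on before this download finished
        cache[url] = body
        remaining -= seconds_taken
    return cache
```

For instance, if a page has three links, each linked page takes 10 seconds to download, and the user spends 25 seconds reading, only the first two linked pages are cached in this sketch; a click on the third link still incurs the full downloading time.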
A product sold under the trademark "GOT IT!" by Go Ahead Software is also a client plug-in for downloading pages associated with links on a Web page. Another product sold under the trademark "GOT IT! ENTERPRISE" by Go Ahead Software is for downloading pages to local servers. No information is disclosed by Go Ahead Software about the particular order, if any, in which the pages are downloaded by either product. Therefore, their caching schemes may not be the most efficient. These prior art read-ahead caching programs will initially benefit individual users. However, if they are used by a large number of users, they can overload the Internet with a greatly increased number of requests, so that all users will end up suffering even greater delays.