In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A “client” computer connected to the Internet can download digital information from “server” computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the “Gopher” document protocol.
HTTP is used to access data on the World Wide Web, often referred to as “the Web.” The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites located around the world that maintain and distribute documents. The location of a document on the Web is typically identified by a document address specification commonly referred to as a Uniform Resource Locator (URL). A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hypertext Markup Language (HTML). An HTML document contains text and metadata or commands providing formatting information. HTML documents also include embedded “links” that reference other data or documents located on any Web server computer. The referenced documents may represent text, graphics, or video in respective formats.
A Web browser is a client application or operating system utility that communicates with server computers via FTP, HTTP, and Gopher protocols. Web browsers receive documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, of Redmond, Wash., is an example of a popular Web browser application.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
Generally, a proxy server is a server that sits between a secure network, such as a corporate intranet, and a non-secure network, such as the Internet. It processes requests from computers on the intranet for access to resources on the Internet, while limiting or blocking access to the intranet from external computer systems. For efficiency, it may in some cases fulfill these requests itself, for example from a locally stored copy of the requested resource.
In a typical proxy server implementation, the proxy server operates to filter requests for Web pages from the corporate intranet to the Internet. The proxy server routes Web page requests to the non-secure network and, upon receipt of a requested Web page from the non-secure network, forwards the Web page to the end user.
Proxy servers are often configured with a local cache area, which may be located on a disk drive, in which previously accessed Web pages are stored. Upon receipt of a request for a previously accessed Web page, the proxy server can access the copy of the Web page stored on the local disk rather than requesting the page from the non-secure network.
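The cache lookup described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular proxy server: the class name `CachingProxy`, the `fetch_from_network` callable, and the in-memory dictionary standing in for the disk cache are all assumptions introduced for the example.

```python
from typing import Callable, Dict


class CachingProxy:
    """Hypothetical proxy that keeps copies of previously accessed pages."""

    def __init__(self, fetch_from_network: Callable[[str], bytes]) -> None:
        # fetch_from_network is an assumed stand-in for a request sent
        # across the non-secure network.
        self._fetch = fetch_from_network
        # An in-memory dict stands in for the local cache area on disk.
        self._cache: Dict[str, bytes] = {}
        self.network_fetches = 0

    def get(self, url: str) -> bytes:
        # Previously accessed page: serve the local copy.
        if url in self._cache:
            return self._cache[url]
        # First access: fetch across the network and keep a copy.
        self.network_fetches += 1
        page = self._fetch(url)
        self._cache[url] = page
        return page
```

Note that this sketch never refreshes a stored copy, so a second request for the same URL always returns the cached page even if the original has since changed.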
Thus, the cache contains copies of Web pages, while the actual Web pages exist on the non-secure network. Of course, the actual Web pages may, and often do, change. When a Web page on the non-secure network changes, the copy of the Web page stored in cache becomes out-of-date. In order to minimize the probability that an out-of-date Web page will be routed to a user, it is necessary to periodically refresh the cache, i.e., re-fetch the Web page from the non-secure network.
In existing proxy servers, the decision of whether to re-fetch a Web page is made by referencing information stored in the Web page header. Generally, a Web page header may contain an expiration date and a modification time. The expiration date identifies an estimated date after which the Web page can no longer be considered current, and the modification time identifies the time the Web page was last modified. In existing proxy servers, if a Web page's expiration date has passed, the proxy server compares the modification time of the Web page stored on the non-secure network with the modification time stored on the proxy server and, if the two differ, issues a request across the non-secure network for a new copy of the Web page. Thus, if the modification time indicates that the Web page has changed, the Web page on the proxy server is updated.
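The header-based decision described above might look roughly like the following sketch. The names `CachedPage`, `is_expired`, and `needs_refetch` are hypothetical, and the `origin_last_modified` argument is assumed to have been obtained by contacting the host server across the non-secure network. Treating a missing expiration date as "not expired" is also an assumption made only to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CachedPage:
    body: bytes
    expires: Optional[datetime]        # expiration date from the header, if present
    last_modified: Optional[datetime]  # modification time from the header, if present


def is_expired(page: CachedPage, now: datetime) -> bool:
    # Assumption: a page with no expiration date is treated as not expired.
    return page.expires is not None and now > page.expires


def needs_refetch(page: CachedPage, now: datetime,
                  origin_last_modified: Optional[datetime]) -> bool:
    """Return True when the cached copy should be replaced.

    origin_last_modified is the modification time reported by the host
    server; obtaining it requires a connection across the network.
    """
    if not is_expired(page, now):
        return False
    # Expired: re-fetch only if the modification times differ.
    return origin_last_modified != page.last_modified
```

For example, a page whose expiration date has passed but whose modification time matches the one reported by the host server is left in the cache, whereas a differing modification time triggers a re-fetch.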
There are, however, problems presented in relying on header information for making re-fetch decisions. For example, the header information for many Web pages does not include expiration dates and modification times, thereby making it impossible to rely on this information for re-fetch decisions. Additionally, the expiration date, even when present, is not necessarily reliable as it represents only an estimate of when a Web page may be changed. Furthermore, Web page header information is stored with the actual Web pages on the non-secure network. In order to check the modification time for a Web page and make a re-fetch decision, it is necessary to access the modification time across the non-secure network. Making connections over the non-secure network slows the decision process and adds to system overhead.
Therefore, it is desirable to have an improved proxy server. More specifically, it would be a significant improvement in the art to have a mechanism by which a proxy server can selectively access either an original document located across a network or a previously retrieved copy of the document stored locally in cache, based in part on the probability that the document has actually changed in some substantive way since it was last accessed. Preferably, such a mechanism will make the decision to access or not to access the original Web document without having to establish a connection with a host server that stores the original of the document. The mechanism would also preferably provide a way to continually improve the accuracy of its decisions to retrieve a document either from cache or across a network, based on the actual experience of the proxy server as it tracks changed documents encountered during Web accesses. If the proxy server decides to access a document across the Web rather than use the copy in cache, the mechanism should provide a way to quickly and accurately determine whether the original document has indeed changed. The present invention is directed to providing such a mechanism.