1. Field of the Invention
The invention is generally related to the Internet and more specifically related to the problem of accessing and tracking content that is accessible via the Internet.
2. Description of Related Art
The Internet, and in particular the World Wide Web that is made possible by the Internet's HTTP protocol, have revolutionized the way in which we access information. FIG. 1 shows how the information access system 101 provided by World Wide Web 123 looks to a user of a computer 127 that has a Web browser and a hard drive 129 for persistent storage of data. Such a system is termed a Web client 125. In addition to Web clients 125, system 101 contains Web servers 111 that are accessible via World Wide Web 123. As shown in detail with regard to Web server 111(a), a Web server includes a processor 113(a) and data storage 119(a) which contains documents 121 which are accessible via the Web. These documents are termed Web documents in the following. A Web document 121 may contain any kind or mixture of kinds of information; it may for example be an image or an audio file as well as a text document.
To access a document on the World Wide Web, a user of a Web browser in client 125 provides a URL (uniform resource locator) for the Web document to Web 123. Web 123 routes the URL to a web server 111(i) that contains the Web document specified by the URL. Web server 111(i) responds to the URL by providing the specified Web document via the Web to Web client 125. The browser then displays the Web document. Web documents typically contain links, i.e., URLs to other Web documents. When a user selects one of these links by clicking on it, the browser provides the URL to Web 123 and that Web document is provided to the web client by the Web server in which the Web document resides as just described.
An example URL is shown at 103. A URL has three main components: protocol 105, which specifies the Internet protocol that will be used to retrieve the Web document, in this case the HTTP protocol which is used in the World Wide Web; host name 107, which specifies Web server 111(i) upon which the Web document is stored; and Web page source info 109, which specifies how the Web document is to be located or otherwise produced in Web server 111(i). In example URL 103, Web page source info 109 is a pathname which indicates how the Web document is to be located in a file system accessible to Web server 111(i); in other URLs, Web page source info 109 may specify a program that queries a database to locate the Web document or even a program that constructs all or part of the Web document on the fly. Web page source info 109 is interpreted in Web server 111(a) by executing source info interpretation code 117(a).
The complete syntax for a URL is the following:
<protocol_name>://<host_name>:<port_no>/<pathname>?<parameter_list>
The <protocol_name>, <host_name>, and <pathname> have already been explained. <port_no> specifies the port on which Web server 111(a) is listening for the information specified by Web page source info 109; application programs for widely-used protocols such as the HTTP protocol have default port numbers which client 125 supplies for the protocol if no port number is specified in the URL. <parameter_list> is a list of parameters which are interpreted by source info interpretation code 117; the parameters may specify a program to be executed and data parameters for the program. The parameter list is made up of one or more parameter name-parameter value pairs that are separated by an & character:
<parameter_name>=<parameter_val>&...&<parameter_name>=<parameter_val>
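The URL syntax described above can be illustrated with a minimal sketch that splits a URL into the named components using Python's standard urllib.parse module. The function name split_url, the DEFAULT_PORTS table, and the example URLs on www.example.com are hypothetical and are introduced here only for illustration; they are not part of the system of FIG. 1.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed defaults that a client supplies when the URL omits <port_no>.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def split_url(url):
    """Split a URL into the components named in the syntax above."""
    parts = urlsplit(url)
    # Use the explicit port if present; otherwise fall back to the
    # default port for the protocol (None if the protocol is unknown).
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol_name": parts.scheme,
        "host_name": parts.hostname,
        "port_no": port,
        "pathname": parts.path.lstrip("/"),
        # Name-value pairs separated by '&', in the order given.
        "parameter_list": parse_qsl(parts.query),
    }


# Hypothetical example URL with no explicit port number.
example = split_url("http://www.example.com/docs/manual.html?version=2&lang=en")
for name, value in example.items():
    print(name, "=", value)
```

In this sketch the port number for the example URL is filled in from the default table because the URL does not specify one, which mirrors the behavior described for client 125; a URL such as http://www.example.com:8080/cgi-bin/lookup?id=42 would instead yield the explicit port 8080.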
Whenever a Web client 125 is connected to a physical network that provides access to World Wide Web 123, Web client 125 can access any Web server 111 that is operative at that time. Since most Web servers operate continually, most information that is available via the World Wide Web is available at any time from anywhere. Because that is so, Web users tend not to make copies of information that they have retrieved in Web client 125; instead, they save the URL of the Web document that contains the information in a list 131 of interesting URLs. One example of such a list is the “Favorites” or “Bookmarks” list provided by most Web browsers. When the user wants to access the information again, the user simply clicks on the URL in the Favorites list and thereby provides the URL to the browser.
Saving URLs instead of the Web documents they refer to has both advantages and disadvantages. Both stem from the dynamic nature of the World Wide Web. A URL is not a kind of library card catalog number for a Web document. A library card catalog number for a book uniquely identifies a particular edition of a book. If a new edition of the book comes out, it receives a new library card catalog number. The new card catalog number will be similar to the number for the other edition, since both editions will be classified in the same manner, but it will not be identical to the number for the other edition. Because each edition has its own library card catalog number, a reader who writes down the card catalog number for a particular edition and ten years later presents the number to a library that has that edition will receive the edition.
A URL, by contrast, only identifies a Web server 111(i) and a Web document which the server will return in response to the Web page source info. There is no guarantee that the server specified by the URL will be available or even still exists, or that the Web document that the server will return is the same as the one that was there when the client saved the URL. What is actually returned is completely up to the server. The advantage of this arrangement is that what the server generally returns is the most recent version of the Web document. With many Web documents, for example, those which contain weather reports or stock market prices, that is exactly what is desired. The disadvantage is that older versions of the Web document are no longer accessible by the URL and may not be accessible at all. It is further generally not clear what relationship the currently-accessible Web document has to the older versions. One area where this causes difficulty is documentation for software. Increasingly, the manufacturer of the software provides such documentation by the World Wide Web; if the URL for the documentation specifies the current version of the software, a user who has an older version may be left with no documentation at all. About the only way the user of a Web browser 127 has to deal with this problem is to save a local copy of the documentation in his Web client. In so doing, of course, the user loses one of the most important advantages of the Web: the ability to save URLs instead of copies.
One attempt that has been made to deal with this problem is to establish Web archiving services such as the one found at www.archive.org. Such services have all of the problems of general-purpose archives: they are huge, but often do not have what the individual needs, and individuals typically have little or no input into what the archive saves. Additionally, vast amounts of the information accessible to a Web client are not publicly available and therefore will not be archived by an archiving service. This situation occurs when the Web server is behind a firewall which separates the public Internet from a so-called intranet which employs the Internet but is accessible only to Web clients known to the organization to which the intranet belongs. The server is thus accessible to Web clients that are also behind the firewall or that are known to the firewall, but not to Web clients in general. Such intranets are now one of the preferred ways of communicating within organizations.
It is an object of the invention disclosed herein to provide techniques for overcoming the foregoing problems of accessing documents by means of their URLs.