The Internet enables virtually instantaneous worldwide distribution (publication) of information at relatively low cost. As a result, a mind-boggling amount of information is posted and available on the Internet. In fact, the ease with which information can be published over the Internet and the concomitant ease with which such information can be updated, modified, or deleted create a novel set of disadvantages for those who use the Internet for information publication purposes. Namely, the fantastically dynamic nature of the Internet makes it very difficult for an Internet publisher whose publications undergo frequent and/or regular changes to know when and where each version of a given publication was published on the Internet.
The Internet is an international network of computers and computer networks connected to each other through routers using the Internet Protocol (“IP”) and sharing a common name and address space. One can communicate with any computer connected to the Internet simply by establishing a connection to an Internet router or node. The Internet is not a corporation or administrative arrangement; it is a method for connecting computer systems and the phenomenon of very widespread adherence to that method.
The Internet began in the 1960s with federally subsidized connections among universities and government research laboratories. It is the outgrowth of what began in 1969 as an experimental project of the United States Department of Defense's Advanced Research Projects Agency (“ARPA”) called “ARPANET,” which was designed to enable computers operated by the military, defense contractors, and universities conducting defense-related research to communicate with one another by redundant channels even if some portions of the network were damaged by, for example, a war, a natural disaster, or a technical failure. The network later allowed researchers across the country to access directly and to use extremely powerful supercomputers located at a few key universities and laboratories. During the early days of the Internet, traffic unrelated to research and education was limited. But by approximately 1990, the Internet's potential as a model for an international information infrastructure had been recognized, and the federal government began to reduce the subsidy and to encourage private entities to take over responsibility for basic communication and traffic management functions. By 1995, the Internet had become a predominantly private and unsubsidized network. The Internet is now the quintessential open network.
From its inception, the network was designed to be a decentralized, self-maintaining series of redundant links between computers and computer networks, capable of rapidly transmitting communications without direct human involvement or control and with the automatic ability to reroute communications if one or more individual links were damaged or unavailable.
To achieve this resilient nationwide (and ultimately global) communications medium, the ARPANET encouraged the creation of multiple links to and from each computer (or computer network) on the network. Thus, a computer in Washington, D.C., might be linked (usually using dedicated telephone lines) to other computers in neighboring states or on the Eastern seaboard, which themselves would be linked to other computers.
A communication sent over this redundant series of linked computers could travel any of a number of routes to its destination. Thus, a message sent from a computer in Washington, D.C., to a computer in Palo Alto might first be sent to a computer in Philadelphia and then be forwarded to a computer in Pittsburgh and then to Chicago, Denver, and Salt Lake City, before finally reaching Palo Alto. If the message could not travel along that path (because of military attack, simple technical malfunction, or other reason), the message would automatically (without human intervention or even knowledge) be rerouted, perhaps, from Washington, D.C., to Richmond and then to Atlanta, New Orleans, Dallas, Albuquerque, Los Angeles, and finally to Palo Alto. This type of transmission and rerouting would likely occur in a matter of seconds.
The nature of the Internet is such that it is very difficult, if not impossible, to determine its size at a given moment. It is indisputable, however, that the Internet has experienced extraordinary growth in recent years. In 1981, fewer than 300 computers were linked to the Internet, and by 1989, the number stood at fewer than 90,000 computers. By 1993, over 1,000,000 computers were linked. At the end of the twentieth century, over 10,000,000 host computers worldwide, of which approximately sixty percent were located in the United States, were estimated to be linked to the Internet. This count does not include the personal computers people use to access the Internet. All told, reasonable estimates as of the beginning of the twenty-first century are that as many as 200,000,000 people around the world, and possibly more, can and do access the enormously flexible communication medium that is the Internet.
The World Wide Web (“Web”) is the best-known and most popular way of using the Internet. The Web comprises an epic assortment of displayed documents, which can contain text, images, sound, animation, moving video, and any other conceivable multimedia. Consistent with the decentralized essence of the Internet, documents on the Web are not collected in any central location; rather, they are stored on servers around the world running Web server software. To gain access to the content available on the Web, a user must have a Web browser—client software such as Netscape's NAVIGATOR® or Microsoft's INTERNET EXPLORER®, which are capable of displaying documents formatted in hypertext markup language (“HTML”), the standard Web formatting language. Each document has an address, known as a Uniform Resource Locator (“URL”), identifying, among other things, the server on which it resides. Most documents also contain “hyperlinks” —highlighted text or images that, when selected by the user, permit him or her to view another, related Web document. Because Web servers are linked to the Internet through a common communications protocol, known as hypertext transfer protocol (“HTTP”), a user can move seamlessly between documents, regardless of their physical location. When a user viewing a document located on one server selects a link to a document located elsewhere, the browser will automatically contact the second server and display the linked document.
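As a small illustration of how a URL identifies, among other things, the server on which a document resides, the following Python sketch splits a URL into its components using the standard library. The URL shown is a hypothetical example and is not taken from any of the references discussed herein.

```python
from urllib.parse import urlsplit

# Hypothetical example URL; any well-formed HTTP URL would serve.
url = "http://www.example.com/catalog/prices.html"

parts = urlsplit(url)
scheme = parts.scheme    # protocol used to retrieve the document
server = parts.hostname  # server on which the document resides
path = parts.path        # location of the document on that server

print(scheme, server, path)
```

A browser following a hyperlink performs essentially this decomposition before contacting the named server.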
Many laypeople erroneously believe that the Internet is coextensive with the Web. The Web really is a publishing forum that is a subset of the Internet; it comprises millions (inevitably soon to be billions) of separate Websites that display content provided by particular people or organizations. Thus, when reference is made herein to the Internet, such reference includes the Web, whereas reference to the Web does not include other parts of the Internet. The Web is thus comparable, from the readers' perspective, to both a vast library including millions of readily available and indexed publications and a sprawling mall offering goods and services. From the publishers' perspective, it constitutes a vast platform from which to address and hear from a global audience of millions of readers, viewers, researchers, and buyers. Any person or organization with a computer connected to the Internet can “publish” information. As used herein, the term “publish” means to make content available to the public at large by posting it on the Internet. Publishers include government agencies, educational institutions, commercial entities, advocacy groups, and individuals. Publishers may either make their material available to the entire pool of Internet users or confine access to a selected group, such as those willing to pay for the privilege.
Web standards are sophisticated and flexible enough that they have grown to meet the publishing needs of many large corporations, banks, brokerage houses, newspapers, and magazines, which now publish “online” editions of their materials, as well as government agencies, and even courts, which use the Web to disseminate information to the public. At the same time, Web publishing is simple enough that thousands of individual users and small community organizations are using the Web to publish their own personal “home pages,” the equivalent of individualized newsletters, brochures, catalogs, etc., about the person or organization, which are available to everyone on the Web. Publication on the Web simply requires placing a formatted file on a host computer.
For commercial users, the Web is the most important part of the Internet. Unlike previous Internet-based communications formats, the Web is easy to use for people inexperienced with computers. Information on the Web can be presented on pages of text and graphics (“Web pages”) that contain hyperlinks to other Web pages—either within the same set of data files (“Website”) or within data files located on other computer networks. Users access information on the Web using browsers, which process information from Websites and display the information using graphics, text, sound, and animation. Because of these capabilities, the Web has become a popular medium for advertising and for direct consumer access to goods and services.
Commerce is one area in which the Internet is changing all the rules. The commercial use of the Internet tests the limits of traditional, territorial-based commercial law. The Internet knows no boundaries. To paraphrase Gertrude Stein, as far as the Internet is concerned, not only is there perhaps “no there there,” the “there” is everywhere there is Internet access—essentially anywhere on the globe. When business is transacted over a computer network via a Website accessed by a computer in, for instance, Massachusetts, it takes place as much in Massachusetts, literally or figuratively, as it does anywhere else.
This revolutionary change is highly significant. Physical boundaries typically have framed legal boundaries, in effect creating signposts that warn that we will be required after crossing to abide by different rules. But the strength of the Internet is chaos (the essential absence of central control), which defies most conventional notions of boundaries. To impose traditional territorial concepts on the commercial use of the Internet has dramatic implications, opening the Web user up to inconsistent regulations throughout fifty states, indeed, throughout the globe. It also raises the possibility of dramatically chilling what may well be the most participatory marketplace of mass speech that this country—and indeed the world—has yet seen.
As noted above, the ease with which information can be published over the Internet and the concomitant ease with which such information can be updated, modified, or deleted create a novel set of disadvantages for those who use the Internet for information publication purposes. For example, information concerning what was published, when it was published, and where it was made available can be important to, and sometimes determinative of, issues such as, for instance, pricing and other disputes involving advertisements published over the Web and personal jurisdiction over the Web publishing entity, to name just a few. In such cases, it is critical for the Web publisher to maintain regular and accurate records of the publications it publishes on the Web.
As used herein, the term “Software Configuration Management” (“SCM”) means the process of identifying, defining, recording and reporting the configuration of items in a system and the change requests. SCM also means controlling the releases and changes of the items throughout the life cycle of a Web page. The term SCM is used herein synonymously with the term “Web page change tracking” and variations thereof.
One commonly used, commercially available Web page change tracking tool is the CLEARCASE® family of software sold by Rational Software Corporation of Cupertino, California. CLEARCASE® and similar tools track editing of Web pages so the developer of the Web page knows what changes have been made relative to previous versions of the Web page, but, as far as the present inventor is aware, CLEARCASE® does not provide detailed information about when a particular version of the Web page was published over the Web. As such, CLEARCASE® is not useful for generating the regular, accurate records of Web page content that comprise one of the principal advantages of the present invention.
Another approach to Web page change tracking is described in U.S. Pat. No. 6,029,175, issued Feb. 22, 2000, to Chow et al. Chow describes an intelligent network agent, referred to as a Revision Manager, which provides notification to Web users of changes to designated Web pages. The Revision Manager of Chow is interposed between standard HTTP browsers and HTTP servers. Chow's Revision Manager monitors designated Web pages and, when changes are detected, saves the modified document to a central cache that is accessible to many users of the Revision Manager. Chow's Revision Manager is principally directed to notifying Web users of modifications to Web pages of interest, but Chow does not teach record keeping of the times and dates during which each version of a Web page was published over the Web.
Yet another set of approaches to Web page change tracking has been developed by Matthew Freivald and others working in association with NetMind Technologies, Inc., of San Jose, Calif. Freivald et al. have obtained a series of U.S. patents relating to these change tracking technologies, which are discussed generally and specifically below.
In general, Freivald tracks Web page changes through periodic polling of the Web pages to be tracked. Freivald's tracking of Web pages is directed to use by a Web page user/reader. A Web page user registers the URL of a Web page of interest and provides an e-mail address for notification of changes. The Web page user can also specify sections of the subject Web page and other parameters with respect to which he or she wants notification of changes. The Web page at each registered URL is periodically retrieved and a signature for that version of the page is generated. The signature is stored in a history table so that each time the Web page is retrieved and the signature is generated, the signature can be compared against the other signatures stored in the history table in order to determine whether changes have occurred and whether such changes meet the user's parameters for notification. The advantages of Freivald's tracking are twofold: (1) little storage space is required because only the signatures of the Web pages are stored rather than the entire Web pages; and (2) the user only receives notification of changes that meet his or her specified parameters, so he or she is not overwhelmed with notifications of relatively unimportant changes.
U.S. Pat. No. 5,898,836, issued Apr. 27, 1999, to Freivald et al., describes Freivald's basic invention in which the signatures of the Web pages are generated by a change-detection server using a “Cyclic-Redundancy-Check” (“CRC”) checksum procedure.
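The signature-and-history-table approach described above can be illustrated, in general terms only, by the following minimal Python sketch. It is not Freivald's implementation; the function names, the use of CRC-32 from the standard `zlib` module, and the simplification of comparing only against the most recent stored signature are all assumptions made for illustration.

```python
import zlib


def signature(page_bytes: bytes) -> int:
    """Return a CRC-32 checksum of the page content as its signature."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def record_and_detect(history: list, page_bytes: bytes) -> bool:
    """Append the new signature to the history table and report a change.

    Simplifying assumption: the new signature is compared only against
    the most recently stored one. The first retrieval is never a change.
    """
    sig = signature(page_bytes)
    changed = bool(history) and history[-1] != sig
    history.append(sig)
    return changed


# Hypothetical page contents retrieved on three successive polls.
history = []
first = record_and_detect(history, b"<html>Price: $10</html>")
second = record_and_detect(history, b"<html>Price: $10</html>")
third = record_and_detect(history, b"<html>Price: $12</html>")
print(first, second, third)
```

Note that only the integer signatures are retained; the page content itself is discarded after each poll, which is the source of the storage savings and also the reason the earlier versions of the page cannot be recovered from such a table.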
U.S. Pat. No. 5,978,842, issued Nov. 2, 1999, to Noble and Freivald, describes Freivald's invention wherein the detection of changes is performed by a client-side change-detection application downloaded and installed on the computers of users. As more users are registered for a Web page, change detection is performed more frequently.
U.S. Pat. No. 5,983,268, issued Nov. 9, 1999, to Freivald et al., describes a user interface for the change-detection tool comprising a spreadsheet displayed to the user, in which the user can specify his or her notification parameters by entering parameters, formulae, etc. The user's formulae are applied to fields retrieved from the subject Web page and automatically entered in the spreadsheet, whereby the determination is made as to whether or not to notify the user of a change.
U.S. Pat. No. 6,012,087, issued Jan. 4, 2000, to Freivald et al., describes a change-detection tool which monitors the frequency of e-mail notifications sent to a user and, if the user is receiving too many e-mail notifications, the invention uses criteria based on HTML header information, rather than a checksum signature, to determine whether or not to send the user a change notification e-mail.
U.S. Pat. No. 6,219,818, issued Apr. 17, 2001, to Freivald et al., describes a change-detection web server in which Web pages are divided into HTML-bounded sections, and the user is enabled to specify that he or she only wants to be notified of changes occurring in certain HTML-bounded sections.
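Section-level change detection of this general kind might be sketched as follows. This is an illustrative Python sketch only, not the method of the cited patent: it assumes, hypothetically, that sections are non-nested `<div id="...">` blocks, and the function names are invented for this example.

```python
import re
import zlib

# Assumption: sections are non-nested <div id="..."> ... </div> blocks.
SECTION_RE = re.compile(r'<div id="([^"]+)">(.*?)</div>', re.DOTALL)


def section_signatures(html: str) -> dict:
    """Map each section id to a CRC-32 checksum of that section's content."""
    return {
        sid: zlib.crc32(body.encode("utf-8"))
        for sid, body in SECTION_RE.findall(html)
    }


def changed_sections(old_html: str, new_html: str, watched: set) -> set:
    """Return the watched section ids whose signature differs between versions."""
    old = section_signatures(old_html)
    new = section_signatures(new_html)
    return {sid for sid in watched if old.get(sid) != new.get(sid)}


# Hypothetical page versions: only the "news" section differs.
old_page = '<div id="price">$10</div><div id="news">Sale ends Friday</div>'
new_page = '<div id="price">$10</div><div id="news">Sale ends Monday</div>'

price_changes = changed_sections(old_page, new_page, {"price"})
news_changes = changed_sections(old_page, new_page, {"news"})
print(price_changes, news_changes)
```

A user watching only the "price" section would receive no notification here, while a user watching "news" would.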
Freivald's Published U.S. patent application Ser. No. 20020013825, published Jan. 31, 2002, describes a change-detection tool in which the user is only notified when new, unique content appears on the subject Web page. Detected changes are compiled into a periodic report that is sent to the user. In addition, user profile information is collected.
Freivald's inventions do not, however, provide a Web publisher a regular, accurate record of when each version of a Web page has been published. Instead, Freivald's inventions are primarily directed to notifying Web users of changes to Web pages in which such users have an expressed interest. Indeed, one of the main advantages of Freivald's change-detection tool is that the amount of storage space required is minimized by purposely not recording each version of the subject Web page. Therefore, Freivald's inventions are not useful to a Web publisher for whom it is critical to have a true and correct copy of each version of its Web page, including information concerning the times and dates during which each version of the Web page was published on the Web.
It is to be understood that numerous means of monitoring Web pages for changes are known now and many more undoubtedly will be developed in the future. The Chow and Freivald patents discussed above describe change detection for Web pages. Additional Web page change detection methods and apparatus are described in U.S. Pat. No. 6,119,124, issued Sep. 12, 2000, to Broder et al., U.S. Pat. No. 6,324,555, issued Nov. 27, 2001, to Sites, and Ohkado et al.'s Published U.S. patent application Ser. No. 20010016873, published Aug. 23, 2001.
It is apparent from the foregoing that a need exists for improved published Web page version tracking. Specifically, a need exists in the art for improved ways to track the content of Web pages to provide accurate information concerning the time period of publication of each version of a Web page.
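By way of illustration only, the kind of record contemplated by the foregoing need, namely a copy of each version of a Web page together with the time period during which it was observed to be published, might be represented as in the following Python sketch. This is not a description of any particular implementation; the class name, field names, and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VersionRecord:
    content: str          # full copy of this version of the page
    first_seen: datetime  # earliest observation of this version
    last_seen: datetime   # latest observation of this version


def observe(log: list, content: str, when: datetime) -> None:
    """Record one observation of the published page at time `when`."""
    if log and log[-1].content == content:
        # Same version still published: extend its publication period.
        log[-1].last_seen = when
    else:
        # New version detected: start a new record.
        log.append(VersionRecord(content, when, when))


# Hypothetical observations of a page on three successive days.
t1, t2, t3 = datetime(2002, 3, 1), datetime(2002, 3, 2), datetime(2002, 3, 3)
log = []
observe(log, "Price: $10", t1)
observe(log, "Price: $10", t2)
observe(log, "Price: $12", t3)
print(len(log))
```

Unlike a signature-only history, such a log retains the content of every version, at the cost of additional storage.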