1. Field of the Invention
The invention relates to client-server data communication systems. More particularly, it relates to repairing links between pages in a client-server network.
2. Description of the Related Art
The Internet, as it is popularly known, has become an important and useful tool for accessing a wide variety of information. One component of the Internet is the World Wide Web (hereinafter the web). In recent years the web has become an increasingly popular vehicle for providing information to virtually anyone with access to the Internet. Many websites have been established to provide, over the web, information in many different forms, such as text, graphics, video and audio information.
A typical website is hosted on a network server computer that includes application software programs. The server, also known as a web server, is connected to the Internet. By connecting the web server to the Internet, clients that are connected to the Internet can access the website via the web server. Usually, a client is located remotely from the web server, although the client and the server can be at the same location. A web server also can be connected to a private intranet, as opposed to or in addition to the public Internet, in order to make a website privately available to clients within an organization.
A client-server communication system used on the web is shown in FIG. 1. The system includes a client 2 that is connected to a monitor 4 and to a network 6, such as the Internet. The client sends and receives messages over the network 6 to and from a server, such as web server 8, web server 10, or web server 12 shown in FIG. 1. The web servers host web sites that include one or more web pages. For example, web server 8 hosts web site 13 that contains a series of web pages 14a through 14n. Similarly, web server 10 hosts web site 15 that contains a series of pages 16a through 16n, and web server 12 hosts web site 17 that contains a series of pages 18a through 18n. A uniform resource locator (URL) is a string that gives information about the location of a particular resource (such as a file, image, or program) on the Internet. Generally, each web page has a unique URL.
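By way of illustration only, the location information carried by a URL can be separated into its parts with a short Python sketch. The host name `server10.example`, the path, and the helper name `describe_url` are hypothetical and are not taken from FIG. 1.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the parts that locate a resource."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # protocol used to reach the server
        "host": parts.hostname,   # which web server hosts the resource
        "path": parts.path,       # where the resource sits on that server
    }


# Hypothetical URL standing in for a page such as web page 16a.
info = describe_url("http://server10.example/docs/page16a.html")
```

Here the host portion identifies the web server and the path portion identifies the resource on that server; a change to either one changes which resource, if any, the URL locates.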
The client 2 accesses web pages on a website by using a web browser 20; that is, a software program that runs on the client and receives from the server information formatted in a known manner. A very popular format for information sent over the web from a server to a client is the HyperText Markup Language (HTML).
A web server typically takes user input, in the form of a URL, and returns the file(s) that correspond to the requested web page. This process begins with the client browser sending a request to the web server indicated by the URL. Once the web server receives the client request, it locates the file, or executes the program, specified by the URL and sends the result back to the client browser. The file(s) making up the web page that has been delivered to the client are held in a cache memory for use by the browser 20. Web page 22 shown in FIG. 1 represents a web page that is stored in the browser's cache. The browser interprets the HTML code in the web page to generate a display 24 on monitor 4. If the web server encounters a problem while processing the client browser's request, it returns an error code.
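The request, response, and caching sequence described above can be sketched in Python as follows. This is a simplified in-memory model under stated assumptions, not an actual web server or browser: the names `SITE_13`, `web_server`, and `Browser`, the host `server8.example`, and the use of status codes 200 and 404 to stand for success and error are all chosen for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical contents of a web site: URL path -> HTML file contents.
SITE_13 = {"/page14a.html": "<html><body>Page 14a</body></html>"}


def web_server(url):
    """Locate the file named by the URL; return (status code, body)."""
    path = urlsplit(url).path
    if path in SITE_13:
        return 200, SITE_13[path]
    return 404, ""  # error code: no file corresponds to the URL


class Browser:
    """Toy client that sends requests and caches delivered pages."""

    def __init__(self, send_request):
        self.send_request = send_request
        self.cache = {}  # URL -> page held for use by the browser

    def open(self, url):
        status, body = self.send_request(url)
        if status == 200:
            self.cache[url] = body  # only delivered pages are cached
        return status


browser = Browser(web_server)
ok = browser.open("http://server8.example/page14a.html")
missing = browser.open("http://server8.example/no-such-page.html")
```

In this sketch the first request succeeds and the page is placed in the cache, while the second request yields an error code and nothing is cached.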
One web page on the Internet can reference another web page on the Internet through the use of URL links. These links are basically URL strings contained within special HTML tags. When the user clicks on such a link, the client browser requests from a web server the resource specified by the URL and displays that resource, such as an HTML web page file, on the client browser. Here, for purposes of illustration, referring to FIG. 1, assume the web page 22 held in the browser's cache and displayed as page 24 came from web site 13 in web server 8. The web page display 24 includes two hypertext links to other web pages held on different servers. URL link B (“link B”) 26 contains the URL of web page 16a stored in web server 10. URL link C (“link C”) 28 contains the URL of web page 18a stored in web server 12. If a user selects link B 26, the browser sends a message to the web server 10 to return the web page corresponding to the URL of link B. Here, web server 10 returns an HTML copy of web page 16a to client 2. Similarly, if a user selects link C 28, the browser sends a message to the web server 12 to return the web page corresponding to the URL of link C. Here, web server 12 returns an HTML copy of web page 18a to client 2.
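As an illustration of URL strings contained within HTML tags, the following Python sketch collects the link URLs from a page using the standard library `html.parser` module. The page text and the host names `server8.example`, `server10.example`, and `server12.example` are hypothetical stand-ins for web page 22 and the servers of FIG. 1, and the sketch assumes that links appear as `href` attributes of anchor (`<a>`) tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the URL of every <a href="..."> tag in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page with two links, in the manner of link B and link C.
page_22 = """<html><body>
<a href="http://server10.example/page16a.html">link B</a>
<a href="http://server12.example/page18a.html">link C</a>
</body></html>"""

links = extract_links(page_22, "http://server8.example/page14a.html")
```

Each collected URL is what the browser would request if the user selected the corresponding link.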
A commonly encountered problem with many web pages is that the hypertext links on those pages might become stale, or broken, such that the URL within the hypertext link no longer refers to the location of a web page. The problem of broken links, also known as linkrot, occurs commonly on sites and pages throughout the web. Web surfers find broken links to be annoying and usually tend to avoid sites that have many broken links. For web page authors, fixing broken links can be tedious and labor intensive.
A URL link can be considered to be broken when, for example:
1. The file specified by the URL has been renamed in the web server.
2. The file specified by the URL has been deleted in the web server.
3. The location of the file on the web server has been changed.
Under any one of these circumstances the web server returns an error message (e.g., error code 404) back to the client browser.
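The three circumstances listed above can be illustrated with a minimal Python sketch in which a web site is modeled as a mapping from URL paths to file contents. The function name `serve`, the host name, and the paths are hypothetical; a real web server resolves paths against a file system and configuration not shown here.

```python
from urllib.parse import urlsplit


def serve(site_files, url):
    """Return (status code, body) for a URL against a path -> content map."""
    path = urlsplit(url).path
    if path in site_files:
        return 200, site_files[path]
    return 404, "Not Found"  # the URL no longer refers to any file


# Hypothetical link stored on some other page.
link = "http://server10.example/docs/page16a.html"

original = {"/docs/page16a.html": "<html>16a</html>"}
renamed = {"/docs/page16a-v2.html": "<html>16a</html>"}   # circumstance 1
deleted = {}                                               # circumstance 2
moved = {"/archive/page16a.html": "<html>16a</html>"}      # circumstance 3

statuses = {
    "original": serve(original, link)[0],
    "renamed": serve(renamed, link)[0],
    "deleted": serve(deleted, link)[0],
    "moved": serve(moved, link)[0],
}
```

In each of the three cases the link itself is unchanged; only the state of the server differs, which is why the page containing the link gives no indication that the link has become broken.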
Broken links are annoying to users and are common on the World Wide Web. A 1997 World Wide Web user survey rated broken links as the most frequent problem encountered by users.
Fixing broken links is a significant inconvenience for web developers. It is a task that is carried out manually and hence is labor intensive and time consuming. Despite the fact that broken links are regarded as one of the most serious problems on the World Wide Web, no definitive solution that solves the problem once and for all has yet been developed.
Proposed solutions to date are difficult to implement and do not operate automatically. One such solution recommends that web developers follow the rules set forth below to prevent broken links.
1. Check the web page links frequently and fix them to reduce outbound linkrot.
2. Keep old pages on the server forever and, if moving pages, place a redirect link on the old page.
Web developers often either are not aware of such rules or simply do not follow them, as illustrated by the large number of broken links on the World Wide Web. Accordingly, there is a long-felt but as yet unsolved need to automatically detect and fix broken links.
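For illustration only, the two rules listed above can be sketched in Python against an in-memory model of a web site. This is not a method described in this document. The function names, the paths, and the modeling of a redirect as a status code 301 response are assumptions; the rules themselves speak only of placing a redirect link on the old page.

```python
def serve_with_redirects(site_files, redirects, path):
    """Rule 2 (as modeled here): an old path answers with a redirect."""
    if path in site_files:
        return 200, path
    if path in redirects:
        return 301, redirects[path]  # points the client at the new path
    return 404, None


def check_links(links, fetch_status):
    """Rule 1: report the links whose status code signals an error."""
    return [url for url in links if fetch_status(url) >= 400]


# Hypothetical site in which one page was moved and one was removed.
site = {"/new/page16a.html": "<html>16a</html>"}
redirects = {"/old/page16a.html": "/new/page16a.html"}
links = ["/new/page16a.html", "/old/page16a.html", "/gone.html"]

broken = check_links(
    links, lambda path: serve_with_redirects(site, redirects, path)[0]
)
```

Under these assumptions the moved page is not reported as broken because the old path still answers, whereas the removed page is reported. The check only detects the broken link; updating the page that contains it remains a manual step, which is the labor the rules leave to the web developer.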