Internet usage has increased dramatically in the past few years and, as a result, so has the use of hypertext documents that contain hyperlinks. A hyperlink provides an address path to content that is related to or associated with text, graphics, video, audio, etc. Typically, a user utilizes a hyperlink by selecting it with a mouse. Specifically, the user positions a mouse cursor over the hyperlink (which is underlined, highlighted, displayed in a different color or otherwise distinguished in a manner that indicates that it is a hyperlink) and clicks the mouse to retrieve the content accessible through the hyperlink. Upon the user's selection of the hyperlink, a web browser program accesses the address path of the content that is referenced by the hyperlink. The address path of the content is usually represented by a uniform resource locator, or URL. The browser retrieves the content at the location indicated by the URL and renders the content to the user. For most web pages, this entails displaying the visual information of the web page on a video display. Audio information may also be retrieved and output. Hyperlinks may also initiate a transfer of content to the user's computer through the file transfer protocol (FTP) or from other sites.
FIG. 1 is a block diagram that illustrates the basic scheme that is employed in retrieving such content with a conventional web browser 110. The web browser 110 is run on a client computer system 112. The web browser 110 is used to generate a request 114 for the content from a server computer system 116. Typically, this request 114 is a GET request that complies with the hypertext transfer protocol (HTTP). It is understood that server 116 may provide the content or server 116 may access one or more additional servers to retrieve the content. For example, server 116 may be a local server which a user contacts to access the Internet. A second server may then be accessed to provide the content.
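As an illustration of the request 114 described above, the following minimal sketch builds the text of an HTTP/1.1 GET request for a given URL. The function name build_get_request and the example URL are assumptions made for illustration only; a real browser sends additional headers that are omitted here.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for the given URL.

    Hypothetical helper for illustration; assumes the URL carries no
    user name or password in its network-location part.
    """
    parts = urlsplit(url)
    # An empty path means the root resource of the server.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
```

For example, build_get_request("http://www.example.com/docs/page.html") yields a request whose first line is "GET /docs/page.html HTTP/1.1" and whose Host header names www.example.com.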
The server computer 116 receives the request 114, accesses the content 118 stored therein, and returns a copy of the content 120 to the client computer system 112. The content may be a new web page or may be a file obtained through an FTP process or a similar process. The web browser 110 includes code for rendering the content 120 so that the content is output to the user. Typically, for a web page, the copy of the content 120 is forwarded as hypertext markup language (HTML) content. The HTML content may contain a number of hyperlinks that enable the user to gain access to other web sites.
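To make concrete how hyperlinks are carried inside returned HTML content, the following sketch collects the URL of every anchor element in a page. It uses the HTML parser from the Python standard library; the names HyperlinkExtractor and extract_hyperlinks are assumptions introduced here, and only simple anchor (a) elements are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HyperlinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))


def extract_hyperlinks(html_text, base_url):
    """Return the hyperlink URLs found in html_text, in document order."""
    parser = HyperlinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.urls
```

Relative references such as /about.html are resolved against the URL of the page that contains them, so that each extracted URL can be requested on its own.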
Throughout the following discussion, for the purpose of clarity, conventional hypertext terminology will be used with respect to hyperlinks and the content they refer to, as opposed to the more general object terms of resources and references. However, as would be understood by those of skill in the art, the method of the present invention will work for any relationship between any objects.
Hypertext systems are normally window-based, and newly displayed content, in the form of documents for instance, generally appears in windows on the user's display. The new content will often contain more hyperlinks to other content. By following hyperlinks, the user is said to “navigate.” Hyperlinks present a powerful means to navigate within entire networks, and Internet navigation through the use of hyperlinks embedded within hypertext content is a well-established technology. While viewing hypertext content, the user can exercise a great deal of control over the order in which information is presented as well as play a very active role in selecting how far to pursue a given topic. Hyperlinks found within a web site can link to other content within that web site or to content located at remote sites.
A web site, which can have many hundreds of web pages linked together and to outside content with hyperlinks, is typically organized and maintained by an administrator. A web site administrator is often called a “webmaster.” Webmasters are responsible for, among other things, the accuracy of the hyperlinks embedded in the content on their web sites.
Problems arise in web site administration when hyperlinks fail to connect web site users to the expected target content. One difficulty encountered with hypertext content is that the hyperlinks embedded within the content may be unresolvable (i.e., not resolved to a web site). This typically results in the user receiving an “Error 404” or similar message. This message can appear on a user's screen when the user clicks on a hyperlink that fails to direct the user to content, but in any case the user's browser is notified of the error. In other words, the hyperlink directs the user to a URL that does not exist, and is therefore unresolvable or “broken.” Causes for such unresolvable hyperlinks include incorrectly configured hyperlinks (e.g., containing a typographical error), or, much more commonly, a change or deletion of the storage location of the content without a concurrent update of the hyperlinks contained within the referring page. In such cases, the web browser of the user returns an error message because the content is no longer located at the address path (URL) specified by the hyperlink. As a result, the user is unable to access the content referenced by the hyperlink.
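The error condition described above can be observed programmatically. The following sketch, which assumes the urllib modules of the Python standard library, requests a URL and treats any status code of 400 or above (including 404), or the absence of any response, as a broken hyperlink. The names fetch_status and is_broken and the ten-second timeout are assumptions chosen for illustration.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for a GET of url, or None if no response.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Malformed URL, unknown host, refused connection, timeout, etc.
        return None


def is_broken(status):
    """A hyperlink counts as broken if nothing came back or the status is an error."""
    return status is None or status >= 400
```

HTTPError is caught before URLError because it is a subclass of URLError; catching the general case first would hide the status code returned by the server.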
Similar difficulties may be encountered in different environments. Other references to objects or files may also be subject to changes that make them unresolvable. For example, hyperlinks and path names that refer to other files or objects may change. These references may also be, for example, object identifiers (object IDs) or other types of signatures that uniquely identify a file or object holding text or other media, such as audio data or video data. Unfortunately, such an object identifier, path name, or resource identifier may not be current. As a result, access to the resource may not be possible. Other errors may be returned to the user when a web site server is temporarily inoperable or overloaded. If web site users frequently click on hyperlinks that return error codes, then the web site will likely engender anxiety among its users, and could lead users to stop using it. In any of the above cases, the hyperlink that returns an error when activated is considered to be a “broken” link.
A related problem arises when a user clicks on a hyperlink which is, in fact, active, but which does not provide the expected content because the hyperlink has been reassigned to a new owner or the owner has substantially revised the content. A second related problem is when a link points to a site that has “moved” to a new URL, forcing the user to click one or more additional times to gain the desired access. In addition, instead of an error message, a server may provide content stating that the page could not be found and recommend actions to the user. None of these scenarios generates an error message, but all frustrate the user by providing content that is not desired by the user and/or delaying access to the desired content.
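Because these scenarios do not produce an error code, detecting them requires looking beyond the status of the response. The following sketch shows one possible heuristic: it flags a link as moved when the URL finally reached differs from the URL requested, and flags a likely "page not found" notice when the returned text contains certain phrases. The function name, the category labels, and the phrase list are all assumptions for illustration, and phrase matching of this kind can misclassify pages.

```python
# Phrases assumed, for illustration, to indicate a "not found" notice that
# was served without an error status. A real list would need tuning.
SOFT_ERROR_PHRASES = (
    "page not found",
    "could not be found",
    "no longer available",
)


def classify_response(requested_url, final_url, status, body_text):
    """Return 'broken', 'moved', 'soft-error', or 'ok' for one retrieval.

    status is the HTTP status code, or None if nothing was returned;
    final_url is the URL reached after any redirects were followed.
    """
    if status is None or status >= 400:
        return "broken"
    if final_url != requested_url:
        # The content was served, but from a different address.
        return "moved"
    lowered = body_text.lower()
    if any(phrase in lowered for phrase in SOFT_ERROR_PHRASES):
        # The server reported success but the page describes a failure.
        return "soft-error"
    return "ok"
```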
Currently, a webmaster verifies hyperlinks by either manually checking each hyperlink or being notified of broken hyperlinks through user-reported errors. Manually checking for broken hyperlinks is time-consuming and subject to human error. Relying on user-reported errors will not result in effective hyperlink verification due to low user reporting rates. What is needed in the art is a method or system with which a webmaster can verify with ease and confidence that the content referenced by the web site's hyperlinks is retrievable, that is, a method or system for verifying that a hyperlink will not cause an error code to be returned to the user. Additionally, since webmasters will frequently not want the referenced content of their web site's hyperlinks to be altered even if the link is an active (not broken) link, a method for verifying that active links refer to content that is consistent with the webmaster's and the users' expectations is also needed in the art.
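One way to picture the second need, checking that an active link still refers to the expected content, is to compare the text currently retrieved with a copy saved when the link was last known to be correct. The following sketch is only an illustration of that idea: the function names, the word-level comparison using the difflib module of the Python standard library, and the 0.5 threshold are all assumptions, not a method stated in this document.

```python
import difflib


def content_similarity(baseline_text, current_text):
    """Return a similarity ratio between 0.0 and 1.0 over the words of two texts."""
    baseline_words = baseline_text.split()
    current_words = current_text.split()
    matcher = difflib.SequenceMatcher(
        None, baseline_words, current_words, autojunk=False
    )
    return matcher.ratio()


def content_substantially_changed(baseline_text, current_text, threshold=0.5):
    """Flag content whose similarity to the saved baseline falls below threshold.

    The threshold of 0.5 is an arbitrary assumption for illustration.
    """
    return content_similarity(baseline_text, current_text) < threshold
```

Comparing words rather than raw characters keeps changes in whitespace or line wrapping from counting as revisions of the content.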
An exemplary embodiment of the invention is a method for verifying hyperlinks on a web site. The method includes generating a hyperlink database including a plurality of hyperlinks and a uniform resource locator associated with each hyperlink. An Internet browser application is initiated and the Internet browser application attempts to retrieve content in response to the uniform resource locator. A presence or absence of an error is detected in retrieving the content. A web site administrator is notified of the results and errors are reported.
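The steps of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the hyperlink database is modeled as an in-memory SQLite table, the table and function names are hypothetical, and the Internet browser application is replaced by a plain callable, retrieve, that takes a URL and returns an HTTP status code or None.

```python
import sqlite3


def build_hyperlink_database(links):
    """Create a database of hyperlinks from (label, url) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hyperlinks (id INTEGER PRIMARY KEY, label TEXT, url TEXT)"
    )
    db.executemany("INSERT INTO hyperlinks (label, url) VALUES (?, ?)", links)
    db.commit()
    return db


def verify_hyperlinks(db, retrieve):
    """Attempt retrieval for every stored URL and record whether an error occurred.

    retrieve(url) stands in for the browser; it must return an HTTP
    status code, or None when no response was obtained.
    """
    rows = db.execute("SELECT label, url FROM hyperlinks ORDER BY id").fetchall()
    report = []
    for label, url in rows:
        status = retrieve(url)
        error = status is None or status >= 400
        report.append({"label": label, "url": url, "status": status, "error": error})
    return report


def format_report(report):
    """Render the results as text suitable for sending to an administrator."""
    lines = []
    for row in report:
        verdict = "ERROR" if row["error"] else "ok"
        lines.append(f"{verdict}\t{row['status']}\t{row['url']}\t{row['label']}")
    return "\n".join(lines)
```

Passing the retrieval step in as a callable keeps the database and reporting steps independent of how content is actually fetched, so the same sketch can be exercised with a stub in place of a live network connection.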