The field of the present disclosure relates generally to detecting broken links between source pages and target pages, and more particularly, to a system configured to analyze all links included within source pages located in a selected directory.
Many corporations create and maintain a network of client systems and server systems to facilitate transferring electronic files from the server systems to the client systems. Typically, a user accesses electronic files stored by the server systems using a network-enabled client system, for example, a computer. The content stored by the server systems may include electronic documents, electronic files, and/or other forms of electronic data. A hierarchical system of directories is often used to organize the content. The content may also be organized and accessible to users by interlinking the content, for example, using a web site. The web site is a collection of web pages that can be accessed and viewed using a web browser. Typically, client systems include web browsers. When accessed by the web browser, the web pages display information for use by users who are allowed access to the network and facilitate interaction between the client system and the server system.
For example, many corporations create and maintain an internal corporate web site for use by users such as employees, contractors, and vendors. Each department of a multi-department corporation may create and store electronic documents on the server system. Those documents may be organized and made accessible to the users by adding one or more web pages to the corporate web site that include links to the documents. Some examples of corporate web sites are known to include thousands of web pages and thousands of documents, interconnected by tens of thousands of links. Typically, each web page and each electronic document exists as a separate entity, each of which is identified by a unique address on the network called a Uniform Resource Locator (URL). Embedded within a first web page may be a link to a second web page or to a document. In this example, the first web page is referred to as a source file and the second web page or document is referred to as a target file. More specifically, the link embedded in the source file includes a URL which points to the target file. If the link is functioning properly, when the user selects the link while viewing the source file using the client system, the user is then provided with the target file via the client system.
The link will not function properly if the target file has been removed from the server system or if the URL of the target file has been changed. Typically, if a non-functioning link is selected, the user will receive an error message at the client system. A link embedded within a source file that does not connect the source file to the target file is referred to herein as a “broken link.” Broken links cause frustration and work-place inefficiency. When a target file has been removed, replaced, altered, or moved without updating the source file links that reference the target file, the value of the target file is reduced due to decreased access to the target file. Locating the broken links within a web site allows the broken links to be repaired, either by editing the source files or by changing the URL of the target files to match the source file links.
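By way of illustration only, the relationship between a source file, its embedded links, and a broken link may be sketched as follows. The sketch assumes that source files are HTML and that the set of URLs currently present on the server system is already known. The names `LinkExtractor`, `find_broken_links`, and `existing_urls`, as well as the example URLs, are hypothetical and do not appear in the disclosure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a source file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_broken_links(source_url, source_html, existing_urls):
    """Return the target URLs linked from the source file that are not in
    existing_urls (a hypothetical stand-in for asking the server system)."""
    parser = LinkExtractor()
    parser.feed(source_html)
    broken = []
    for href in parser.links:
        # Resolve relative links against the URL of the source file.
        target_url = urljoin(source_url, href)
        if target_url not in existing_urls:
            broken.append(target_url)
    return broken


# Hypothetical source file with one working link and one broken link.
source_url = "http://intranet.example/hr/index.html"
source_html = '<a href="policy.html">Policy</a> <a href="/docs/old.pdf">Old</a>'
existing_urls = {"http://intranet.example/hr/policy.html"}

broken = find_broken_links(source_url, source_html, existing_urls)
print(broken)
```

In this sketch, the link to `/docs/old.pdf` is reported because no target file exists at the resolved URL, which corresponds to a target file that has been removed or moved without the source file being updated.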
Software is currently available for checking the validity of hypertext links embedded within web pages. Typically, a spider technology is used to “crawl” an intranet or Internet web. Spider software is initialized by a user to begin on a certain web page (i.e., a first active web page). The software parses the first active web page for a link. Once the software identifies the link, the software selects the link, closing the first active web page and opening the target web page, which becomes a second active web page. The software begins to parse the second active web page to identify a link. Once a link is found in the second active web page, the software selects the link and the target web page associated with the link becomes a third active web page. The software operates under an assumption that the web pages being analyzed are sufficiently interconnected to ensure the software parses all of the web pages. Where that assumption does not hold, crawling between web pages upon identification of a link does not ensure that all web pages are parsed, and also does not ensure that all links within each web page are analyzed.
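The limitation described above may be illustrated with a minimal sketch, which models a web site as an in-memory mapping from each page to the links it contains. The sketch is a simplified breadth-first variant of a spider and is not any particular commercial product. The `crawl` function and the page names are hypothetical.

```python
from collections import deque


def crawl(link_graph, start_page):
    """Follow links from start_page and return the set of pages parsed.

    link_graph maps each existing page to the list of targets it links to.
    """
    parsed = set()
    queue = deque([start_page])
    while queue:
        page = queue.popleft()
        # Skip pages already parsed and targets that do not exist.
        if page in parsed or page not in link_graph:
            continue
        parsed.add(page)
        queue.extend(link_graph[page])
    return parsed


# Hypothetical site: archive.html exists but no other page links to it.
site = {
    "home.html": ["hr.html", "it.html"],
    "hr.html": ["home.html"],
    "it.html": [],
    "archive.html": ["missing.pdf"],
}

parsed = crawl(site, "home.html")
never_parsed = set(site) - parsed
print(sorted(never_parsed))
```

Because no parsed page links to `archive.html`, the spider never reaches it, and the broken link to `missing.pdf` contained in that page is never analyzed.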