The field of the present disclosure relates generally to managing website content, and more particularly, to methods and systems for detecting orphan content within websites.
Many entities create and maintain a network of client systems and server systems to facilitate transferring electronic files from the server systems to the client systems. Typically, a user accesses electronic files stored by the server systems using a network-enabled client system, for example, a computer. The content stored by the server systems may include electronic documents, electronic files, and/or other forms of electronic data. A hierarchical system of directories is often used to organize the content. The content may also be organized and made accessible to users by interlinking the content, for example, using a website. A website is a collection of web pages that can be accessed and viewed using a web browser. Typically, client systems include web browsers. When accessed by the web browser, the web pages display information for use by users who are allowed access to the network and facilitate interaction between the client system and the server system.
For example, many organizations create and maintain an internal organization website for use by users such as employees, contractors, and vendors. Each department of a multi-department organization may create and store electronic documents on the server system. Those documents may be organized and made accessible to the users by adding one or more web pages to the organization website that include links to the documents. Some organization websites are known to include thousands of web pages and thousands of documents, interconnected by tens of thousands of links. Typically, each web page and each electronic document exists as a separate entity, each of which is identified by a unique address on the network called a Uniform Resource Locator (URL). Embedded within a first web page may be a link to a second web page or to a document. In this example, the first web page is referred to as a source file and the second web page or document is referred to as a target file. More specifically, the link embedded in the source file includes a URL which points to the target file. If the link is functioning properly, when the user selects the link while viewing the source file using the client system, the user is then provided with the target file via the client system.
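The relationship between a source file and its target files can be illustrated with a minimal sketch, assuming the source file is an HTML page whose links are ordinary `<a href="...">` elements. The class name `LinkExtractor`, the function name `extract_targets`, and the example URLs are hypothetical and are used only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> link in one source file."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the source file's own URL.
                self.targets.append(urljoin(self.source_url, value))


def extract_targets(source_url, html):
    """Return the target URLs that the given source file links to."""
    parser = LinkExtractor(source_url)
    parser.feed(html)
    parser.close()
    return parser.targets


if __name__ == "__main__":
    page = '<a href="policy.pdf">Policy</a> <a href="/it/home.html">IT</a>'
    for target in extract_targets("http://intranet.example/hr/index.html", page):
        print(target)
```

Each URL returned is one target file of the source file; a link is functioning properly only if a file actually exists at that URL.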
If the target file has been moved or added to the server system, or if the URL of the target file has been changed, there may be no link that points to the target file. In such a case, a user has no way to reach the data in the target file through the website, because no link is available for the user to select. A target file of this kind is referred to as orphan content. Orphan content causes user frustration and workplace inefficiency. When a target file is inaccessible, the value of the target file is reduced because the information it contains cannot be reached. Locating the orphan content within a website allows links to be restored, either by editing the source files or by changing the URLs of the target files to match the source file links.
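The notion of orphan content described above can be expressed as a set difference: the files stored on the server system, minus the files that at least one link points to. The following is a minimal sketch under that reading, not a description of any particular product or of the disclosed method. The function names, the `entry_points` parameter (pages such as a home page that users reach without following a link), and the example URLs are all assumptions introduced for illustration.

```python
def find_orphans(stored_urls, links, entry_points=()):
    """Return stored files that no link points to.

    stored_urls:  URLs of every file held by the server system.
    links:        (source_url, target_url) pairs found in the source files.
    entry_points: URLs users reach directly, so they are not counted as orphans.
    """
    linked = {target for _source, target in links}
    return sorted(set(stored_urls) - linked - set(entry_points))


def find_dangling_links(stored_urls, links):
    """Return links whose target URL does not match any stored file."""
    stored = set(stored_urls)
    return [(source, target) for source, target in links if target not in stored]


if __name__ == "__main__":
    stored = ["/home.html", "/hr/policy.pdf", "/hr/benefits-2.pdf"]
    links = [
        ("/home.html", "/hr/policy.pdf"),
        ("/home.html", "/hr/benefits.pdf"),  # stale URL: the file was renamed
    ]
    print(find_orphans(stored, links, entry_points=["/home.html"]))
    print(find_dangling_links(stored, links))
```

In the usage example, the renamed file appears as an orphan and the stale link appears as a dangling link. Pairing the two is what allows a link to be restored, either by editing the source file or by changing the URL of the target file.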
Software is currently available for checking the validity of hypertext links embedded within web pages. Typically, a spider technology is used to “crawl” an intranet or Internet web. Spider software is initialized by a user to begin on a certain web page (i.e., a first active web page). The software parses the first active web page for a link. Once the software identifies the link, the software selects the link, closing the first active web page and opening the target web page, which becomes a second active web page. The software then parses the second active web page to identify a link. Once a link is found in the second active web page, the software selects the link and the target web page associated with the link becomes a third active web page. The software operates under an assumption that the web pages being analyzed are sufficiently interconnected to ensure the software parses all of the web pages. Because the software moves to a new web page as soon as a link is identified and relies on this assumption, crawling does not ensure that all web pages are parsed, and also does not ensure that all links within each web page are analyzed.
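The limitation of link-following can be seen in a small sketch. The code below is not the spider software described above; it is a more thorough breadth-first variant that follows every link on every page it reaches, run over a hypothetical in-memory link map rather than a live website. Even so, it only ever visits pages that are connected to the start page. The function name `crawl` and the page names are assumptions for illustration.

```python
from collections import deque


def crawl(start, outgoing):
    """Return the set of pages reachable from `start` by following links.

    outgoing: dict mapping each page to the list of pages it links to.
    """
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in outgoing.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    # Hypothetical site: "c" and "d" are not connected to "home" by any path.
    site = {
        "home": ["a", "b"],
        "a": ["home"],
        "b": [],
        "c": ["d"],
        "d": [],
    }
    reached = crawl("home", site)
    print(sorted(reached))              # pages the crawler parses
    print(sorted(set(site) - reached))  # pages the crawler never sees
```

Pages that no path from the start page reaches are never parsed, so a crawler of this kind cannot report them, which is precisely the case that arises with orphan content.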
It would be desirable for users to be able to automatically detect orphan content in a large data tree structure on a regular basis to ensure efficient utilization of memory and computing resources.