In recent years there has been tremendous growth in data communication networks such as the Internet, Intranets, Wide Area Networks (WANs) and Metro Area Networks (MANs). These data communication networks offer highly efficient means of organizing and distributing computerized data, which has resulted in their widespread use for both business and personal applications. For example, the Internet is now a common medium for operating online auctions, academic and public forums, distributing publications such as newspapers and magazines, supporting business communications, performing electronic commerce and electronic mail transactions, and offering government services.
However, the tools needed to offer and support such services have not kept pace with the growth and demand. The Internet is now pervasive in industrialized countries, and it is a necessity for any large organization to have an Internet presence. Some large corporations and government agencies, for example, maintain Web sites with millions of Web pages whose content changes daily, yet they do not have the tools to manage this massive data system efficiently.
Before discussing the specific nature of these problems, it is necessary to set up the framework for discussion.
FIG. 1 presents an exemplary layout of an Internet communications system 30. The Internet 32 itself is represented by a number of routers 34 interconnected by an Internet backbone 36, a network designed for high-speed transport of large amounts of data. Users' computers 38 may access the Internet in a number of ways. One is to modulate and demodulate data at audio frequencies over a telephone line, which requires a modem 40 and a connection to the Public Switched Telephone Network 42, which in turn connects to the Internet 32 via an Internet Service Provider 44. Another is to use set top boxes 50, which modulate and demodulate data at high frequencies that pass over existing telephone or television cable networks 52 and are connected directly to the Internet via a Hi-Speed Internet Service Provider 54. Generally, these high frequency signals are transmitted outside the frequencies of existing services passing over these telephone or television cable networks 52.
Web sites are maintained on Web servers 37, also connected to the Internet 32, which provide content and applications to the users' computers 38. Communications between the users' computers 38 and the rest of the network 30 are standardized by means of defined communication protocols.
FIG. 1 is a gross simplification; in reality, the Internet consists of a vast interconnection of computers, servers, routers, computer networks and public telecommunication networks. While the systems that make up the Internet comprise many different varieties of computer hardware and software, this variety is not a great hindrance as the Internet is unified by a small number of standard transport protocols. These protocols transport data as simple packets, the nature of the packet contents being inconsequential to the transport itself. These details would be well known to one skilled in the art.
While the Internet is a communication network, the World Wide Web (WWW, or simply “the Web”) is a way of accessing information over the Internet. The Web uses HTTP (one of several standard Internet protocols) to communicate data, allowing end users to employ their Web browsers to access Web pages.
A Web browser is an application program that runs on the end user's computer 38 and provides a way to look at and interact with all the information on the World Wide Web. A Web browser uses HTTP to request Web pages from Web servers throughout the Internet, or on an Intranet. Currently most Web browsers are implemented as graphical user interfaces. Thus, they know how to interpret the set of HTML tags within the Web page in order to display the page on the end user's screen as the page's creator intended it to be viewed.
A Web page is a data file that generally contains not only text, but also a set of HTML (HyperText Markup Language) tags that describe how text and images should be formatted when a Web browser displays them on a computer screen. The HTML tags include instructions that tell the Web browser what font size or colour should be used for certain contents, or where to locate text or images on the Web page.
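By way of illustration only, the following Python sketch shows how software might read the HTML tags in a Web page and pick out the hypertext link targets among them. The class name `TagCollector` and the sample page content are assumptions made for this example; only the standard library `html.parser` module is used.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records each opening tag and any hypertext link targets it carries."""

    def __init__(self):
        super().__init__()
        self.tags = []   # tag names, in document order
        self.links = []  # href values found on <a> tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical page content, invented for this example.
SAMPLE_PAGE = """
<html><body>
<h1>Welcome</h1>
<p>Read the <a href="news.html">news</a> or
<a href="contact.html">contact us</a>.</p>
<img src="logo.png">
</body></html>
"""

collector = TagCollector()
collector.feed(SAMPLE_PAGE)
print(collector.tags)
print(collector.links)
```

A Web browser performs a far more elaborate version of this step, since it must also lay out and render the content that the tags describe.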
The Hypertext Transfer Protocol (HTTP) is the set of rules for exchanging files on the World Wide Web, including text, graphic images, sound, video, and other multimedia files. HTTP also allows files to contain references to other files whose selection will elicit additional transfer requests (hypertext links). Typically, the HTTP software on a Web server machine is designed to wait for HTTP requests and handle them when they arrive.
Thus, when a visitor to a Web site requests a Web page by typing in a Uniform Resource Locator (URL) or clicking on a hypertext link, the Web browser builds an HTTP request and sends it to the Internet Protocol address corresponding to the URL. The HTTP software in the destination Web server receives the request and, after any necessary processing, the requested file or Web page is returned to the Web browser via the Internet or Intranet.
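The request-building step described above may be sketched as follows. This is a minimal illustration in Python, assuming a plain HTTP/1.1 GET request; the function name `build_get_request` and the example URL are hypothetical, and the sketch neither resolves the host name to an Internet Protocol address nor sends anything over a network.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return (host, port, request_text) for a plain HTTP GET of the URL."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80      # 80 is the default port for HTTP
    path = parts.path or "/"     # an empty path means the site root
    if parts.query:
        path += "?" + parts.query
    # The Host header carries the port only when it is not the default.
    host_header = host if port == 80 else f"{host}:{port}"
    request_text = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request_text


# Hypothetical URL, used only to show the shape of the request.
host, port, text = build_get_request("http://www.example.com/products/index.html")
print(text)
```

The Web server named in the request would parse this text, locate the requested file, and return it in an HTTP response.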
The Web is just one of the ways that information can be disseminated over the Internet. The Internet also supports other communication services such as e-mail, Usenet news groups, instant messaging and FTP (file transfer protocol).
A Web site is a collection of Web pages that are organized (and usually interconnected via hyperlinks) to serve a particular purpose. An exemplary Web site 60 is presented in the block diagram of FIG. 2. In this example, the Web site includes a main page 62, which is usually the main point of entry for visitors to the Web site 60. Accordingly, it usually contains introductory text to greet visitors, and an explanation of the purpose and organization of the Web site 60. It will also generally contain links to other Web pages in the Web site 60.
In this example, the main page 62 contains hypertext links pointing to three other Web pages. That is, there are icons or HTML text targets on the main page 62, which the visitor can click on to request one of the other three Web pages 64, 66, 68. When the visitor clicks on one of these hypertext links, his Web browser sends a request to the Internet for a new Web page corresponding to the URL of the linked Web page.
Note that the main Web page 62 also includes a “broken link” 70, that is, a hypertext link which points to a Web page which does not exist. Clicking on this broken link will typically produce an error, or cause the Web browser to time out because the target Web page cannot be found.
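As a rough sketch of how a broken link of this kind might be detected, the Python example below models a Web site as a table of pages and the link targets found on each, and reports every link whose target is not in the table. The URLs, the `SITE` table and the function name `find_broken_links` are assumptions made for illustration; a practical tool would instead request each target from the Web server and inspect the response.

```python
from urllib.parse import urljoin

# Hypothetical site contents: page URL -> link targets found on that page.
SITE = {
    "http://example.com/index.html": [
        "page64.html",
        "page66.html",
        "page68.html",
        "missing.html",   # points to a page that does not exist
    ],
    "http://example.com/page64.html": [],
    "http://example.com/page66.html": [],
    "http://example.com/page68.html": [],
}


def find_broken_links(site):
    """Return (source page, target URL) pairs whose target is not in the site."""
    broken = []
    for page_url, hrefs in site.items():
        for href in hrefs:
            # Resolve relative links against the page they appear on.
            target = urljoin(page_url, href)
            if target not in site:
                broken.append((page_url, target))
    return broken


for source, target in find_broken_links(SITE):
    print(f"{source} links to missing page {target}")
```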
Web page 64 includes hypertext links which advance the visitor to other parts within the same Web page 64. These links are referred to as “anchors”. Accordingly, a hypertext link to an anchor which does not exist would be referred to as a “broken anchor”.
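A broken anchor may be detected in a similar manner, by comparing the anchors that a Web page refers to against those it actually defines. The Python sketch below assumes that anchors are defined by an `id` attribute on any tag or a `name` attribute on an `<a>` tag, and that in-page links take the form `href="#name"`; the class and function names and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class AnchorChecker(HTMLParser):
    """Collects the anchors a page defines and the ones it links to."""

    def __init__(self):
        super().__init__()
        self.defined = set()     # anchor names present in the page
        self.referenced = []     # anchor names used by in-page links

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.defined.add(attributes["id"])
        if tag == "a":
            if attributes.get("name"):
                self.defined.add(attributes["name"])
            href = attributes.get("href") or ""
            if href.startswith("#") and len(href) > 1:
                self.referenced.append(href[1:])


def find_broken_anchors(html_text):
    """Return the in-page anchor names that are linked to but never defined."""
    checker = AnchorChecker()
    checker.feed(html_text)
    return [name for name in checker.referenced if name not in checker.defined]


# Hypothetical page: "pricing" is linked to but never defined.
SAMPLE_PAGE = """
<a href="#intro">Intro</a> <a href="#pricing">Pricing</a>
<h2 id="intro">Introduction</h2>
<a name="details"></a>
"""

print(find_broken_anchors(SAMPLE_PAGE))
```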
Web page 66 includes links to data files. These data files are shown symbolically as being stored on external hard devices 72, 74 but of course they could be stored in any computer or server storage medium, in any location. These data files could, for example, contain code and data for software applications, Java applets, Flash animations, music files, images, or text.
There is no limit to the number of interconnections that can be made in a Web site. Web page 68, for example, includes links to four other Web pages 76, 78, 80, 82, but it could be linked to any number of other Web pages. As well, chains of Web pages could also be linked together successively, the only limit to the number of interconnections and levels in the hierarchy being the practical considerations of the resources to store and communicate all of the data in the Web pages.
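The linked structure described above can be treated as a directed graph. The following Python sketch, offered only as an illustration, represents the pages of FIG. 2 by their reference numerals and uses a breadth-first traversal to compute how many links separate each page from the main page 62. The `LINKS` table and the function name `page_depths` are assumptions for this example, and the broken link 70 is omitted because it has no target page.

```python
from collections import deque

# Pages of FIG. 2, keyed by reference numeral: page -> pages it links to.
LINKS = {
    62: [64, 66, 68],
    64: [],
    66: [],
    68: [76, 78, 80, 82],
    76: [],
    78: [],
    80: [],
    82: [],
}


def page_depths(links, start):
    """Return the minimum number of link hops from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:      # first visit gives the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


depths = page_depths(LINKS, 62)
for page in sorted(depths):
    print(f"page {page}: level {depths[page]}")
```

Because each page is visited only once, the traversal terminates even if pages link back to one another, which is common in practice.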
As noted above, Web sites may have many, many pages. A large corporation or government, for example, may have to administer millions of Web pages which are almost constantly changing. This makes it extremely difficult for the Web site administrator to ensure that there are no content issues in the Web site, such as broken links. Tools do exist for analysing Web sites and locating such content issues (referred to herein as “content scanning”) but in a very large Web site, the amount of data with content issues may still be unmanageable.
Suppose, for example, that an error caused approximately one thousand Web pages on a particular Web site to fail. Running a content scan would identify the one thousand Web pages with content issues, but this would be of little assistance to the Web administrator. It would still take a tremendous amount of human resources to investigate each reported content issue and correct each Web page. In the meantime, visitors would not be able to find the Web pages they are looking for, and the Web site would operate in an unpredictable and ineffective manner. These content issues on a corporation's Web site could cause material losses, either due to liability incurred or lost business. Thus, while the content scan would help identify the problems, it would be of little assistance in resolving them; it would still take a long time before the Web site would be effective at all.
There is therefore a need for a means of making the analysis and correction of data distribution systems over the Internet and similar networks much more practical and effective. Such a system should be provided with consideration for the problems outlined above.