1. Field of the Invention
The present invention relates to a network quality control system that automatically validates World Wide Web pages and other hypertext documents. More particularly, the present invention relates to a software system and associated method for the automatic validation and repair of web pages, the automatic identification of web page authors using a probabilistic approach, and the automatic notification of those authors of the structural errors in their web pages.
2. Description of Related Art
The World Wide Web (WWW) is an open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. WWW (or web) documents are traversed in segments related to one another by hypertext links, otherwise known as “links”. Hypertext links allow a user to view available digitally encoded document information in a non-sequential manner. Using hypertext links, the user can jump from one location in a document to another location, document, or web site.
With the explosive growth of the WWW, users increasingly create their own web pages without resorting to professional assistance. As a result, the number of web authors who lack familiarity with WWW standard language specifications also increases. These web authors are often unaware of the importance of producing valid documents or even of the existence of standard WWW specifications (such as the language specifications for HTML). Hence, a significant number of HTML pages on the WWW do not conform to the published standards (refer for example to the following site http://www.w3.org). These documents contain structural errors. A few exemplary factors that contribute to the introduction of errors in the creation of web pages are listed below:
    Most web browsers are very forgiving of malformed HTML documents and attempt to render pages regardless of how badly formed they are. These browsers compensate for errors according to their own methods. As a result, viewed through one type of web browser, structurally erroneous HTML documents may appear to achieve the web author's design goals even while not adhering to standard specifications.
    An increasing number of web authors use visual tools to automatically generate HTML documents. As a result, the extent to which the document complies with HTML standards depends on the quality of the tool. Unfortunately, these tools range in quality, and many do not adhere to the standard specifications. Moreover, as HTML specifications evolve over time, outdated versions of tools will generate documents using deprecated and hence non-valid HTML features.
Consequently, a significant number of published web documents contain HTML errors. The pervasiveness of such structural errors in HTML documents limits the utility and versatility of the data contained within them. These structural errors preclude valuable content information from being properly processed by web agents and thereby pose barriers that limit web accessibility. For example, user agents, such as specialized voice browsers for the blind, often cannot adequately parse malformed HTML documents. As a result, they might be unable to properly render these documents for blind users. Similarly, data extraction agents, such as those used by search engines to index web documents, often cannot fully access and process valuable content and metadata information in malformed HTML documents. As a result, the search engines could fail to index the pages optimally. These problems of access are particularly troublesome to web site owners or companies that rely on their web sites to provide critical information and services to customers and business partners.
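By way of illustration only, the kind of structural error discussed above (for example, improperly nested or unclosed elements) can be detected with a very small program. The following sketch is not part of any system described herein; it assumes Python's standard `html.parser` module, and the names `StructureChecker`, `VOID_ELEMENTS`, and `find_structural_errors` are hypothetical. It checks tag nesting only and treats every omitted end tag as an error, which is stricter than the HTML specifications, since they permit some end tags to be omitted.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (illustrative list).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class StructureChecker(HTMLParser):
    """Hypothetical helper: records nesting errors while parsing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.errors = []      # human-readable error messages

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.open_tags:
            self.errors.append(f"unexpected </{tag}>")
            return
        # Pop back to the matching start tag, reporting skipped elements.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.errors.append(f"<{top}> not closed before </{tag}>")


def find_structural_errors(html_text):
    """Return a list of nesting errors found in html_text."""
    checker = StructureChecker()
    checker.feed(html_text)
    checker.close()
    errors = list(checker.errors)
    # Anything still on the stack was opened but never closed.
    errors.extend(f"<{t}> never closed" for t in reversed(checker.open_tags))
    return errors


if __name__ == "__main__":
    malformed = "<html><body><b><i>text</b></i></body></html>"
    for message in find_structural_errors(malformed):
        print(message)
```

A forgiving browser would typically render the malformed sample above as bold italic text without complaint, whereas a stricter agent relying on well-formed nesting could misinterpret it.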
In an effort to address the problem and to promote the valid use of HTML in web documents, several methods and systems have been proposed and made available, for example, at the following WWW sites:
    http://netmechanic.com;
    http://websitegarage.com;
    http://www2.imagiware.com/RxHTML;
    http://www.w3.org/People/Raggett/tidy;
    http://www.submit4less.com; and
    http://watson.addy.com.
However, these HTML validating services require web authors to have a priori knowledge of the existence of, and the value of adherence to, the published standards. None of the conventional validation services proactively seeks out malformed web pages over the entire WWW. For web authors to test their documents using these services, the authors must either register their web sites directly with the services or submit each URL manually. Hence, web authors who are unaware of the importance of valid HTML are unlikely to use these validation services. As a result, the majority of published HTML documents that contain structural errors will remain erroneous indefinitely, without their authors' awareness.
There is therefore a great and still unsatisfied need for a network quality control system that proactively performs an automatic validation of the documents on the WWW, that automatically repairs non-conformant web pages, and that automatically finds the authors and notifies them of errors in their documents.