One of the most widely used hypertext systems in recent years has been the World Wide Web, which allows interlinked HTML (Hypertext Markup Language) documents to be transmitted between computers on the Internet using HTTP (Hypertext Transfer Protocol). Each document exists as a separate entity that can be identified by a unique address on the network called a URL (Uniform Resource Locator). This naming scheme allows one party to reference another's work by including a URL that points to the referenced work, so that a web site belonging to a first party links to a document of a second party.
A web site's value is measured by the availability, accuracy, relevance and reliability of the pages being linked to. When a document on the web site is removed, replaced, altered or moved, such value measurements can change for the worse. Therefore, making any change to a web site could have a detrimental effect on the value of that web site and on the value of other web sites that link to it.
The problem relates to web site maintenance, specifically the maintenance of pages that link to documents which subsequently move, change, disappear or are replaced. These interconnecting links form the backbone of the World Wide Web and are often a valuable business tool in forming alliances and cross-promotion.
There is a requirement for web site owners to be able to guarantee that their site is as up-to-date as possible, with invalid links and inappropriate content discovered and repaired quickly.
This is also a more general problem affecting any system which contains links or pointers between items of information, for example, entries in a relational database.
Tools do exist that crawl through HTML documents, either locally or over HTTP, and report broken links. Such a tool indicates to the web site owner that the document at a particular link's URL is no longer there. These tools cannot tell whether a link that still resolves points to the same page, and they give no guidance on whether the information has changed. Nor do they attempt to resolve broken links or identify new locations for moved content. In the particular case of HTTP, if a web site owner is aware that a linked-to document has moved, and knows where it has moved to, they can set up their site so that when the resource is accessed a '301 Moved Permanently' (or '302 Found') redirect response is sent. However, the onus is on the web site owner to find the new location of the page and to set up the redirection facility manually, and the web site administrator must allow this facility to be set up. A further problem for a web site administrator is that the content of the site is often owned by someone other than the administrator, yet complaints about broken links are more likely to come to the administrator, especially on an intranet.
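As an illustration, a minimal link checker of the kind described above can be sketched in Python. The class and function names here are invented for illustration; a real tool would issue HTTP requests, follow redirect `Location` headers, and handle many more status codes:

```python
# Sketch of a broken-link checker: extract links from HTML, then
# classify the HTTP status returned when each link is fetched.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def classify_status(code):
    """Map an HTTP status code to a coarse link-health category."""
    if code in (301, 302, 307, 308):
        return "redirected"   # document has moved; a checker would follow the Location header
    if code in (404, 410):
        return "broken"       # target is gone
    if 200 <= code < 300:
        return "ok"
    return "unknown"


extractor = LinkExtractor()
extractor.feed('<p><a href="http://example.com/doc.html">doc</a></p>')
print(extractor.links)        # ['http://example.com/doc.html']
print(classify_status(404))   # broken
```

Note that, exactly as the passage observes, a status of `ok` says nothing about whether the page still contains the originally intended content.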
The problem of broken links is so severe that Google™ (Google is a trademark of Google Technology Inc.) has taken to caching whole pages, which users can view when a search result is a broken link. Another Google facility finds documents similar to those located in a search. Although this is not specifically limited to broken links, it can be useful when a document is unavailable due to a broken link. 'Similar documents' in a Google search means other documents in the same category as the located document; Google specifically excludes very close matches to the located document.
One solution, U.S. Patent Publication US 2002/0169865, discloses a software agent called RevBot that detects a changed page and then triggers a central resource, typically a search engine network node, to reindex the changed page. The publication discloses how such agents are installed on a web site's computing platform and are aware of search engines and other qualifying databases and lists located at other nodes. Working in a manner that is the reverse of a search engine, a RevBot transmits data relating to the web site, such as a synopsis of recently changed content, to a remotely located search engine. When the web server changes a document, the RevBot requests that the search engine update its index. This helps the search provider and the users of that search engine. The RevBot can also be used to filter, block and enhance web site content.
Although the above description relates to a completely broken link, the problem also extends to a link which no longer returns the intended document.
An object of at least one of the embodiments is to assist a web site administrator and content owner in maintaining the integrity of the site's hyperlinks.
An object of at least one of the embodiments is to locate the information and URL document that the content owner originally intended to link to.
Another object of at least one of the embodiments is to make each fingerprint unique to the content of a URL document, not to the URL itself.
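A content fingerprint of this kind can be sketched as a hash over normalized document text, so that the same content yields the same fingerprint wherever it is hosted. The normalization shown here (tag stripping, whitespace folding, lower-casing) is an illustrative assumption, not the scheme of any particular embodiment:

```python
# Illustrative content fingerprint: a hash of the normalized text of a
# document, independent of the URL it was retrieved from.
import hashlib
import re


def fingerprint(html_text):
    """Return a fingerprint derived from document content, not its address."""
    text = re.sub(r"<[^>]+>", " ", html_text)        # strip markup (simplistic)
    text = re.sub(r"\s+", " ", text).strip().lower() # fold whitespace, ignore case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = fingerprint("<html><body>Hello   World</body></html>")
b = fingerprint("<HTML><BODY>hello world</BODY></HTML>")
print(a == b)  # True: same content gives the same fingerprint at any URL
```

Because the fingerprint depends only on content, two copies of a document at different URLs compare equal, while any edit to the text produces a different fingerprint.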
Another object of at least one of the embodiments is to automatically locate the original of moved or altered content, whereby such an embodiment can be trusted to maintain a set of documents without manual intervention.
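Given a content-based fingerprint, locating moved content reduces to comparing a stored fingerprint against candidate documents. In this sketch the `fetch` interface and the in-memory "web" are stand-ins for real HTTP retrieval, and the fingerprint is a simplified hash; all names are hypothetical:

```python
# Sketch: relocate a moved document by matching its stored content
# fingerprint against a set of candidate URLs.
import hashlib


def fingerprint(text):
    """Simplified content fingerprint (hash of the document body)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def locate_moved(target_fp, candidate_urls, fetch):
    """Return the first candidate URL whose fetched content matches target_fp."""
    for url in candidate_urls:
        body = fetch(url)
        if body is not None and fingerprint(body) == target_fp:
            return url
    return None


# Stubbed fetch over an in-memory "web"; in practice this would be an HTTP GET.
site = {"/new/doc.html": "annual report", "/other.html": "something else"}
fp = fingerprint("annual report")           # fingerprint stored when the link was made
print(locate_moved(fp, ["/other.html", "/new/doc.html"], site.get))  # /new/doc.html
```

Once a match is found, the maintaining system could rewrite the stale link to the new URL without manual intervention, which is the behaviour this object describes.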
Another object of at least one of the embodiments is to update stored information as frequently as it is configured to do so and to provide information on demand.
Another object of at least one of the embodiments is to verify the state of a web site and guarantee that it is fully functional, accurate and up to date.
Another object of at least one of the embodiments is to protect confidential information with a secure system.