This invention relates to Internet technology, and more particularly to the maintenance of a repository of summary data containing document locators such as uniform resource locators (URLs), for example a repository associated with a search engine.
To conduct a search on the Internet, a user typically queries a search engine (such as Yahoo, Hotbot, etc.) to find a desired piece of information, which is contained in a document stored on a web site. The search engine typically does not keep copies of documents, but instead keeps an indexed repository of summary data (also called metadata) containing links (also called hyperlinks, or uniform resource locators (URLs)) to the documents. The summary data is generated by using a gatherer to "crawl" a web site and analyze its content.
When the user queries a search engine, the search engine returns a list of results matching the query terms. Each result contains the URL for a document as well as an abstract. The user then clicks on the URL to go to the document. When the search engine repository is out of date, the URLs presented in the results may not be valid, and in such a case an error code is returned. There are several reasons why URLs may not be valid, for example the document may no longer exist or it may have been moved to another location. This problem of an invalid URL is often known as a "broken link."
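As an illustration of the error-code behavior described above (this sketch is not part of the original disclosure, and the function names are hypothetical), a broken link can be recognized by examining the status code returned when the document is requested:

```python
def is_broken(status_code):
    """Classify an HTTP status code: 4xx codes (e.g. 404 Not Found,
    410 Gone) and 5xx server errors indicate the URL did not resolve
    to a valid document."""
    return status_code >= 400


def check_link(url, fetch):
    """Return True if the URL is still valid.

    `fetch` is any callable that issues the request for `url` and
    returns an HTTP-style status code; it is parameterized here so the
    sketch stays independent of any particular HTTP client.
    """
    return not is_broken(fetch(url))
```

A document that has been deleted would typically yield a 404, while one moved without a redirect in place may yield a 404 or 410; either case is classified as a broken link.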
Broken links are frustrating to the user, as it often takes a significant amount of time for the error message to be returned. If a user is frustrated enough times, he or she will likely become dissatisfied with the search engine and use a different one. Thus the quality of a search engine can be measured by how up to date it is, or put another way, what percentage of broken links it has in its repository.
Web sites change at a rapid pace. Because the web sites have no control over where the summary data of their documents is stored, there is no way of notifying anyone about any changes. Thus search engines currently maintain their repositories by periodically recrawling their resources (i.e. the web sites which they have summarized). One way to maintain a repository by recrawling is described in U.S. Pat. No. 5,855,020 to Kirsch. Kirsch monitors a dynamic network feed for new URLs and validates them prior to adding them to a repository. He also revalidates the repository by periodically assessing the validity of each URL, with the period determined by an associated volatility for each URL. This recrawling is costly in terms of time and resources, and thus cannot be performed often enough to keep up with the rapid pace of change.
Thus it is desirable to have a way to reduce or avoid URL maintenance recrawling by dynamically detecting and eliminating broken links in order to maintain a search engine (or other Internet) data repository.
A method is described for maintaining a repository of summary data about documents associated with document locators, where the repository of summary data is stored separately from the documents and contains the document locators. When a user requests a document associated with one of the document locators, the method requests the document associated with the document locator; receives a result based on the request; analyzes the result to determine the validity of the document locator; and requests an update of the repository based on the validity of the document locator. In a network environment, the document locators are uniform resource locators, or URLs. The repository update may take the form of deleting the URL from the repository, or moving it to another location for further examination.
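The steps of the described method can be sketched as follows. This is a hedged illustration under assumed interfaces (a dict-like repository, a `fetch` callable returning an HTTP-style status code, and an optional quarantine location), not the actual patented implementation:

```python
def handle_user_request(url, repository, fetch, quarantine=None):
    """When a user requests the document for a locator: request the
    document, receive a result, analyze it for validity, and update
    the repository accordingly. Returns True if the locator was valid.

    `repository` and `quarantine` are dict-like mappings from URL to
    summary data; `fetch` issues the request and returns a status code.
    All names here are illustrative assumptions.
    """
    status = fetch(url)          # request the document; receive a result
    valid = status < 400         # analyze the result for validity
    if not valid:
        # Update the repository: delete the broken locator's entry...
        entry = repository.pop(url, None)
        if quarantine is not None and entry is not None:
            # ...or move it to another location for further examination.
            quarantine[url] = entry
    return valid
```

Because the check is triggered by the user's own request, broken links are detected and removed dynamically, without a separate maintenance recrawl of the summarized sites.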