Documents are stored in electronic form in storage repositories, which can be physically located in many different geographic locations. Each document has a label, or name, intended to uniquely identify the document.
With the Internet and/or other computer networks, computer users are able to access the documents via one or more network servers. Tools, such as search engines, are available to the user to search for and retrieve these documents. A search engine typically uses a utility, referred to as a crawler, to locate stored documents. Results of one or more “crawls” can then be used to generate an index of documents, which can be searched to identify documents that satisfy a user's search criteria. In the case of the web, a resource, which includes a file containing a document, has a uniform resource locator, or URL. Each URL conforms to a known format, or syntax, and is intended to uniquely identify the file.
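By way of illustration only, the relationship between crawl results, an index, and a search can be sketched as follows. The sketch assumes a simple inverted index that maps each term to the URLs of the documents containing it. The URLs, the document text, and the names `build_index` and `search` are hypothetical and are not drawn from any particular search engine.

```python
from collections import defaultdict

# Hypothetical crawl results: URL -> document text (illustrative data only).
crawled = {
    "http://example.com/a.html": "storage repositories hold electronic documents",
    "http://example.com/b.html": "a crawler locates stored documents",
    "http://example.com/c.html": "search engines build an index",
}


def build_index(documents):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with the documents matching the first term, then narrow the set.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


index = build_index(crawled)
print(search(index, "stored documents"))
```

In this sketch every crawled document enters the index regardless of its content, so any document found by the crawl may later appear in a set of search results.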
A crawler typically searches web sites to locate resources, and returns a listing of the resources identified from a given web site. The crawler typically does not return the document itself; the document is fetched at a later stage. The results of a crawl can include documents that may or may not be useful. A typical crawler returns each file that it finds without regard to the contents of the file. An index that is created from the results of the crawl would then include each document identified by the crawler, and the results of a search conducted using the index could contain one or more of the documents identified during the crawl. In addition, in a case that copies of the files/documents identified during the crawl are archived, each document identified during the crawl is saved. Fetching, indexing, archiving and searching crawled resources therefore places a significant drain on resources such as storage, bandwidth and processing.
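As a non-limiting sketch, the listing stage of a crawl described above might resemble the following, which collects the URLs linked from a page without fetching or inspecting the linked files. The page content, the base URL, and the names `LinkCollector` and `list_resources` are assumptions made for illustration. The sketch relies on the `html.parser` and `urllib.parse` modules of the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_resources(base_url, html):
    """Return absolute URLs linked from a page, in order, without duplicates.

    Every link is listed without regard to the contents of the linked file;
    fetching the file itself is left to a later stage.
    """
    collector = LinkCollector()
    collector.feed(html)
    listing = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if url not in listing:
            listing.append(url)
    return listing


# Hypothetical page content (illustrative only).
page = (
    '<a href="/report.pdf">Report</a> '
    '<a href="notes.html">Notes</a> '
    '<a href="/report.pdf">Report again</a>'
)
listing = list_resources("http://example.com/docs/", page)
print(listing)
```

Because the listing contains every linked resource, each one would subsequently be fetched, indexed and possibly archived, whether or not it is useful.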
As a further illustration, from the user's standpoint, the impact can be felt by the user who conducts a search. A search typically involves a user, who enters search criteria that typically include one or more search terms, and a server, or other computer system, which receives the search criteria and generates a set of results that are returned to the user for review. More particularly, in response to the request, the server uses the above-discussed index to identify the set of results to be returned to the user. In effect, the burden of reviewing the documents is placed on the user. Furthermore, the user must use the server's resources as well as network resources to retrieve the documents for review.