This invention relates to repository security, and more particularly to the protection of searchable repositories such as those found on the Internet.
Internet search engines (e.g., Hotbot, Yahoo) spend a great deal of time and effort developing and maintaining repositories of information stored on their own servers. These repositories contain summary data about network resources such as documents or web pages found on the Internet. The data includes links, also called hyperlinks or uniform resource locators (URLs), which are essentially addresses where the documents can be found; the documents themselves are not stored in the repositories but on other servers.
The repositories are created and maintained by using a web crawler or gatherer to access documents on a large scale from a large number of servers on the Internet. A crawler or gatherer typically will perform an initial query to obtain an initial resultant set of documents, download and analyze the results to generate the summary data, extract and store the URLs contained within, query the URLs for more results, and proceed in a recursive process to gather as many URLs as possible.
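The recursive gathering process described above can be sketched as follows. This is a minimal illustration only: the URLs and the in-memory link graph standing in for fetched pages are hypothetical, and a real crawler would download and parse live documents rather than look them up in a table.

```python
# Hypothetical link graph standing in for fetched pages: URL -> URLs it links to.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/"],
}

def gather(seed):
    """Recursively follow links from a seed URL, collecting every reachable URL."""
    seen = set()

    def visit(url):
        if url in seen:          # skip URLs already gathered
            return
        seen.add(url)            # record (in practice: download, summarize, store)
        for link in LINKS.get(url, []):
            visit(link)          # recurse into each extracted URL

    visit(seed)
    return seen
```

Starting from the seed `http://example.com/`, the sketch collects all three reachable URLs, mirroring how a crawler proceeds recursively to gather as many URLs as possible.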
The key to obtaining the URLs is a document's page specification, which defines how the page is assembled when viewed. A common page specification language is HyperText Markup Language, or HTML. In HTML, a URL is coded as an HREF attribute within a tag. Other page specification languages use similar tags to indicate the presence of a URL.
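Extracting the HREF-coded URLs from an HTML page specification can be done with a standard parser. A minimal sketch, using Python's standard-library `html.parser` (the sample page and URLs are illustrative assumptions):

```python
from html.parser import HTMLParser

class HrefExtractor(HTMLParser):
    """Collects the HREF attribute of every anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # anchor tag; attrs is a list of (name, value) pairs
            self.urls.extend(value for name, value in attrs if name == "href")

# Hypothetical page specification containing two HREF-coded URLs.
page = ('<html><body>'
        '<a href="http://example.com/doc1">Doc 1</a>'
        '<a href="http://example.com/doc2">Doc 2</a>'
        '</body></html>')

extractor = HrefExtractor()
extractor.feed(page)
# extractor.urls now holds the two URLs found in the page.
```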
Crawling is an expensive and time-consuming process, and thus the search engine repositories (as well as other Internet or intranet repositories or databases) are very valuable, as millions of end users access them every day. End users access the search engine's repository by means of a query. The search engine presents results in the form of a list of summary data, and the user chooses the appropriate item from among the results. Users thus typically access documents in a limited manner, sequentially searching and examining documents until the desired item is found, in contrast with web crawlers or gatherers, which access documents in a wholesale fashion.
Unfortunately, in addition to building a repository or database, a web crawler or gatherer can be used to systematically extract and replicate all the information from someone else's repository or database by the same querying/parsing/extracting process described above. Thus it is desirable to provide a means to protect an Internet or intranet repository or database from wholesale access yet still provide limited access for the typical end user.
A method and system are described for protecting a searchable repository containing a document locator, such as a uniform resource locator (URL), by replacing the document locator with a unique time-sensitive key when a user searches the repository. A user search request is intercepted, each URL in the original search result is extracted and replaced with a key, and the altered result is returned to the user. When the user selects the key from the search result within the expiration interval, the associated URL, and hence the document, can be retrieved.
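The key-substitution scheme above can be sketched as a table mapping each issued key to its URL and issue time. This is an illustrative sketch only: the function names, the 300-second expiration interval, and the sample URL are assumptions for the example, not details from the source.

```python
import secrets
import time

EXPIRATION_SECONDS = 300   # hypothetical expiration interval for issued keys
_key_table = {}            # key -> (original URL, time the key was issued)

def protect(url):
    """Replace a URL in a search result with a unique, time-sensitive key."""
    key = secrets.token_urlsafe(16)          # unpredictable, unique key
    _key_table[key] = (url, time.time())
    return key

def resolve(key):
    """Return the URL for a key if it exists and has not expired, else None."""
    entry = _key_table.get(key)
    if entry is None:
        return None                          # unknown or already-purged key
    url, issued = entry
    if time.time() - issued > EXPIRATION_SECONDS:
        del _key_table[key]                  # expired: discard the mapping
        return None
    return url
```

Because a fresh key is issued per result and expires after a short interval, a gatherer cannot harvest stable URLs from the altered results, while an end user who selects a key promptly still retrieves the associated document.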