This disclosure relates generally to searching document repositories and, more particularly, to searching content that is not accessible via the World Wide Web.
With the advent of the Internet and the World Wide Web, search engines were created to help users locate information among the millions of web pages accessible using these technologies. Search engines have become very efficient at indexing documents and responding to queries, and they are now familiar to many people. However, much of the data, documents, and other files used by organizations is not available via the Internet or an intranet and cannot be accessed through a uniform resource locator (URL). A URL is an address for a particular web page, document, or other file that enables web servers, search engines, and browsers to access the content of that file. A URL usually includes a domain portion, such as “www.myco.com” or “www.acme.net,” and a path portion for a file, such as “/myfile/index.htm.”
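As an illustration of the domain and path portions described above, the following minimal sketch splits a URL with Python's standard `urllib.parse` module. The helper name `split_url` and the `http` scheme are assumptions made for the example; the domain and path values are those given above.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return the (domain portion, path portion) of a URL."""
    parts = urlsplit(url)
    # netloc holds the domain portion; path holds the path portion.
    return parts.netloc, parts.path


# The "http" scheme is assumed here for illustration only.
domain, path = split_url("http://www.myco.com/myfile/index.htm")
print(domain)  # www.myco.com
print(path)    # /myfile/index.htm
```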
Files that are not accessible from the Internet are typically not indexed by web-based search engines. A search engine, such as an intranet search appliance or an Internet search engine, can index the content of such inaccessible files by receiving the content in a push from a connecting module. For example, a connecting module may acquire the contents of such files and push them to the search engine in a linear stream. The search engine may then index the contents and store them in its cache. To make the contents accessible in response to a search query, the search engine provides a URL to the cached contents.
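The push arrangement described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular search engine: the class names `SearchEngine` and `ConnectingModule`, their methods, the whitespace tokenization, and the cache base URL are all hypothetical.

```python
from collections import defaultdict


class SearchEngine:
    """Hypothetical search engine that indexes and caches pushed content."""

    def __init__(self, cache_base_url):
        self.cache_base_url = cache_base_url  # assumed location of the cache
        self.cache = {}                       # document id -> cached contents
        self.index = defaultdict(set)         # term -> ids of matching documents

    def receive_push(self, doc_id, content):
        # Store the pushed contents and index them by whitespace-separated term.
        # This sketch does not remove old terms if a document is pushed again.
        self.cache[doc_id] = content
        for term in content.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        # Results are URLs to the cached copies, not to the original files.
        doc_ids = self.index.get(term.lower(), ())
        return sorted(f"{self.cache_base_url}/{doc_id}" for doc_id in doc_ids)


class ConnectingModule:
    """Hypothetical module that acquires file contents and pushes them."""

    def __init__(self, repository):
        self.repository = repository  # mapping of file name -> file contents

    def push_all(self, engine):
        # Push every file, one after another, in a single linear pass.
        for doc_id, content in self.repository.items():
            engine.receive_push(doc_id, content)


repository = {"a.txt": "quarterly budget", "b.txt": "budget forecast"}
engine = SearchEngine("http://search.example/cache")
ConnectingModule(repository).push_all(engine)
print(engine.search("forecast"))
```

In this sketch the search engine is entirely passive: it learns about a file only when the connecting module chooses to push it.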
Such systems suffer from drawbacks including processing time and staleness. Pushing the contents of a file system to the search engine may take considerable time, and during the push the contents of some files may be lost without notice. Furthermore, the search engine cannot control the load on the web server pushing the file contents. Thus, the process of pushing documents may be time-intensive and processing-intensive. Moreover, the contents of a document served from the cache may become out of date, and the search engine may have no way to update the contents without receiving another push.
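The staleness drawback can be illustrated with a small sketch. The dictionaries and the helper `stale_documents` are hypothetical; the comparison is made from the vantage point of an observer who can see both the source files and the cache, which the search engine described above cannot do on its own.

```python
def stale_documents(source, cached):
    """Return names of cached documents whose contents no longer match the source."""
    return sorted(
        name for name, content in cached.items() if source.get(name) != content
    )


source_files = {"policy.txt": "version 1"}
cache = dict(source_files)                 # snapshot taken at push time
source_files["policy.txt"] = "version 2"   # the file changes after the push

print(stale_documents(source_files, cache))  # ['policy.txt']

cache.update(source_files)                 # only another push refreshes the cache
print(stale_documents(source_files, cache))  # []
```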