The Internet is becoming increasingly important as a repository of information. For example, such information may be stored on the World Wide Web (“Web”) in the form of Web pages. To search or access information located on the Web, a user typically uses a search engine such as GOOGLE, YAHOO, ASK JEEVES, MSN SEARCH or the like. Search engines generally operate by creating indices by spidering or crawling over Web pages. Typical crawlers today discover and index Web pages simply by following the hyperlinks from one page to another. Using this method, in order for a search engine to index a page, the page has to be static and, in addition, other pages must link to it so that it can be discovered through crawling. Unfortunately, an ever-increasing amount of information is available to users only through site-specific search interfaces. In order to access these Web pages, a user must input one or more keywords or text strings into the site-specific search interface. Conventional search engines are unable to discover and index these pages because they are dynamically generated; there are no static links to these pages. These “hidden” pages are often referred to as the “Hidden Web” or the “Deep Web.”
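By way of illustration only, the link-following behavior of a conventional crawler described above may be sketched as a breadth-first traversal over hyperlinks. The sketch below is a simplified assumption and not an implementation of any particular search engine: the page names, the STATIC_LINKS table, and the crawl function are hypothetical, and the Web is modeled as a small in-memory table rather than being fetched over a network. It shows that a page generated only in response to a query, and to which no static page links, is never discovered.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it statically links to.
# "/results?q=asthma" stands in for a dynamically generated page that exists
# only after a user submits a query; no static page links to it.
STATIC_LINKS = {
    "/index": ["/about", "/search"],
    "/about": ["/index"],
    "/search": [],             # search form only, no outgoing hyperlinks
    "/results?q=asthma": [],   # reachable only by submitting the form
}


def crawl(seed, links):
    """Breadth-first traversal that follows hyperlinks starting at seed."""
    discovered = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered


indexed = crawl("/index", STATIC_LINKS)
hidden = set(STATIC_LINKS) - indexed

print("indexed:", sorted(indexed))
print("hidden:", sorted(hidden))
```

In this toy model the crawler reaches the search form page itself but not the results page behind it, which corresponds to the Hidden Web content discussed above.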
The volume of information contained in the Hidden Web is increasing rapidly as many entities and organizations place their content online through easy-to-use Web interfaces. For example, the Securities and Exchange Commission and the United States Patent and Trademark Office each make public documents available via Web-based search interfaces. The content of these databases is, however, hidden from users who are searching with conventional search engines. Moreover, the content of many Hidden Web sites is often highly relevant and useful to particular searches performed by users. For example, PubMed hosts numerous high-quality documents on medical research that have been selected through a carefully conducted peer-review process. The documents contained in the PubMed database are generally hidden from users unless they use the site-specific search interface.
There is thus a need for a method and system capable of automatically identifying and downloading Web pages from the Hidden Web so that conventional search engines (e.g., GOOGLE, YAHOO, ASK JEEVES, MSN SEARCH, etc.) can index and subsequently access the pages. There is also a need for a method and system for generic information retrieval from Hidden Web pages. The method may be implemented using a software program such as a crawler that automatically downloads Web pages for search engines. Preferably, the crawler is able to download or otherwise make available Web pages such that current search engines are able to index the Web pages. Alternatively, Hidden Web pages may be downloaded or replicated locally on a user's computer. The Hidden Web pages are thus made available to users via conventional search engines.
The method and system of downloading and indexing Hidden Web pages will allow typical Internet users to easily access, from a single location (e.g., a single search engine), information that was previously available only by searching through site-specific search interfaces. The method and system would improve the overall user experience by reducing the time and effort wasted in searching through a multitude of site-specific search interfaces for Hidden Web pages. Finally, current search engines introduce a significant bias into search results because of the manner in which Web pages are indexed. By making a larger fraction of the Web available for searching, the method and system are able to mitigate the bias that the search engine introduces into the search results.