1. Field of the Invention
The present invention relates to search engine technology. In particular, the present invention relates to search engines and methods for quick retrieval of relevant and timely documents from a wide area network, such as the World Wide Web.
2. Discussion of the Related Art
The search engine is an important enabling application of the internet which allows the user to quickly identify and retrieve information (“web pages”) from the World Wide Web (WWW). In fact, the search engine has caused a profound consumer behavioral change: the user now prefers typing his data retrieval criteria into a “search box” to “browsing” or traversing painstakingly and manually cataloged hierarchical directories. Today, more than a hundred million searches are performed every day on the several billion web pages of the WWW. Yet, existing methods remain unsatisfactory in addressing the most basic search problems.
Three desired qualities are fundamental to a search: the relevancy of the search results returned, the extent of the coverage (“scope”) over the WWW, and the age (“timeliness”) of the information retrieved. As to relevancy, as the index size grows, current search engines should aim to achieve ever greater refinement and accuracy in the web pages they find and rank, so that the first few web pages returned to a user correspond precisely to the information the user is seeking. With respect to scope, even the largest search engines index only a fraction of the WWW at the present time. Nevertheless, most of the web pages that are indexed are never returned as search results to actual queries. Thus, search engines should improve the scope of their indexing, especially automatic indexing, so that a greater portion of the useful content that exists on the WWW can be made available and more efficiently accessed. As to timeliness, the largest search engines today are unable to refresh their search indexes quickly enough to return only current information from the WWW. Today, these search engines often return many web pages whose content has changed significantly since they were indexed; at worst, some indexed web pages simply no longer exist (i.e., “dead links”).
To improve relevancy, some search engines take a “tiered” approach. Under a tiered approach, a search engine gives greater weight in its indexing to one or more small subsets of the WWW, which are often handcrafted, hierarchical directories that it considers to be of high quality. However, because the web pages in the subsets are manually selected, these web pages often lag in time relative to the rest of the index.
To improve scope, niche “meta-search engines” try to provide an equivalent of a larger search index by combining results from multiple search engines. However, by combining the results of many search engines, these niche meta-search engines erase from the results the effects of the included intelligence or careful tuning of the algorithms in each individual, proprietary search engine. The resulting web pages retrieved are also often ranked in an ad-hoc fashion, resulting in a substantial loss of relevancy.
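As a rough illustration of the ad-hoc ranking described above, the following sketch merges ranked result lists from several engines by simple round-robin interleaving. The function name `merge_round_robin` and the sample URLs are hypothetical and are not taken from any particular meta-search engine; the point is only that a merge of this kind discards whatever relevance scores each underlying engine computed.

```python
def merge_round_robin(result_lists):
    """Interleave ranked result lists, dropping duplicate URLs.

    Assumption: each inner list is ordered best-first by its own engine.
    The per-engine relevance scores are not available to the merger, so
    only rank position is used.
    """
    merged = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


if __name__ == "__main__":
    engine_a = ["a.example", "b.example", "c.example"]
    engine_b = ["b.example", "d.example"]
    print(merge_round_robin([engine_a, engine_b]))
```

In this sketch a page ranked first by one engine and a page ranked first by another are treated as equally relevant, regardless of how confident either engine was in its result.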
To improve timeliness, current search engines often identify web pages whose content changes frequently, and accordingly re-index these web pages more frequently than other web pages. Another approach evaluates a web page's historical change frequency and adaptively accesses the web page at a rate commensurate with the recent change frequency. However, these approaches can manage an index over only a relatively small subset of the WWW, and even then only with limited efficiency. In fact, many changes to a web page (for example, a dynamic time-stamp) may not significantly impact the search results to actual queries. Consequently, much of the WWW “crawling” (i.e., content discovery, also called “spidering”) and updating effort is believed to be wasted.
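The adaptive approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduling method of any actual search engine: the `PageHistory` record, the `next_interval_hours` function, and all default values (window size, minimum and maximum intervals) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PageHistory:
    """Hypothetical record of recent crawl outcomes for one web page."""
    # One entry per visit: True if the content differed from the prior visit.
    changed: list = field(default_factory=list)


def next_interval_hours(history, base_hours=24.0, min_hours=1.0,
                        max_hours=720.0, window=10):
    """Pick a re-crawl interval commensurate with the recent change rate.

    Assumption: the interval is linearly interpolated between max_hours
    (page never changed in the window) and min_hours (page changed on
    every visit in the window).
    """
    recent = history.changed[-window:]
    if not recent:
        # No observations yet: fall back to the default interval.
        return base_hours
    change_rate = sum(recent) / len(recent)
    return max_hours - change_rate * (max_hours - min_hours)


if __name__ == "__main__":
    volatile = PageHistory(changed=[True] * 10)
    stable = PageHistory(changed=[False] * 10)
    print(next_interval_hours(volatile), next_interval_hours(stable))
```

Note that a scheduler of this kind counts every detected change equally, so a page whose only change is a dynamic time-stamp would be re-crawled as often as one whose substantive content changes, which is the inefficiency noted above.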
Some solutions to these problems are disclosed in U.S. Pat. Nos. 5,701,256 and 6,070,158, which relate, respectively, to a proteomic sequence search engine and to a phrase-based WWW search engine and meta- or distributed search engines.
U.S. Pat. No. 6,070,158 by William Chang provides an example of the construction of a large-scale search engine.