As is well known, search engines return search results that match a search query submitted by a user. The corpus of documents that a search engine searches can be extraordinarily large and, in some cases, almost unbounded. For example, some search engines search, or at least attempt to search, the entire “World Wide Web.” To facilitate the search, search engines typically build an index of the content of the corpus. The content may include web pages, and the search engine may add Uniform Resource Locators (URLs) of the web pages to the index. However, due to the nature of the World Wide Web, the search engine must first locate a web page in order to add a reference to the web page to the index.
One common technique to locate web pages is to “crawl the web.” Crawling the web means starting with one web page and following “hyperlinks” in that web page to discover other web pages. The starting web page might have been provided to the search engine as a web page of interest. For example, the provider of content associated with the web page might provide the search engine with a URL of the web page in the hope that the search engine will include that web page, and also the web pages to which it links, in the index.
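By way of illustration only, the crawling technique described above can be sketched as a breadth-first traversal of hyperlinks. The sketch below is a minimal example under stated assumptions, not the method of any particular search engine: the names `crawl`, `get_links`, `TOY_WEB`, and `max_pages` are hypothetical, and a small in-memory dictionary stands in for the World Wide Web so that the example runs without network access.

```python
from collections import deque

def crawl(seed_url, get_links, max_pages=100):
    """Discover pages reachable from seed_url by following hyperlinks.

    get_links is a hypothetical callable that returns the URLs hyperlinked
    from a given page; a real crawler would fetch and parse the page.
    """
    discovered = [seed_url]      # pages found so far, in discovery order
    seen = {seed_url}            # guards against visiting a page twice
    queue = deque([seed_url])    # pages whose links have not been followed yet
    while queue and len(discovered) < max_pages:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                discovered.append(link)
                queue.append(link)
                if len(discovered) >= max_pages:
                    break
    return discovered

# Toy stand-in for the web (an assumption for illustration):
# each URL maps to the URLs it links to.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
    "http://example.com/orphan": [],   # no page links here
}

def toy_get_links(url):
    return TOY_WEB.get(url, [])

pages = crawl("http://example.com/", toy_get_links)
print(pages)
```

In this toy example, the page at `http://example.com/orphan` is never discovered because no other page links to it.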
However, relying only on crawling the web to discover web pages has limitations. For example, some web pages might never be discovered, for instance because no crawled web page links to them. Further, content of web pages is subject to change; therefore, web pages should be re-indexed from time to time. A possible solution to these problems is for the content provider to provide the search engine with a list of URLs rather than a single URL. The content provider might also provide the search engine with information about a particular web page itself, such as how frequently the content on the web page changes. Such information could benefit the search engine in that it would know how often it should re-examine the content of the web page to update the index.
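As a hedged illustration of how provider-supplied change-frequency information might be used, the following sketch selects which submitted URLs are due to be re-examined. Everything named here is an assumption made for the example: the frequency labels (`hourly`, `daily`, `weekly`), the `CHANGE_INTERVALS` mapping, the 30-day default, and the record fields `url`, `change_frequency`, and `last_indexed`.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a provider-supplied label to a re-index interval.
CHANGE_INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def next_reindex_time(last_indexed, change_frequency, default=timedelta(days=30)):
    """Return when a page should next be re-examined.

    Falls back to an assumed default interval when the provider gave no
    (or an unrecognized) change frequency.
    """
    return last_indexed + CHANGE_INTERVALS.get(change_frequency, default)

def due_for_reindex(entries, now):
    """Return the URLs whose next re-index time has already passed."""
    return [
        entry["url"]
        for entry in entries
        if next_reindex_time(entry["last_indexed"], entry.get("change_frequency")) <= now
    ]

# A list of URLs as a content provider might submit it (illustrative data).
submitted = [
    {"url": "http://example.com/news", "change_frequency": "hourly",
     "last_indexed": datetime(2024, 1, 1)},
    {"url": "http://example.com/about", "change_frequency": "weekly",
     "last_indexed": datetime(2024, 1, 1)},
    {"url": "http://example.com/misc",
     "last_indexed": datetime(2024, 1, 1)},
]

due = due_for_reindex(submitted, now=datetime(2024, 1, 2))
print(due)
```

One day after the last indexing pass, only the page marked as changing hourly is due for re-examination; the weekly page and the page with no stated frequency are left alone.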
Another technique that may be used by some search engines is to apply algorithms to discover additional information about the content of the web pages. For example, some web pages might be identified by multiple URLs. It would not benefit a user to receive search results identifying all of the URLs, as that would, in effect, provide the user with redundant web pages. If the search engine can discover, through analysis of the content of two web pages, that they are substantially identical, the search engine can discard one web page. As another example, a web page is often divided into a region with content of interest to a search, such as a news article, and a region with content that may not be of interest, such as an ad banner. The search engine could provide better search results if the search engine ignores the ad banner portion of the web page. Thus, the search engine might algorithmically predict which portions of the web page are of interest to a search.
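The duplicate-detection example above does not name a particular algorithm. Purely as an illustration, the sketch below uses one well-known approach, comparing sets of overlapping word sequences ("shingles") by Jaccard similarity. The function names, the shingle length of three words, and the 0.8 threshold for "substantially identical" are all assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def substantially_identical(text_a, text_b, threshold=0.8):
    """Treat two pages as duplicates when their shingle sets mostly overlap."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold

def deduplicate(pages, threshold=0.8):
    """Keep the first URL seen for each group of substantially identical pages."""
    kept = {}
    for url, text in pages.items():
        if not any(substantially_identical(text, t, threshold) for t in kept.values()):
            kept[url] = text
    return list(kept)

# Illustrative data: the first two URLs identify the same content.
pages_by_url = {
    "http://example.com/story":
        "The city council approved the new budget on Tuesday after a long debate",
    "http://example.com/story?ref=home":
        "The city council approved the new budget on Tuesday after a long debate",
    "http://example.com/sports":
        "The home team won the championship game in overtime last night",
}

unique_urls = deduplicate(pages_by_url)
print(unique_urls)
```

Here the second URL is discarded because its content matches the first, while the unrelated sports page is kept. This pairwise comparison is quadratic in the number of pages and is shown only for clarity.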
However, a problem with the search engine applying algorithms to attempt to learn more information about the content of web pages is that the algorithms may fail to correctly characterize the content of the web pages. For example, the web page might be constructed such that it is difficult to determine what portion is not relevant to a search. Moreover, applying algorithms may be limited to the content on the web page itself.
Therefore, limitations exist with respect to how well a search engine can provide search results due to the limited amount and accuracy of information that search engines have with respect to the content in the corpus to be searched.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.