An index is an organized list of references or pointers to a body of text or other indexable material. The index at the back of a book is one example. An electronic index can be generated, for example, by parsing a body of documents and creating an alphabetized (or otherwise structured) list of the keywords in the documents, with pointers to the documents (and possibly also the locations in the documents) that contain the keywords. As used herein, “index” refers to the electronic variety of index.
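A minimal sketch of such an electronic index follows, assuming a small in-memory collection of documents. The function name build_index, the document identifiers, and the tokenization rule are hypothetical and chosen only for illustration; they are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each keyword to {document id: [word positions]}.

    `documents` is assumed to be a dict of document id -> text.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumed tokenization: lowercase runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(position)
    # Alphabetize the keywords, as in a back-of-book index.
    return dict(sorted(index.items()))


# Hypothetical two-document collection.
docs = {
    "doc1": "Search engines index web pages",
    "doc2": "An index points to pages",
}
index = build_index(docs)

# Look up which documents contain a keyword, and where.
print(index["pages"])
```

Each keyword maps to the documents that contain it and to the word offsets within those documents, which corresponds to the "pointers" and "locations" described above.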
To find web pages that match user queries, Internet search engines use large-scale indexes of web pages available on the Internet. The number of documents and other types of web pages on the Internet makes the task of generating an index difficult. An index of all web pages takes significant computing resources to create and store. Such an all-encompassing index is also inefficient to use, because the time to search an index increases with its size.
Techniques have been used to selectively choose which web pages will or will not be included in an index. However, these techniques have not tried to predict which web pages are likely to be searched by users. Rather, they have estimated the so-called general importance of web pages from the hyperlink structure of the web. That is, web pages have been chosen for inclusion in a search engine index without taking into account actual user search behavior or user-driven factors. Details are provided below.
The terms “URL” and “web page” are used interchangeably herein. While a URL may identify a particular instance of a web page, the web page is the actual document and its content. A URL points to a web page and is therefore a shorthand way of referring to the web page itself.
The PageRank technique used by some search engines is a popular method for index selection. PageRank and related link-analysis techniques such as HITS (hyperlink-induced topic search) assign a score to each web page according to the hyperlink structure of the web. A web page with a high score (a sufficient number of links into and/or out from the web page) will be selected into the index. However, it is not clear whether these kinds of link metrics are effective criteria for deciding if a web page should be included in an index. Moreover, such a score is computed from a web graph without considering web content, URL properties, users' search behaviors, and so on.
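The link-based selection described above can be sketched as follows. This is a simplified, textbook-style PageRank iteration, not the scoring used by any specific search engine. The function names, the damping value of 0.85, the iteration count, the example link graph, and the selection threshold are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages from a link graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def select_for_index(rank, threshold):
    """Keep only pages whose link score meets the (assumed) threshold."""
    return sorted(p for p, score in rank.items() if score >= threshold)


# Hypothetical four-page web graph; page "d" has no incoming links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
selected = select_for_index(rank, threshold=0.25)
print(selected)
```

Note that the only input is the link graph. Page content, URL properties, and user search behavior play no part in the score, which is the limitation identified above.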
Techniques related to user-driven index selection are described below.