Many search engine services, such as Google and Yahoo, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
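As a rough illustration of the keyword-to-web-page mapping described above, the following minimal sketch builds such a mapping and orders matching pages by how many query words they match. The function names, the example URLs, and the match-count scoring are assumptions made for illustration only; an actual search engine service may use very different keyword extraction and ranking measures.

```python
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to the set of pages that contain it.

    `pages` is a hypothetical dict of URL -> list of keywords, standing in
    for the output of a crawl.
    """
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword.lower()].add(url)
    return index


def search(index, query):
    """Return URLs ordered by the number of query words they match.

    Counting matched words is an illustrative stand-in for relevance.
    Ties are broken alphabetically by URL.
    """
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical crawl result.
pages = {
    "http://example.com/a": ["page", "ranking", "web"],
    "http://example.com/b": ["web", "crawler"],
    "http://example.com/c": ["ranking"],
}
index = build_keyword_index(pages)
results = search(index, "web ranking")
```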
One well-known technique for page ranking is PageRank, which is based on the principle that web pages will have links (i.e., “out links”) to important web pages. The importance of a web page is based on the number and importance of other web pages that link to that web page (i.e., “in links”). PageRank is based on a random surfer model of visiting web pages of a web graph (with vertices representing web pages and edges representing hyperlinks) and represents the importance of a web page as the stationary probability of visiting that web page. In the random surfer model, a surfer visiting a current web page will visit a next web page by randomly selecting a link of the current web page. If the current web page has three out links to target web pages, then the probability of visiting each target web page from the current web page is ⅓. PageRank is thus based on a Markov random walk that depends only on the information (e.g., hyperlinks) of the current web page.
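The uniform link selection of the random surfer model can be sketched as follows. The function names and the page labels are hypothetical and are used only to illustrate the ⅓ example above.

```python
import random


def next_page_probabilities(out_links):
    """Return the probability of visiting each target of the current page.

    Under the random surfer model, each out link is equally likely.
    """
    if not out_links:
        return {}
    probability = 1.0 / len(out_links)
    return {target: probability for target in out_links}


def random_surf_step(out_links, rng=random):
    """Pick the next page by selecting one out link uniformly at random."""
    return rng.choice(out_links)


# A current page with three out links (hypothetical page labels).
out_links = ["page_a", "page_b", "page_c"]
probabilities = next_page_probabilities(out_links)
next_page = random_surf_step(out_links)
```

The choice of next page depends only on the out links of the current page, which is what makes the walk a Markov random walk.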
A web graph may be represented as $G = \langle V, E \rangle$, where $V = \{1, 2, \ldots, n\}$ is the set of vertices and $E = \{\langle i, j \rangle \mid i, j \in V\}$ is the set of edges. The links between web pages can be represented by an adjacency matrix $A$, where $A_{ij}$ is set to one when there is an out link from a source web page $i$ to a target web page $j$. The importance score $w_j$ for web page $j$ can be represented by the following:
$$w_j = \sum_i A_{ij} w_i \qquad (1)$$
This equation can be solved by iterative calculations based on the following:
$$A^T w = w \qquad (2)$$
where $w$ is the vector of importance scores for the web pages and is the principal eigenvector of $A^T$.
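One common way to carry out such an iterative calculation is power iteration, sketched below in plain Python. This is an illustration under stated assumptions rather than the method of any particular search engine service: the function name, the example graph, the tolerance, and the iteration limit are all hypothetical. The sketch also rescales the score vector on every step so that its elements sum to one, because the principal eigenvalue of an unnormalized adjacency matrix is generally not exactly one.

```python
def principal_eigenvector(adjacency, max_iter=500, tol=1e-12):
    """Approximate the principal eigenvector of A^T by power iteration.

    `adjacency[i][j]` is 1 when page i has an out link to page j.
    Assumes the graph is strongly connected and aperiodic so that the
    iteration converges.
    """
    n = len(adjacency)
    w = [1.0 / n] * n
    for _ in range(max_iter):
        # One application of equation (1): w_j = sum_i A_ij * w_i.
        updated = [
            sum(adjacency[i][j] * w[i] for i in range(n)) for j in range(n)
        ]
        total = sum(updated)
        if total == 0:
            raise ValueError("graph has no links")
        # Rescale so the scores sum to one (an added normalization step).
        updated = [value / total for value in updated]
        if max(abs(a - b) for a, b in zip(updated, w)) < tol:
            return updated
        w = updated
    return w


# Hypothetical three-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = principal_eigenvector(adjacency)
```

In this example, page 2 receives in links from both other pages and ends up with the highest score.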
PageRank may also factor in that a surfer may randomly select a web page to visit next that is not linked to by the current web page. Thus, the surfer may next visit a target web page of the current web page with a probability of $\alpha$ and next visit a randomly selected web page with a probability of $1 - \alpha$. To factor in this random selection of web pages, PageRank generates an initial transition matrix $P$ by normalizing each non-zero row of the adjacency matrix with the sum of its elements. PageRank then sets each element of a zero row in matrix $P$ to $1/n$ to generate transition probability matrix $\bar{P}$. The model of representing the random selection of links of target web pages and the random selection of web pages can be represented by the following:
$$\bar{\bar{P}} = \alpha \bar{P} + (1 - \alpha) U \qquad (3)$$
where $\bar{\bar{P}}$ is the combined transition probability matrix and $U$ is a uniform probability distribution matrix in which each element is set to $1/n$. PageRank considers the stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_n)^T$ of the transition probability matrix $\bar{\bar{P}}$ to represent the importance of each web page. PageRank may compute the stationary distribution through an iterative process as represented by the following:
$$\pi(t+1) = (\bar{\bar{P}})^T \pi(t) \qquad (4)$$
where $\pi(0) = (1, 1, \ldots, 1)_n^T$, $t$ represents the iteration count, and the iterative process continues until $\pi$ converges on a solution.
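The steps above (row normalization, filling zero rows with 1/n, mixing with the uniform matrix as in equation (3), and iterating as in equation (4)) can be sketched in plain Python as follows. The function name, the example graph, the tolerance, and the iteration limit are hypothetical. The value 0.85 for α is a commonly used choice and is an assumption here, since the text does not specify one. The final rescaling to a distribution that sums to one is also an added step.

```python
def pagerank(adjacency, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate equation (4) on the combined matrix of equation (3).

    `alpha` = 0.85 is an assumed value, not one given in the text.
    """
    n = len(adjacency)

    # Normalize each non-zero row by its sum; set zero rows to 1/n.
    transition = []
    for row in adjacency:
        row_sum = sum(row)
        if row_sum == 0:
            transition.append([1.0 / n] * n)
        else:
            transition.append([value / row_sum for value in row])

    # Equation (3): mix with the uniform matrix U, whose elements are 1/n.
    combined = [
        [alpha * transition[i][j] + (1.0 - alpha) / n for j in range(n)]
        for i in range(n)
    ]

    # Equation (4), starting from the all-ones vector.
    pi = [1.0] * n
    for _ in range(max_iter):
        updated = [
            sum(combined[i][j] * pi[i] for i in range(n)) for j in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(updated, pi)) < tol
        pi = updated
        if converged:
            break

    # Added step: rescale so the scores form a probability distribution.
    total = sum(pi)
    return [value / total for value in pi]


# Hypothetical four-page graph; page 3 has no out links (a zero row).
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
ranks = pagerank(adjacency)
```

In this example, page 3 has no in links and is reached only through random selection, so it receives the lowest score, while page 2 receives the highest.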
A fundamental assumption of PageRank is that a user randomly selects any of the hyperlinks on the current web page. This assumption is, however, incorrect when the user has additional information available to help in deciding which hyperlink to select. A user presumably wants to maximize information gain, so a user with this additional information will likely select the hyperlink that leads to the maximum information gain.