The Internet, including the World Wide Web (the “Web”), allows access to an enormous amount of information, which grows daily. This growth, combined with the highly decentralized nature of the Web, creates substantial difficulty in locating selected information content. Prior art Web search services generally perform an incremental scan of the Web to generate various, often substantial, indexes that can later be searched in response to a user's query. The generated indexes are essentially databases of document identification information. Search engines use these indexes to provide generalized content-based searching, but a difficulty arises in evaluating the relative merit or relevance of the identified candidate documents. A search for specific content in documents or web pages in response to a few key words will almost always identify candidate documents whose individual relevance is highly variable. Thus, a user's time can be spent inefficiently viewing numerous candidate documents that are not relevant to what the user is seeking.
Some prior search engines attempt to improve relevancy scores of candidate documents by analyzing the frequency of occurrence of the query terms on a per-document basis. Other weighting heuristics, such as the number of times that any of the query terms occur within a document and/or their proximity to each other, have also been used. These relevance ranking systems typically presume that a greater number of occurrences of specific query terms within a document means that the document is more likely to be relevant and responsive to the query. However, this assumption is not always accurate.
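The limitation of occurrence counting can be illustrated with a minimal sketch. The function name `term_frequency_score`, the tokenization rule, and the two sample documents are hypothetical and chosen only for illustration; they are not taken from any particular prior search engine.

```python
import re
from collections import Counter


def term_frequency_score(document: str, query: str) -> int:
    """Count how many times any query term occurs in the document."""
    # Assumption: words are lowercase runs of letters and digits.
    words = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    return sum(words[term] for term in terms)


# Hypothetical documents: one repeats the query term, one actually describes it.
docs = {
    "repetitive": "jaguar jaguar jaguar jaguar buy now",
    "informative": "The jaguar is a large cat native to the Americas.",
}
scores = {name: term_frequency_score(text, "jaguar") for name, text in docs.items()}
print(scores)
```

Under this scoring the repetitive document outranks the informative one, which is the kind of inaccuracy the occurrence-count assumption can produce.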
Another method of determining the relevancy of a document is link analysis. Generally, link analysis assumes that if important webpages point to a document, then the document is also probably important or relevant. However, typical link analysis models a user's search for information on the Web as fluid moving between containers, where the webpages are represented by containers and the links out of a webpage are represented by connecting conduits of equal diameter. This model assumes that users coming to a webpage must leave the webpage by following one of its links and that users are equally likely to follow any of those links. If a webpage does not refer to any other webpage, it is assumed to refer to all webpages. By solving for the steady state of the system, the model finds the relative likelihood of finding the user on a given webpage if a snapshot of the system were taken. The basic problem with the model is that people are not like fluids.
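The fluid model described above can be sketched as a simple iteration toward the steady state. This is a minimal illustration under stated assumptions: the function name `uniform_link_steady_state`, the fixed iteration count, and the four-page example graph are hypothetical, and iterating from a uniform start is only one of several ways to solve for the steady state.

```python
def uniform_link_steady_state(links, iterations=100):
    """Approximate the steady-state share of users on each page.

    links maps each page to the list of pages it links to.
    Every link out of a page is followed with equal probability.
    """
    pages = sorted(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page in pages:
            # A page with no outgoing links is treated as referring to all pages.
            targets = links[page] or pages
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: D has no outgoing links and no incoming links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
rank = uniform_link_steady_state(links)
print({page: round(value, 3) for page, value in rank.items()})
```

Note that nothing in this computation depends on the query or on the content of any page, which is the weakness the following paragraphs address.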
Rather, people can evaluate the relevance of a webpage to a query. This has two implications for the behavior of the user in the system: 1) whether a user stops searching at a webpage depends on the relevance of that webpage, and 2) when choosing between two links, a user is more likely to follow the link to the more relevant webpage.
Based on these implications, there is a need for a relevance ranking system in which the probability of not leaving a webpage is a function of the relevance of that webpage, and the probability of following an outgoing link from a webpage is a function of the relevance of all the webpages to which it refers and the relevance of the webpage itself.
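One way such relevance-dependent probabilities might look is sketched below. The specific functions are assumptions made purely for illustration and are not stated in this description: the probability of staying is taken to equal the page's relevance score (assumed to lie between 0 and 1), and the remaining probability is split among outgoing links in proportion to the relevance of the referred pages. The function name `transition_probabilities` and the sample scores are hypothetical.

```python
def transition_probabilities(page, links, relevance):
    """Return the probability of staying on `page` or moving to each linked page.

    Assumptions (illustrative only):
      - relevance scores lie in [0, 1];
      - P(stay) equals the relevance of the current page;
      - P(follow link) is proportional to the relevance of the target page.
    """
    targets = links[page]
    total = sum(relevance[target] for target in targets)
    if not targets or total == 0:
        # Nowhere relevant to go: the user remains on the page.
        return {page: 1.0}
    stay = relevance[page]
    probs = {page: stay}
    for target in targets:
        probs[target] = probs.get(target, 0.0) + (1.0 - stay) * relevance[target] / total
    return probs


# Hypothetical relevance scores for a single query.
links = {"A": ["B", "C"], "B": [], "C": []}
relevance = {"A": 0.2, "B": 0.6, "C": 0.2}
probs = transition_probabilities("A", links, relevance)
print(probs)
```

With these sample scores the user is three times as likely to follow the link to B as the link to C, in contrast to the equal-diameter conduits of the fluid model.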
The present invention provides a method and system for generating relevancy rankings that overcomes the above problems and others.