This invention relates generally to a system and method for ranking documents to provide a user with a score of the relevance of a document and in particular to a system and method for ranking a hyperlinked document based on a stochastic backoff process.
In typical information retrieval systems, such as search engines, documents are ranked in response to the keyword queries entered by the user, using well-known statistics based on the number and positions of the keywords in each document. A summary of these ranking systems is given in the book "Managing Gigabytes" by Ian H. Witten, Alistair Moffat and Timothy C. Bell, published by Van Nostrand Reinhold, New York, 1994. Such methods work even in the absence of any hyperlinks between documents. The method proposed here, on the other hand, applies specifically to hyperlinked environments, and uses the presence and locations of hyperlinks to determine the ranks of documents. The weights produced by our procedure can of course be combined with the weights from the standard statistical methods, and the composite weights can also be used for ranking in some applications.
This is not the first proposal for exploiting hyperlinks to rank documents. Kleinberg proposed a different ranking scheme in which he used spectral methods on a square symmetric matrix derived from a subgraph of the crawl graph to determine so-called hubs and authorities for a search topic. J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46 (1999). His paper also appears as IBM Research Report RJ 10076, May 1997.
Fagin et al. study a variety of different random walks and mathematically characterize their behavior. R. Fagin, A. R. Karlin, J. M. Kleinberg, P. Raghavan, R. Rubinfeld, S. Rajagopalan, M. Sudan and A. Tomkins, Random Walks with "back buttons", Proceedings of the ACM Symposium on Theory of Computing, 2000, ACM Press, New York, N.Y. A random walk is an abstract process wherein a random path through a directed graph of web pages is followed. The paper describes under what conditions limiting probability distributions exist for the random walks, and provides methods for computing these limiting probabilities when they exist. They do not, however, consider applications of these stochastic processes to the ranking of hypertext documents.
Brin and Page used a method to assign to each document in a hyperlinked environment a PageRank that is used for ranking the document. S. Brin and L. Page, Anatomy of a Large-Scale Hypertextual Web-Search Engine, Proceedings of the Seventh International World Wide Web Conference, WWW7, Brisbane, 1998 (published by Elsevier in Amsterdam). The ranking is then used in a search engine to present documents to the user based on rank. This ranking is used, for example, by the Google search engine on the World Wide Web. It should be noted, however, that they use a random walk that is very different from the one we use. For instance, there are no backward steps at all in their scheme; instead, they have a random jump operation in which the walker "teleports" to a completely random node in the graph at certain points of the walk.
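The random jump operation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method of Brin and Page as implemented in any search engine: the function name `simulate_teleport_walk`, the parameter `jump_prob`, and the rule of jumping whenever a node has no outgoing links are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def simulate_teleport_walk(graph, jump_prob, steps, seed=0):
    """Estimate visit frequencies for a walk that has random jumps but no back steps.

    graph: dict mapping each node to a list of successor nodes; every successor
           is assumed to appear as a key of the dict as well.
    jump_prob: assumed probability of "teleporting" at each step (hypothetical).
    """
    rng = random.Random(seed)
    nodes = list(graph)
    current = rng.choice(nodes)
    visits = Counter()
    for _ in range(steps):
        successors = graph[current]
        if not successors or rng.random() < jump_prob:
            # Random jump: move to a node chosen uniformly from the whole graph.
            current = rng.choice(nodes)
        else:
            # Forward step: follow a hyperlink chosen uniformly at random.
            current = rng.choice(successors)
        visits[current] += 1
    return {node: visits[node] / steps for node in nodes}
```

Note that the walker here keeps no history at all, which is the key difference from a process with backward steps.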
A stochastic backoff process is a mathematical way to model the behavior of a user who is browsing web pages, wherein the domain of web pages is represented by a directed graph. In a stochastic backoff process, the backward steps of a user from a current node in the directed graph to a prior node may affect the score assigned to a particular web page. Because users do retrace their steps while browsing, a process that includes such backward steps provides a good model of user behavior. Thus, it is desirable to provide a system and method for ranking hyperlinked documents based on a stochastic backoff process, and it is to this end that the present invention is directed.
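To make the idea of backward steps concrete, the following sketch simulates a walk that keeps a history stack and estimates how often each node is visited. It rests on several assumptions that are not taken from the text: a single backoff probability `alpha` shared by all nodes, a uniform choice among outgoing links, and Monte Carlo estimation in place of closed-form formulas for the limiting probabilities. The function name `simulate_backoff` is hypothetical.

```python
import random
from collections import Counter


def simulate_backoff(graph, start, alpha, steps, seed=0):
    """Estimate visit frequencies of a walk with backward steps.

    graph: dict mapping each node to a list of successor nodes (hyperlinks).
    alpha: assumed probability of a backward step, identical at every node
           (a hypothetical simplification).
    """
    rng = random.Random(seed)
    history = [start]  # stack of visited nodes; the top is the current node
    visits = Counter()
    for _ in range(steps):
        current = history[-1]
        successors = graph.get(current, [])
        can_back = len(history) > 1
        can_forward = bool(successors)
        if can_back and (not can_forward or rng.random() < alpha):
            # Backward step: undo the most recent forward step.
            history.pop()
        elif can_forward:
            # Forward step: follow a hyperlink chosen uniformly at random.
            history.append(rng.choice(successors))
        # Otherwise the walker is at the start with no outgoing links and stays put.
        visits[history[-1]] += 1
    return {node: count / steps for node, count in visits.items()}
```

For example, `simulate_backoff({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, "A", alpha=0.3, steps=10000)` returns a dictionary of visit frequencies that sum to 1. Such frequencies are only simulation estimates; whether they converge to a limiting distribution depends on the graph and on the backoff probabilities.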
A ranking system and method are provided wherein documents with hyperlinks, such as documents on the World Wide Web (WWW), are ranked according to a stochastic backoff process. The ranking method and system are based on a stochastic process derived from a random walk through the pages of the web. In particular, the input to the method is a crawl of the hyperlinked environment at hand (e.g., the web, a corporation's intranet or any combination of these). From the crawl, we build a directed graph each of whose nodes is a document in the crawl, with a directed edge from one node A to another node B indicating the presence of a hyperlink from the corresponding document docA to document docB. Next, we define a stochastic process on this graph, as detailed in the following paragraph. Finally, we invoke the formulas in the work of Fagin et al. to compute, for each document, a weight between 0 and 1. The documents in our crawl are now ordered by their weights. In response to a query (say, "all documents containing the word dog or any of its synonyms"), we first retrieve the documents matching the query criteria (a standard task in information retrieval, see for instance Witten et al.). Next, we present these documents to the user sorted in decreasing order of weight.
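The graph construction and the final sorting step described above might be sketched as follows. The weight computation itself is not shown: the `weights` dictionary is assumed to have been computed already, and the values used in the example are placeholders rather than real results. The names `build_graph` and `rank_matches`, the crawl representation, and the whole-word keyword match are all hypothetical choices for illustration.

```python
def build_graph(crawl):
    """Build a directed graph from a crawl.

    crawl: dict mapping each document id to the list of hyperlink targets
           found in that document (an assumed representation).
    Returns a dict mapping each document id to its successor documents,
    keeping only links whose targets are themselves in the crawl.
    """
    graph = {doc: [] for doc in crawl}
    for doc, links in crawl.items():
        for target in links:
            if target in graph and target not in graph[doc]:
                graph[doc].append(target)
    return graph


def rank_matches(texts, weights, keywords):
    """Return matching documents sorted in decreasing order of weight.

    texts: dict mapping document id to its text.
    weights: dict mapping document id to a precomputed weight between 0 and 1.
    keywords: set of lowercase query words (e.g. a word and its synonyms).
    """
    matches = [
        doc for doc, text in texts.items()
        if any(word in keywords for word in text.lower().split())
    ]
    return sorted(matches, key=lambda doc: weights.get(doc, 0.0), reverse=True)


crawl = {"docA": ["docB", "docC", "docX"], "docB": ["docC"], "docC": []}
graph = build_graph(crawl)  # the link to docX is dropped: docX was not crawled

texts = {"docA": "my dog barks", "docB": "cats only", "docC": "a hound is a dog"}
weights = {"docA": 0.2, "docB": 0.5, "docC": 0.3}  # placeholder weights
ranked = rank_matches(texts, weights, {"dog", "hound"})
```

Separating the two steps reflects the description: the graph and weights depend only on the crawl and can be prepared once, while the matching and sorting run for each query.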
In accordance with the invention, a system and method for searching for one or more hyperlinked documents that match a search query is provided. The search system generates a directed graph from a crawl through one or more documents wherein the directed graph has one or more nodes representing one or more documents traversed during the crawl and one or more directed edges wherein each directed edge represents a hyperlink from a first document to a second document. A weight for each document in the directed graph may be determined based on a stochastic backoff process. Once this preprocessing has occurred, a search query is received, one or more documents that match the search query are retrieved, and a ranking of the retrieved documents based on the determined weights is generated.
In accordance with another aspect of the invention, a system and method for ranking of one or more hyperlinked documents is provided. The ranking method generates a directed graph from a crawl through one or more documents wherein the directed graph has one or more nodes representing one or more documents traversed during the crawl and one or more directed edges wherein each directed edge represents a hyperlink from a first document to a second document. A weight of each document in the directed graph may be determined based on a stochastic backoff process and a ranking of the documents is generated based on the determined weight.