1. Field of the Invention
The present invention relates to a page-ranking method (and system) and, more particularly, to a page-ranking method which includes mining a portion of a network (e.g., intranet) user's desktop computer (e.g., files, documents, bookmarks, e-mail, and potentially all other content associated with and/or stored on the user's personal computer) in order to compute the ranking.
2. Description of the Related Art
Internet search engine websites such as www.google.com (hereinafter “Google™”) are known to work well for searching the Internet. The Google™ search engine has an important feature that helps it produce high precision results in Internet searches. Namely, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank.
However, while Internet search engines such as Google™ may work well for searching the Internet, these search engines do not work well for searching other networks such as an intranet. The strategy of such Internet search engines assumes that there is a relatively high number of hyperlink references to high-value pages. The number of such references and the value of the referring pages can be used to compute the value of the referenced page.
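By way of illustration only, the link-based computation described above can be sketched as a simple iterative calculation. The sketch below is not taken from any particular search engine. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages that have no outgoing links are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a link-based value for each page.

    links maps each page to the list of pages it references.
    damping and iterations are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline value.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A referring page splits its value among the pages it references.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no references spreads its value evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page graph: "a" and "b" both reference "c".
    example = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, value in sorted(pagerank(example).items()):
        print(page, round(value, 4))
```

In this hypothetical graph, page "c" is referenced by two pages and therefore receives the highest value, while page "b", which no page references, receives only the baseline value. When few pages reference one another, as is typical of an intranet, most pages remain near the baseline and the computed values carry little information.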
However, while these assumptions have been proven to work well for the Internet, the assumptions do not work well for other networks such as intranets. This is due at least in part to the fact that other networks (e.g., intranets), unlike the Internet, lack economic and other incentives for cross-referencing other valuable pages.
For example, an intranet page's author generally has little motivation to embed such references in the pages for which he is responsible. Therefore, the density of such references in intranet pages (e.g., documents), relative to the density of such references in Internet pages, is low. Further, the number of references to a given intranet page that do exist may have little to do with the value of that intranet page.
FIG. 1 illustrates a related art system 10 (e.g., modular scoring system) for ranking search results which is disclosed by Fagin et al. (“SYSTEM, METHOD AND SERVICE FOR RANKING SEARCH RESULTS USING A MODULAR SCORING SYSTEM”, U.S. Pat. Pub. No. 2005/0262050), and which is commonly assigned with the present invention and incorporated by reference herein.
The system 10 includes a computer program product stored on a host server connected to a network (e.g., intranet, Internet, etc.). The system 10 includes a set of scoring modules 105, a duplication module 110, and a rank aggregation processor 115. Each of the scoring modules 105 takes as input one or more graded sets of documents (e.g., pages), an auxiliary information module 125, and (optionally) a query 120. Output from each of the scoring modules to the rank aggregation processor 115 is a ranked set of documents. The rank aggregation processor 115 may weight the outputs from each of the scoring modules 105 equally, or it may weight the outputs from each of the scoring modules 105 differently to meet scoring requirements of a specific client, user, intranet, or network.
Scoring modules 105 include a set of indices 130 such as, for example, a content index 135, a title index 140, and an anchortext index 145. Additional indices may be used as desired. The content index 135, the title index 140, and the anchortext index 145 take as input query 120 and find a set of documents in dB 40 that match the text of input query 120. The indices provide pointers into the set of documents in dB 40 containing the query terms, and pass them to the union module 150 and to the rank aggregation processor 115.
Indices 130 provide graded lists of found documents that are scored using any suitable scoring analysis such as, for example, TF*IDF (Term Frequency Times Inverse Document Frequency). TF*IDF scores a document based on the number of times a query term appears in that document: the higher the term frequency, the more relevant the document. Further, TF*IDF weights the relevance of a query term based on the inverse of the number of documents containing the query term. TF*IDF places more weight on a less common term than on a more common term, as determined by the number of documents found with each term. Consequently, documents with the highest number of least common terms in the search query receive the highest score.
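By way of illustration only, a TF*IDF score of the kind described above can be sketched as follows. The logarithmic form of the inverse document frequency, the function name `tf_idf_scores`, and the sample documents are assumptions made for the example; the related art system may use any suitable variant.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query terms.

    documents maps a document identifier to its list of tokens.
    Returns a mapping of document identifier to score.
    """
    n_docs = len(documents)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0:
                continue  # the term appears in no document
            # Assumption: log-scaled inverse document frequency.
            idf = math.log(n_docs / doc_freq[term])
            score += term_counts[term] * idf
        scores[doc_id] = score
    return scores


if __name__ == "__main__":
    # Hypothetical documents, already split into tokens.
    docs = {
        "d1": ["intranet", "search", "search"],
        "d2": ["intranet", "page"],
        "d3": ["intranet", "ranking", "search"],
    }
    print(tf_idf_scores(["intranet", "search"], docs))
```

In this hypothetical example the term "intranet" appears in every document and therefore contributes nothing to any score, while the less common term "search" separates the documents: "d1" contains it twice and scores highest, and "d2" does not contain it and scores zero.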
The outputs of indices 130 are combined in a union module 150 to form a single graded set of documents. The duplication module 110 duplicates the single graded set of documents as needed to provide inputs to the scoring modules 105. As needed, scoring modules 105 may also utilize query 120 and auxiliary information module 125 as input. Scoring modules 105 further include ranking or scoring processors such as, for example, a page-ranking processor 155, an indegree processor 160, a discovery date processor 165, a uniform resource locator (URL) word processor 170, a URL depth processor 175, a URL length processor 180, a geography processor 185, a discriminator processor 190, etc. The scoring modules 105 may be selected or deselected by selection module 195 as needed for a query, a user, a client, a network (e.g., an intranet), etc.
The rank aggregation processor 115 utilizes a variety of methods to aggregate the outputs of the scoring modules 105 such as, for example, positional methods, graph methods, or Markov chain methods. For example, when using positional methods, the rank aggregation processor 115 gives each document an output score that is computed as a function of the various ranks received by a document from the scoring modules 105. The output score assignment may be determined by, for example, the mean rank or the median rank. A document is then scored by the output rank received.
However, it is desirable to improve the quality of the rankings associated with a page-ranking system and method.