Graphs are used in the representation and analysis of many information structures. Ranking nodes in such graphs by their quality or importance is of great value. For example, the World Wide Web (WWW) is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these web pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages have a redundancy of information or share a strong likeness in either function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of the information retrieved to a user-defined search. Thus, ranking of web pages by their importance or authoritativeness is an important task.
A typical search engine has an interface where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the search terms, and returns the search results in the form of web pages in, for example, HTML. Each search result comprises a list of individual entries that have been identified by the search engine as satisfying the search expression. Each entry or “hit” comprises a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
An exemplary search engine is the Google® search engine. An important aspect of the Google® search engine is the ability to rank web pages according to the authority of the web pages with respect to a search query. One of the ranking techniques used by the Google® search engine is the PageRank algorithm. Reference is made to L. Page, et al., “The PageRank citation ranking: Bringing order to the web,” Technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-0120. The PageRank algorithm calculates a stationary distribution of a Markov chain induced by hyperlink connectivity on the WWW. This same technique used by the PageRank algorithm applies to other directed graphs where edges or links imply endorsement or trust.
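The stationary-distribution calculation described above can be illustrated with a short power-iteration sketch. This is a minimal illustration, not the implementation used by any particular search engine. The function name `pagerank`, the damping value of 0.85, the fixed iteration count, and the uniform redistribution of rank held by nodes without outlinks are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate a PageRank-style stationary distribution by power iteration.

    `links` maps each node to a list of the nodes it links to.
    The damping factor and iteration count are illustrative assumptions.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Assumption: rank held by nodes with no outlinks is spread uniformly.
        dangling_mass = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling_mass / n
                    for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            for t in targets:
                # Each outlink passes on an equal share of the source's rank.
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-node graph: "c" is endorsed by "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

In this toy graph, node "c" receives the most endorsements and ends up with the highest score, while node "d", which has no inlinks, keeps only the baseline share.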
In addition to scoring pages on the World Wide Web, the PageRank technique also applies to scoring nodes in other types of networks. Examples include the scoring of a patent by the scores of other patents that contain citations to it, the scoring of scientific literature that contains bibliographic citations, and the scoring of trust among individuals using the knowledge of trust relations between individuals.
Although the PageRank algorithm has proven to be useful, and is applicable to other information graphs as well as the Web, it would be desirable to present additional improvements. In many graphs, including the web, nodes may either have no outlinks or have outlinks that are not accessible to a ranking processor; these nodes are known as “dangling nodes”. A node may be dangling for a variety of reasons. For example, in the context of the web graph, the web page may not yet have been crawled. In other cases, the node may genuinely have no outlinks, etc.
A web page is further considered a dangling web page when protected by a robots.txt file. Use of a robots.txt file by a web page places the web page “off-limits” under a standard practice of crawling. However, such web pages may comprise high-quality information that is of great interest to readers and worthy of indexing.
In certain cases, particularly in the context of World Wide Web analysis, ranking of certain kinds of dangling nodes might be particularly important. For instance, some web pages become dangling when they are deleted from the web by their author. Paradoxically, there may be very good reasons to calculate a rank of a web page that was deleted (e.g., a significant document that was removed for political or legal reasons).
Even if a web page cannot be crawled, it can still be indexed using its anchor text. While anchor text is not a substitute for full text indexing, it has proved to be remarkably effective in satisfying most web search queries. Reference is made to N. Craswell, et al., “Effective site finding using link anchor information,” In Proc. of the 24th annual international ACM SIGIR conference on research and development in information retrieval, pages 250-257, New Orleans, La., USA, September 2001, Association for Computing Machinery; N. Eiron, et al., “Analysis of anchor text for web search,” In Proc. of 26th ACM SIGIR, pages 459-460, 2003; and R. Fagin, et al., “Searching the workplace web,” In Proc. 12th World Wide Web Conference, Budapest, 2003.
Another source of dangling nodes is nodes that have no outlinks. For example, in the Web graph, most PostScript and PDF files on the web contain no embedded outlinks, yet the content may be of relatively high quality. A URL may also be a dangling web page if it has a meta tag requesting that links not be followed from the web page. Further, a URL may be a dangling web page if it requires authentication (e.g., most of the Wall Street Journal site). Other reasons for dangling web pages include links to pages that return a 500-class or 400-class error response at crawl time, indicating that the web page is not available. A 400-class error response covers non-existent web pages, web pages requiring a password for access, etc. A 500-class error response covers configuration problems, load problems, etc. Furthermore, some links may point to servers that are not resolvable in DNS, experience routing problems, etc. In other information graphs, similar situations exist. In a citation graph for scientific literature, for example, some works may only cite other works that are outside the field being analyzed. In a trust network, some of the nodes may represent people for whom the list of people they trust is unavailable.
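The reasons listed above can be pictured as a simple classification over the outcome of a fetch attempt. The following is only a sketch under stated assumptions: the `CrawlResult` record, its field names, the label strings, and the order in which the checks are applied are all hypothetical and are not taken from any particular crawler.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CrawlResult:
    """Hypothetical record of one fetch attempt; field names are illustrative."""
    url: str
    status: Optional[int] = None   # HTTP status code, or None if no response
    dns_resolved: bool = True      # False if the host name could not be resolved
    nofollow: bool = False         # meta tag asks that links not be followed
    outlinks: List[str] = field(default_factory=list)


def dangling_reason(result: CrawlResult) -> Optional[str]:
    """Return a label for why the page is dangling, or None if it is not."""
    if not result.dns_resolved:
        return "unresolvable host"
    if result.status is None:
        return "no response"
    if result.status == 401:
        return "authentication required"
    if 400 <= result.status < 500:
        return "400-class error"
    if 500 <= result.status < 600:
        return "500-class error"
    if result.nofollow:
        return "links not to be followed"
    if not result.outlinks:
        return "no outlinks"
    return None


# Hypothetical usage: a PDF fetched successfully but containing no links.
pdf = CrawlResult(url="http://example.com/paper.pdf", status=200)
reason = dangling_reason(pdf)
```

Keeping the reason as an explicit label, rather than a single dangling flag, makes it possible to treat the different kinds of dangling nodes differently later on.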
Conventional graph ranking techniques treat all types of dangling nodes identically. Conventional graph ranking techniques remove dangling nodes from the graph before calculating the ranking and then add the dangling nodes back into the graph ranking analysis. Reference is made to L. Page, et al., “The PageRank citation ranking: Bringing order to the web,” Technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-0120 (version of Nov. 11, 1999); and S. Kamvar, et al., “Exploiting the block structure of the web for computing pagerank,” Technical report, Stanford University, 2003.
Another conventional graph ranking technique removes the dangling nodes entirely. Reference is made to S. Brin, et al., “What can you do with a web in your pocket?,” Data Engineering Bulletin, 21:37-47, 1998; and T. Haveliwala, “Efficient computation of pagerank,” Technical report, Stanford University, 1999. Removing the dangling nodes entirely skews the results on the non-dangling nodes somewhat since the outdegrees from the non-dangling nodes are adjusted to reflect the lack of links to dangling nodes.
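The skew described above can be seen in a small example. The sketch below, with the hypothetical helper `remove_dangling` and a made-up four-node graph, performs a single pruning pass and compares the share of rank that one link carries before and after. It illustrates the effect only and is not the procedure of any cited work. A single pass may leave new dangling nodes behind, which the sketch does not handle.

```python
def remove_dangling(links):
    """Drop nodes with no outlinks, and all links to them, in a single pass.

    `links` maps each node to a list of the nodes it links to.
    Returns the pruned graph and the set of removed nodes.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    dangling = {v for v in nodes if not links.get(v)}
    pruned = {
        v: [t for t in targets if t not in dangling]
        for v, targets in links.items()
        if v not in dangling
    }
    return pruned, dangling


# Hypothetical graph: "a" links to "b" and to two dangling nodes "x" and "y".
graph = {"a": ["b", "x", "y"], "b": ["a"]}
pruned, removed = remove_dangling(graph)

# Share of a's rank passed along the link a -> b, before and after pruning.
share_before = 1.0 / len(graph["a"])
share_after = 1.0 / len(pruned["a"])
```

Before pruning, node "a" splits its rank over three outlinks, so "b" receives one third of it. After pruning, "a" has a single outlink and "b" receives all of it, which is the adjustment of outdegrees that distorts the scores of the non-dangling nodes.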
Conventional approaches to ranking with dangling nodes do not account for the various types of dangling nodes. Further, the effect of the dangling nodes is not propagated to other parts of the graph, neglecting the effect of the dangling nodes on the rankings of non-dangling nodes. Moreover, if a decision needs to be made on how to further explore a partial graph with limited resources, such as in the case of a web crawler that needs to decide which links to dangling web pages should be followed first in its crawl strategy, it becomes important to assign ranks to dangling web pages to efficiently manage crawling resources. Reference is made to S. Abiteboul, et al., “Adaptive on-line web page importance computation,” In Proc. 12th World Wide Web Conference, pages 280-290, 2003; and J. Cho, et al., “Efficient crawling through url ordering,” In Proc. of 7th World Wide Web Conference, 1998.
In certain information network graphs it appears that there is a growing trend toward “node rot”, where certain nodes that used to be valuable turn out to later reflect negatively on nodes linking to them. For example, in a graph for patent citations, when a patent is invalidated, the ranking of patents citing it should probably be degraded. In the case of the Web graph, several studies have forecast the half-life of a URL at between four and five years. When a page is deleted, links that point to it will become “broken”. The existence of broken links on a page may be taken to be an indication of low standards on the part of its author. Reference is made to J. Markwell, et al., “Link rot limits the usefulness of web-based educational materials in biochemistry and molecular biology,” Biochem. Mol. Biol. Educ., 31:69-72, 2003; and D. Spinellis, “The decay and failures of web references,” Comm. ACM, 46(1): 71-77, 2003. In one crawl of over a billion web pages, approximately 6% of all web pages returned a 404 HTTP error code. These presumably reflect a fraction of web pages that are no longer maintained or were poorly authored. As time passes, this problem will only worsen as an increasing fraction of web pages on the web fall into disrepair.
Dangling nodes may also appear in other types of networks besides the World Wide Web. Examples include scientific papers that cite no other papers, patents that cite no other patents, or people who trust nobody else. In addition, other types of networks may have nodes that qualify as penalty nodes. An example is provided in trust networks between individuals, in which a person may have expressed trust in a person who is convicted of a crime. In this case, the person who trusted the criminal may themselves have their own trust level decreased in acknowledgment of their ill-advised trust in the criminal. Another example is provided in scientific literature that cites a paper that is later discredited for some reason (e.g., fraudulent data or improper experimental methodology). In this case, the discredited literature becomes a penalty node, and papers that cite it for their own evidence may have their score decreased accordingly.
What is therefore needed is a system, a computer program product, and associated methods for: (a) ranking dangling nodes in a graph; and (b) adjusting the rank of “penalty pages”. The need for such a solution has heretofore remained unsatisfied.