The World Wide Web (WWW) comprises an expansive network of interconnected computers on which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, and the ease with which material may be published on the WWW, the quality and relevance of web pages vary greatly. These features of the WWW make ranking of web pages by their authoritativeness or relevance an important task. Ranking is often integrated with WWW search engines. These search engines use various means, including the ranks of the web pages, to determine the relevance of web pages to a user-defined search.
The authors of web pages provide information known as metadata within the body of the document that defines the web pages. This document is typically written in a markup language such as hypertext markup language (HTML). A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links (hyperlinks) from page to page.
The crawler indexes the pages for use by the search engines using information about a web page as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempt to find matches in real time.
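The crawl-then-search arrangement described above can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory `TOY_WEB` dictionary, the `example` URLs, and the `crawl` and `search` helpers are hypothetical stand-ins, not part of any actual crawler or search engine.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (metadata, outgoing hyperlinks).
TOY_WEB = {
    "http://a.example/": ({"title": "A"}, ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ({"title": "B"}, ["http://c.example/"]),
    "http://c.example/": ({"title": "C"}, ["http://a.example/"]),
}


def crawl(seed, web):
    """Follow hyperlinks breadth-first from seed and build a metadata repository."""
    repository = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in repository or url not in web:
            continue  # already indexed, or page not reachable in the toy web
        metadata, links = web[url]
        repository[url] = {"metadata": metadata, "links": list(links)}
        queue.extend(links)
    return repository


def search(repository, title):
    """Answer a query from the stored repository rather than from the live web."""
    return sorted(
        url for url, record in repository.items()
        if record["metadata"].get("title") == title
    )


repository = crawl("http://a.example/", TOY_WEB)
matches = search(repository, "B")
```

Re-running `crawl` periodically would refresh stored entries and pick up newly created pages, while `search` only ever consults the stored repository.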
Internet search engines provide a primary interface between an Internet user and the web pages or web sites accessible through the Internet. Consequently, Internet companies are expending resources to improve the accuracy and response time of search results in order to attract more Internet users to their web sites. Higher Internet traffic on the web site of an Internet company typically increases revenue for that company through, for example, increased sales at that web site or greater exposure of Internet users to advertisements on the web site.
An exemplary search engine is the Google® search engine. An important aspect of the Google® search engine is the ability to rank web pages according to the authority of the web pages with respect to a search. One of the ranking techniques used by the Google® search engine is the PageRank algorithm. Reference is made to Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, “The PageRank citation ranking: Bringing order to the web,” Technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-0120 (version of Nov. 11, 1999). The PageRank algorithm calculates a stationary distribution of a Markov chain induced by hyperlink connectivity on the WWW and uses that distribution to rank all web pages. The same technique applies to intranets or subsets of the WWW.
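The stationary distribution mentioned above can be approximated by repeated passes over the link graph (power iteration). The following is a minimal sketch, not the implementation used by any search engine: the `damping` factor of 0.85, the uniform redistribution of rank from pages without outgoing links, and the three-page `links` graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=100):
    """Approximate the stationary distribution of the link-induced Markov chain.

    links maps each page to the list of pages it links to. The damping factor
    and the handling of dangling pages are illustrative assumptions.
    """
    pages = sorted(set(links) | {dst for dsts in links.values() for dst in dsts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            outs = links.get(src, [])
            for dst in outs:
                new_rank[dst] += damping * rank[src] / len(outs)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break  # successive passes agree closely enough
    return rank


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In this toy graph, page `b` receives only half of the rank of `a`, so it ends up ranked below both `a` and `c`.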
Although the PageRank algorithm has proven to be useful, it would be desirable to present additional improvements. The calculations performed by the PageRank algorithm require large amounts of data and large amounts of processing time. The WWW is growing rapidly; consequently, the computations performed by the PageRank algorithm are becoming increasingly difficult. In addition, web sites are increasingly using a variety of techniques to manipulate their ranking in order to generate user traffic on the web site, increase sales through commercial web sites, and increase advertising revenue.
Further, the use of templatized hyperlinks on web sites is increasing rapidly. Templatized web pages share a common administrative authority, a common look, and a common feel. For a user, the common look and feel is valuable because it provides context for browsing. However, templatized pages skew ranking. Since all pages that conform to a common template share many links, it is clear that these links cannot be relevant to the specific content on these pages.
Currently, the Google® search engine indexes about 3.3 billion web pages with nearly 90 billion hyperlinks. Representing these hyperlinks as source and destination URLs amounts to approximately ten terabytes of data. The hyperlinks are viewed as a link graph by the PageRank algorithm. In most implementations of search engines, a typical hyperlink is represented by a four-byte ID. Use of the four-byte ID reduces the amount of data required to represent the link graph to 360 gigabytes at the cost of considerable processing time in replacing the URL with the corresponding four-byte ID. Once the four-byte ID has been determined for the URL, the PageRank algorithm calculates a stationary distribution of a Markov chain, requiring approximately 30 to 50 cycles through the data set of the link graph to achieve a reasonable level of convergence.
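The figures above can be checked with simple arithmetic, and the URL-to-ID replacement can be sketched as a dictionary lookup. This is an illustration only: the `assign_ids` helper and the `example` URLs are hypothetical, and the storage estimate assumes one four-byte ID is stored per hyperlink, as the 360 gigabyte figure implies.

```python
def assign_ids(edges):
    """Map each distinct URL to a small integer ID and re-encode the edges."""
    url_to_id = {}
    encoded = []
    for src, dst in edges:
        # setdefault returns the existing ID or assigns the next unused one.
        src_id = url_to_id.setdefault(src, len(url_to_id))
        dst_id = url_to_id.setdefault(dst, len(url_to_id))
        encoded.append((src_id, dst_id))
    return url_to_id, encoded


# Hypothetical hyperlinks as (source URL, destination URL) pairs.
edges = [
    ("http://a.example/", "http://b.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://a.example/"),
]
url_to_id, encoded = assign_ids(edges)

# Figures quoted in the text.
NUM_PAGES = 3_300_000_000
NUM_LINKS = 90_000_000_000
ID_BYTES = 4
URL_GRAPH_BYTES = 10 * 10**12  # approximately ten terabytes

fits_in_four_bytes = NUM_PAGES < 2**32           # 2**32 is about 4.29 billion
id_graph_bytes = NUM_LINKS * ID_BYTES            # 360 * 10**9 bytes
url_bytes_per_link = URL_GRAPH_BYTES / NUM_LINKS # roughly 111 bytes per hyperlink
```

Each of the 30 to 50 convergence cycles reads the whole encoded link graph once, so the total data scanned is on the order of 30 to 50 times `id_graph_bytes`.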
Furthermore, because the PageRank algorithm analyzes each individual URL, it is susceptible to deliberate ranking manipulation of web pages. One example of ranking manipulation is link spamming. One method of link spamming involves posting messages on message boards, guest books, etc., with links to a web site. These additional links increase the ranking of the web site. Another method of link spamming involves forming or joining a “link farm”. A link farm is a network of web pages or web sites that are heavily cross-linked. When joining a link farm, a web site receives a link from all the other web sites in the link farm and, in return, places links to all the other web sites in the link farm. However, the reputation and popularity of search engines such as the Google® search engine rely on an accurate ranking of the web sites in response to a search.
One technique proposed for improving the ranking of web pages involves the use of a host rank that groups web pages based on the host of the web page. Although the host rank technique has proven to be useful, it would be desirable to present additional improvements. Many hosts comprise web pages that are fairly uniform in content and in quality. However, a host such as www.geocities.com that provides free web space to users comprises widely varying content both in topic and quality. Some of the subsites on www.geocities.com comprise, for example, very high quality open source software projects. These highly respected subsites have many links into them. Other subsites on www.geocities.com comprise personal information about users, their hobbies, etc. The range of topic and quality of the subsites on www.geocities.com requires a finer granularity than the host rank for analysis and grouping. Reference is made to “Ranking the Web Frontier”; Arvind Arasu, Jasmine Novak, Andrew Tomkins & John Tomlin, “PageRank Computation and the Structure of the Web: Experiments and Algorithms,” Proceedings of WWW2002, May 2002; and co-pending U.S. patent application titled “System and Method for Rapid Computation of PageRank”, Ser. No. 10/132,047, by A. Arasu, Andrew Tomkins and John Tomlin, which was filed on Apr. 25, 2002.
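The grouping step of a host rank can be sketched as follows, which also shows why host-level granularity is coarse. The `group_by_host` helper and all paths in `urls` are hypothetical and chosen only for illustration; this is not the host rank technique of the cited references.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by host name, the granularity used by a host rank."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)
    return dict(groups)


# Hypothetical URLs; the subsite paths are invented for this example.
urls = [
    "http://www.geocities.com/open-source-project/index.html",
    "http://www.geocities.com/personal-hobbies/index.html",
    "http://www.example.org/about.html",
]
groups = group_by_host(urls)
```

Here the open source project and the personal hobby page land in the same `www.geocities.com` group, so a rank computed per host cannot distinguish between them.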
What is therefore needed is a system, a computer program product, and an associated method for improving the efficiency of ranking web pages while minimizing manipulation of the ranking process by web sites and Internet companies. The need for such a solution has heretofore remained unsatisfied.