The World Wide Web (Web) is a rapidly growing part of the Internet. One group estimates that, as of the beginning of 2000, the Web grows by more than seven million web pages each day, adding to an already enormous body of information. Because of the Web's rapid growth and lack of central organization, however, millions of users cannot find specific information efficiently. Over the last decade, Internet search engines, such as the BECOME.com search engine, have become some of the most important means of information retrieval on the Internet, indexing billions of web pages. As search engines increase their coverage, however, they exacerbate an existing problem. Search engines return all documents meeting the search criteria, which can overwhelm a searcher with millions of irrelevant documents. Once search results arrive, the searcher must review them one document at a time to find the relevant ones. Even if the searcher could download many documents, average searchers are not always willing to review more than the first page of the search results. Therefore, it is crucially important to present the most relevant documents to searchers at the top of the list (e.g., in the first ten results).
Because millions of documents may outwardly match the search criteria, the major search engines use a ranking algorithm that ranks highly those documents having certain keywords in certain locations, such as the title, the meta-tags, or the beginning of the document. This does not, however, typically put the most relevant document at the top of the list, much less assess the importance of the document relative to other documents.
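As an illustration of the location-based ranking described above, the following is a minimal sketch in Python. The field weights, the `lead_words` cutoff, the function name `keyword_score`, and the two sample documents are all hypothetical; the text names the locations but specifies no weights or scoring formula. The sketch only shows that such a score rewards where and how often a keyword occurs, not whether the document is actually relevant.

```python
import re

# Hypothetical weights: the text names these locations but gives no numbers.
FIELD_WEIGHTS = {"title": 3.0, "meta": 2.0, "lead": 1.5, "body": 1.0}


def keyword_score(doc, query, lead_words=50):
    """Score a document by counting query terms, weighted by where they occur."""
    terms = set(query.lower().split())

    def count(text):
        # Number of word tokens in `text` that match a query term.
        return sum(1 for w in re.findall(r"[a-z0-9]+", text.lower()) if w in terms)

    body_words = doc.get("body", "").split()
    lead = " ".join(body_words[:lead_words])   # beginning of the document
    rest = " ".join(body_words[lead_words:])   # remainder of the body
    return (
        FIELD_WEIGHTS["title"] * count(doc.get("title", ""))
        + FIELD_WEIGHTS["meta"] * count(doc.get("meta", ""))
        + FIELD_WEIGHTS["lead"] * count(lead)
        + FIELD_WEIGHTS["body"] * count(rest)
    )


# A genuinely relevant page and a page that merely repeats the keywords.
honest = {
    "title": "Guide to cheap flights",
    "meta": "travel, flights",
    "body": "How to find cheap flights by comparing fares across airlines.",
}
stuffed = {
    "title": "cheap flights cheap flights cheap flights",
    "meta": "cheap flights " * 20,
    "body": "Buy our unrelated product.",
}

honest_score = keyword_score(honest, "cheap flights")
stuffed_score = keyword_score(stuffed, "cheap flights")
print(honest_score, stuffed_score)
```

Under these assumed weights, the page that only repeats the query terms in its title and meta-tags scores higher than the page that actually discusses the topic.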
Moreover, relying solely on the content of the document itself (including the meta-tags, which do not appear when the document is displayed) to rank the document can be a major problem for the search engine. A web author can repeat “hot” keywords many times (e.g., in the title or meta-tags), a practice called spamming, to artificially inflate the relevance of a given document. Therefore, most Internet search engines in operation today use some variation of link structure analysis. The PageRank algorithm used by Google, for example, has proven to be an effective measure against conventional keyword-based spamming techniques. Recently, however, even PageRank has been found to be susceptible to a new generation of more sophisticated spamming techniques that manipulate the link structure of the Web. Over the years, webmasters and so-called “search engine optimization engineers” have learned how PageRank works and have figured out ways to manipulate it. One such technique is called “Google bombing” and has given Google many cases of unwanted publicity.
Another lesser-known, yet potentially more damaging, technique is called an “artificial Web”. With a moderate investment, spammers can purchase a few IP addresses and a large amount of disk storage. The spammers can easily write scripts to generate millions or even billions of simple web pages that contain links to a few websites to be promoted. Because the number of these artificial web pages can be comparable to that of a major portion of the real Web, the spammers can wield undue influence in manipulating the link structure of the entire Web, thereby affecting the computation of PageRank.
Vulnerability to the artificial Web reveals a fundamental limitation of conventional link analysis algorithms such as PageRank. One of the main reasons for this shortcoming is that these methods count all documents equally. The homepage of Yahoo.com is counted as one document, just as the homepage of an obscure website maintained by a fourth-grader is. This makes it possible for an artificial Web to siphon off a substantial share of the weighting factor from the real Web.
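The effect of counting all documents equally can be sketched with a simplified, textbook-style PageRank computed by power iteration. This is an illustrative approximation, not the algorithm as any search engine deploys it. The damping factor of 0.85, the four-page "real" Web, the 50 generated farm pages, and all page names are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to. Every page
    receives the same baseline share, i.e. all documents count equally.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical "real" Web: three interlinked pages plus one page
# ("promoted") that no other page links to.
real_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "promoted": ["a"],
}

# Hypothetical artificial Web: script-generated pages whose only link
# points at the page to be promoted.
farm = {f"farm{i}": ["promoted"] for i in range(50)}

before = pagerank(real_web)
after = pagerank({**real_web, **farm})

real_pages = ["a", "b", "c"]
print("promoted:", before["promoted"], "->", after["promoted"])
print("real pages:", sum(before[p] for p in real_pages), "->",
      sum(after[p] for p in real_pages))
```

Because each generated page receives the same baseline share as any genuine page, the farm as a whole holds a sizeable fraction of the total score and passes most of it to the promoted page, while the combined score of the genuine pages drops.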
It is therefore desirable to provide a method for assigning quality scores to web pages relative to one another that is not susceptible to these kinds of highly sophisticated spamming techniques.