It has become common for users of host computers connected to the World Wide Web (the "Web") to employ Web browsers and search engines to locate Web pages having specific content of interest to users. A search engine, such as Digital Equipment Corporation's AltaVista search engine, indexes hundreds of millions of Web pages maintained by computers all over the world. The users of the hosts compose queries, and the search engine identifies pages that match the queries, e.g., pages that include key words of the queries. These pages are known as a result set.
In many cases, particularly when a query is short or not well defined, the result set can be quite large, for example, thousands of pages. The pages in the result set may or may not satisfy the user's actual information needs. For this reason, most search engines rank order the result set, and only a small number, for example, twenty, of the highest ranking pages are actually returned. Therefore, the quality of search engines can be evaluated not only by the number of pages that are indexed, but also by the usefulness of the ranking process that determines the order in which pages are returned. A good ranking process will return relevant pages before pages that are less relevant.
Sampling of search engine operations has shown that most queries tend to be quite short, on the average about 1.5 words. Therefore, there is usually not enough information in the query itself to rank the pages. Furthermore, there may be pages that are very relevant to the search that do not include any of the key words specified in the query. This makes good ranking difficult.
In Information Retrieval (IR), some ranking approaches have used feedback by the users. This requires the users to supply relevance information for some of the results that were returned by the search to iteratively improve ranking. However, studies have shown that users are generally reluctant to provide relevance feedback. In addition, the database environment of the Web is quite different from the setting of conventional information retrieval systems. The main reasons are: users tend to use very short queries; the collection of pages is changing continuously; and processing all pages in the World Wide Web corpus is practically not feasible.
In one prior art technique, an algorithm for connectivity analysis of a neighborhood graph (n-graph) is described by Kleinberg in "Authoritative Sources in a Hyperlinked Environment," Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, and also in IBM Research Report RJ 10076, May 1997, see, "http:/www.cs.cornell.edu/Info/People/kleinber/auth.ps." The algorithm analyzes the link structure, or connectivity, of Web pages "in the vicinity" of the result set to suggest useful pages in the context of the search that was performed.
The vicinity of a Web page is defined by the hyperlinks that connect the page to others. A Web page can point to other pages, and the page can be pointed to by other pages. Close pages are directly linked, farther pages are indirectly linked. This connectivity can be expressed as a graph where nodes represent the pages, and the directed edges represent the links. The vicinity of all the pages in the result set combined is called the neighborhood graph.
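The graph described above can be illustrated with a minimal sketch. The function name neighborhood_graph, the out_links dictionary, and the choice to limit the vicinity to pages one link away from the result set are assumptions made for illustration only; the text does not fix how far the vicinity extends.

```python
from collections import defaultdict


def neighborhood_graph(result_set, out_links):
    """Build a directed graph around a result set.

    result_set: iterable of page identifiers returned by the search.
    out_links:  dict mapping a page to the pages it points to
                (a hypothetical, pre-collected link table).
    Returns (nodes, edges), where each edge is a (source, target) pair.
    """
    # Reverse index: page -> set of pages that point to it.
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)

    # Start from the result set and add directly linked pages
    # (assumption: a vicinity radius of one link in either direction).
    nodes = set(result_set)
    for page in result_set:
        nodes.update(out_links.get(page, ()))
        nodes.update(in_links.get(page, ()))

    # Keep only links whose endpoints both lie in the neighborhood,
    # ignoring links from a page to itself.
    edges = {
        (src, dst)
        for src in nodes
        for dst in out_links.get(src, ())
        if dst in nodes and src != dst
    }
    return nodes, edges


if __name__ == "__main__":
    links = {"a": ["b"], "c": ["a"], "d": ["e"]}
    nodes, edges = neighborhood_graph(["a"], links)
    print(sorted(nodes))
    print(sorted(edges))
```

In this sketch, pages "d" and "e" are left out because neither is linked to a page in the result set.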
Specifically, the algorithm attempts to identify "hub" and "authority" pages in the neighborhood graph for a user query. Hubs and authorities exhibit a mutually reinforcing relationship; a good hub page is one that points to many good authority pages, and a good authority page is pointed to by many good hubs. Kleinberg's algorithm constructs a graph for a specified base set of hyperlinked pages. Using an iterative algorithm, an authority weight x and a hub weight y are assigned to each page. When the algorithm converges, these weights are used to rank the pages as authorities and hubs.
When a page points to many other pages with large x values, the page receives a large y value and is designated as a good hub. When a page is pointed to by many pages having large y values, the page receives a large x value and is designated as a good authority.
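The mutually reinforcing update of the x and y weights can be sketched as follows. This is an illustrative sketch rather than the cited implementation: the function names, the fixed iteration count used in place of a convergence test, and the normalization of each weight vector to unit length are assumptions not stated in the text.

```python
import math


def _normalize(weights):
    """Scale weights so their squares sum to 1 (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {page: v / norm for page, v in weights.items()}


def hub_authority_weights(nodes, edges, iterations=50):
    """Compute authority weights x and hub weights y for a directed graph.

    nodes: iterable of page identifiers.
    edges: iterable of (source, target) pairs, one per hyperlink.
    iterations: fixed number of rounds (assumption; a convergence
                test could be used instead).
    """
    nodes = list(nodes)
    edges = list(edges)
    x = {page: 1.0 for page in nodes}  # authority weights
    y = {page: 1.0 for page in nodes}  # hub weights

    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages pointing to it.
        new_x = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_x[dst] += y[src]

        # Hub weight: sum of the authority weights of pages it points to.
        new_y = {page: 0.0 for page in nodes}
        for src, dst in edges:
            new_y[src] += new_x[dst]

        x = _normalize(new_x)
        y = _normalize(new_y)

    return x, y


if __name__ == "__main__":
    pages = ["h1", "h2", "a", "b"]
    links = [("h1", "a"), ("h2", "a"), ("h1", "b")]
    x, y = hub_authority_weights(pages, links)
    print("authorities:", sorted(pages, key=x.get, reverse=True))
    print("hubs:", sorted(pages, key=y.get, reverse=True))
```

In this small example, page "a" is pointed to by both hubs and so receives the largest x value, while "h1" points to both authorities and receives the largest y value. Note that the computation uses only the edges, which is why the problems described next arise.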
However, there are some problems with Kleinberg's algorithm, which are due to the fact that the analysis is strictly based on connectivity. First, there is a problem of topic drift. For example, if a user composes a query that includes the key words "jaguar" and "car," then the graph will tend to have more pages that mention "car" than "jaguar." These self-reinforcing pages will tend to overwhelm pages mentioning "jaguar," causing topic drift.
Second, it is possible to have multiple "parallel" edges from pages stored by a single host to the same authority or hub page. This occurs when a single Web site stores multiple copies or versions of pages having essentially the same content. In this case, the single site has undue influence, hence, the authority or hub scores may not be representative.
Third, many Web pages are generated by Web authoring or database conversion tools. Frequently, these tools will automatically insert hyperlinks. For example, the Hypernews system, which turns USENET News articles into Web pages, automatically inserts links to the Hypernews Web site. Because such links are inserted mechanically rather than chosen by an author, they can distort the connectivity analysis.
In U.S. patent application Ser. No. 09/007,635, "Method for Ranking Pages Using Connectivity and Content Analysis," filed by Bharat et al. on Jan. 15, 1998, a method is described which examines both the connectivity and the content of pages to identify useful pages. However, the method is slow because all pages in the neighborhood graph are fetched in order to determine their relevance to the query topic. This is necessary to reduce the effect of non-relevant pages in the subsequent connectivity analysis phase.
Therefore, there is a need to reduce the effect of unrelated pages on the computation in a manner that does not require fetching all pages in the neighborhood graph. If a small, carefully selected subset of pages can be identified for topic distillation, then meaningful ranking results can be presented to users in a more timely manner.