The World Wide Web (the "Web") is an Internet client-server hypertext distributed information retrieval system. Hypertext objects, for example documents, menus, and indices, are represented in hypertext mark-up language (HTML). Hypertext links refer to other documents by their uniform resource locator (URL). These links may refer to local or remote resources that may be accessible via protocols such as HTTP, FTP, GOPHER, TELNET or news. These resources include blocks of data called web pages which may be stored on a server as a file written in HTML. Web pages are typically identified by a URL. A user may view the contents of web pages and may navigate from one page to another through the use of a web browser. A web browser is a program which allows a person to read hypertext. Examples of web browsers include Microsoft Internet Explorer and Netscape Navigator. "Microsoft", the Microsoft logo, "Internet Explorer" and all associated logos are trademarks or registered trademarks of Microsoft Corporation in the United States and other countries. "Netscape", the Netscape logo, "Navigator" and all associated logos are trademarks or registered trademarks of Netscape Communications Corporation in the United States and other countries.
A user may locate web pages having specific contents of interest through the use of a search engine. A search engine is a program that allows a user to perform key word searches for information on the Internet. Digital Equipment Corporation's AltaVista search engine (http://www.altavista.com), for example, indexes hundreds of millions of web pages maintained by computers all over the world. "Digital Equipment Corporation", the Digital Equipment Corporation logo, "AltaVista" and all associated logos are trademarks or registered trademarks of Digital Equipment Corporation in the United States and other countries.
Users provide queries to the search engine and the search engine identifies pages that match the queries, for example pages that include key words contained in the queries. The search engine may take the queries as input and provide a set of web pages in response to the user's queries. This set of pages is known as a result set.
Search engines that use keyword matching perform quite well with well-specified queries consisting of many query terms. Queries with a small result set are called narrow queries. A small result set created from a narrow query is often specific enough to satisfy the user's information needs. However, even though users may have very specific information needs, users often pose queries with only a few query terms. Queries with a large result set are called broad queries. These often arise from queries that contain very few query terms. A large result set created from a broad query may not be specific enough to satisfy the user's information needs.
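The difference between broad and narrow queries can be illustrated with a minimal sketch of keyword matching. The function name keyword_search, the sample pages, and the rule that a page must contain every query term are assumptions made for illustration only; they do not describe how any particular search engine works.

```python
def keyword_search(pages, query):
    """Return the URLs of pages that contain every term in the query.

    Assumption: a page matches only if all query terms appear as whole
    words in its text. Real search engines use more elaborate matching.
    """
    terms = query.lower().split()
    result_set = []
    for url, text in pages.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            result_set.append(url)
    return result_set


# Hypothetical pages, used only to show the effect of adding query terms.
pages = {
    "http://example.com/jaguar-cars": "jaguar cars and sports car dealers",
    "http://example.com/jaguar-cats": "the jaguar is a large cat",
    "http://example.com/gardening": "tips for growing tomatoes",
}

broad = keyword_search(pages, "jaguar")       # one term, larger result set
narrow = keyword_search(pages, "jaguar cat")  # two terms, smaller result set
print(len(broad), len(narrow))
```

In this sketch the one-term query matches both pages that mention "jaguar", while adding a second term narrows the result set to the single page about the animal.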
Various techniques have been developed to enlarge an initial result set to produce another result set that more closely matches the user's needs. These techniques include providing relevance feedback, analyzing the content of the pages, and analyzing the connectivity of the pages.
The Excite search engine (http://www.excite.com) is an example of a search engine that uses relevance feedback to identify a more relevant result set. "Excite" and the Excite logo are trademarks or registered trademarks of Excite, Inc. in the United States and other countries.
Users of the Excite search engine form an initial query using standard query syntax. The purpose of the query is to specify a topic of interest that may be used to determine a result set containing relevant pages. The user may examine the result set and provide relevance feedback if the result set does not satisfy the user's needs. To provide relevance feedback to the Excite search engine, the user employs the "Find Similar" option to locate a new result set containing pages that are more relevant than the pages found in the initial result set.
The Kleinberg algorithm is an example of a technique that uses connectivity analysis to identify a more relevant result set. Connectivity refers to the link structure that connects pages in a hyperlinked environment such as the Internet. See "Authoritative Sources in a Hyperlinked Environment," Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. See also IBM Research Report RJ 10076, May 1997, and "http://simon.cs.cornell.edu/home/kleinber/auth.ps".
The Kleinberg algorithm analyzes the link structure, or connectivity, of Web pages in the vicinity of a result set to suggest useful pages in the context of a search that was performed. The vicinity of a Web page is defined by the hyperlinks that connect one page to other pages. The set of all links to all pages on the Internet is called the web graph. Each page may have hyperlinks pointing to other pages, and each page may be pointed to by the hyperlinks of other pages. Pages may be directly linked to other pages or indirectly linked via intermediate pages. Pages that are directly linked are considered to be close pages and pages that are linked via numerous intermediate pages are considered to be distant pages. These links define the connectivity of the pages and may be expressed as a graph where the pages are represented as nodes and the links are represented as directed edges. The number of links required to move from one page to another page provides an indication of how related or unrelated the pages are to each other. Pages that are close to each other tend to contain related topics. Pages that are distant from each other tend to contain unrelated topics.
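The notion of link distance described above can be sketched as a breadth-first search over a directed graph. The adjacency-list representation, the function name link_distance, the sample page names, and the choice to follow links only in their forward direction are all assumptions made for this illustration.

```python
from collections import deque


def link_distance(links, source, target):
    """Return the fewest links to follow from source to target.

    `links` maps each page to the list of pages it points to (directed
    edges). Returns None if target cannot be reached from source.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical chain of pages: a -> b -> c -> d
links = {
    "a.html": ["b.html"],
    "b.html": ["c.html"],
    "c.html": ["d.html"],
    "d.html": [],
}

close = link_distance(links, "a.html", "b.html")    # directly linked
distant = link_distance(links, "a.html", "d.html")  # via intermediate pages
print(close, distant)
```

Under the heuristic in the text, a.html and b.html would be expected to be more closely related than a.html and d.html, because fewer links separate them.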
Connectivity information is useful for increasing the size of the result set. Pages that are located a predetermined distance, usually within a few links, from an initial result set often provide results that are relevant to the input query. The enlarged result set is called a neighborhood graph. A neighborhood graph is a subset of the web graph. A neighborhood graph expresses the connectivity of pages that are located a predetermined distance away from a particular page or a result set. The neighborhood graph includes a node for each page and an edge for each link between pages, up to the predetermined distance.
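One possible way to build such a neighborhood graph is sketched below. It assumes that the enlargement follows links in both directions (pages pointed to by the result set and pages pointing to it); the function name neighborhood_graph, the max_distance parameter, and the sample page names are hypothetical.

```python
def neighborhood_graph(links, result_set, max_distance=1):
    """Return (nodes, edges) for pages within max_distance links of result_set.

    `links` maps each page to the pages it points to. Assumption: both
    outgoing and incoming links count toward the distance.
    """
    # Reverse index: which pages point to a given page.
    incoming = {}
    for src, targets in links.items():
        for dst in targets:
            incoming.setdefault(dst, set()).add(src)

    nodes = set(result_set)
    frontier = set(result_set)
    for _ in range(max_distance):
        reached = set()
        for page in frontier:
            reached.update(links.get(page, ()))     # pages this page points to
            reached.update(incoming.get(page, ()))  # pages pointing to this page
        frontier = reached - nodes
        nodes |= frontier

    # Keep only the links whose two endpoints are both in the neighborhood.
    edges = {
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if src in nodes and dst in nodes
    }
    return nodes, edges


# Hypothetical web graph fragment.
links = {
    "hub": ["r1", "r2"],
    "r1": ["auth"],
    "r2": ["auth"],
    "auth": ["far"],
    "far": ["farther"],
}

nodes, edges = neighborhood_graph(links, {"r1"}, max_distance=1)
print(sorted(nodes), sorted(edges))
```

With a distance of one link, the initial result set containing only r1 grows to include the page that points to it (hub) and the page it points to (auth), while pages farther away are left out.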
The Kleinberg algorithm uses the neighborhood graph to assign hub and authority weights to each page in response to a user query. A page with a large authority weight is called a good authority page and a page with a large hub weight is called a good hub page. Hub and authority pages have a mutually reinforcing relationship. A good hub page points to many good authority pages and a good authority page is pointed to by many good hub pages. The weights of the hubs and authorities provide information that is useful for ranking the nodes in the neighborhood graph. Usually the nodes are ranked according to their relevance to the input query or topic.
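The mutually reinforcing relationship can be sketched as an iterative computation in the spirit of the Kleinberg algorithm. This is a simplified illustration, not the published algorithm in full: the fixed iteration count, the normalization to unit length, the function name hub_authority_weights, and the sample graph are assumptions.

```python
import math


def hub_authority_weights(nodes, edges, iterations=50):
    """Return (hub, auth) weight dictionaries for a directed graph.

    `edges` is a collection of (source, destination) pairs. Assumption:
    a fixed number of iterations stands in for a convergence test.
    """
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub weight: sum of authority weights of the pages it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize each weight vector to unit length to keep values bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for n in weights:
                weights[n] /= norm
    return hub, auth


# Hypothetical neighborhood graph: h1 points to a1 and a2; h2 points to a1.
nodes = {"h1", "h2", "a1", "a2"}
edges = {("h1", "a1"), ("h1", "a2"), ("h2", "a1")}

hub, auth = hub_authority_weights(nodes, edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print(best_hub, best_authority)
```

In this small graph a1 receives the largest authority weight because both hubs point to it, and h1 receives the largest hub weight because it points to both authorities.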
One problem with using the Kleinberg algorithm is topic drift. When a user wants to find web pages related to a particular topic, the user enters a query representing that topic into a search engine. The search engine finds a result set containing a list of web pages relating to that topic. Using an algorithm like Kleinberg's algorithm, this result set is expanded to include other pages that are at a predetermined distance from the pages in the original result set. However, the content of these new pages might not be on the same topic as the original query. If pages that are not on the topic of the original query are ranked highly, then this is called "topic drift."
Topic drift may occur when connectivity information is used to enlarge an initial result set with other pages that are reachable within a few links of that result set, because pages that are one or two links away do not always match the given query. Topic drift also may occur as a result of the existence of many mutually reinforcing pages in the result set, for example if the hub and authority pages point to each other.
Thus, a need exists for a method of preventing topic drift in hyperlinked environments when an initial result set is enlarged to include pages that may better match a given user query.