1. Field of the Invention
This invention relates generally to computerized information retrieval, and more particularly to retrieving, indexing, and ranking of documents in a hyperlinked information environment such as the World Wide Web (the “Web”).
2. Description of the Related Art
The amount of information stored in the Web continues to increase. This makes it more difficult for users to find pages relevant to concepts of interest. Users of computers connected to the Web commonly employ search engines to locate Web pages having specific content. A search engine, such as the AltaVista® search engine, indexes hundreds of millions of Web pages hosted and served by computers all over the world. The users of such engines compose queries, and the search engine identifies pages that match the queries, e.g., pages that include the key words of the queries.
The Web is a hyperlinked environment. Pages in the Web generally contain links to other Web pages. The links enable users to navigate the Web. In the page containing the link, usually there is some text associated with the link. In typical browsers the user clicks on this text to follow the link. This text is known as anchortext. For instance, a page about travel may contain a link to www.ual.com, the home page of United Airlines. The anchortext associated with the link might be “United,” “United Air Lines,” or “U.A.L,” entirely at the discretion of the author of the page linking to the United site.
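The relationship between a link and its anchortext can be illustrated with a short sketch. The following is only one possible way to collect (link target, anchortext) pairs from a page, using the HTML parser in the Python standard library; the names AnchorTextParser and extract_anchortext and the sample page are hypothetical and are introduced here solely for illustration.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchortext) pairs from the <a> elements of one page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchortext) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text is anchortext only while a link is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the anchortext is one clean string.
            anchortext = " ".join("".join(self._text).split())
            self.links.append((self._href, anchortext))
            self._href = None


def extract_anchortext(html):
    """Returns the (href, anchortext) pairs found in an HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical travel page with two links to the same site.
page = (
    '<p>Fly with <a href="http://www.ual.com">United Air Lines</a> '
    'or <a href="http://www.ual.com"><b>U.A.L</b></a>.</p>'
)
print(extract_anchortext(page))
```

In this sketch, both links point to the same target but carry different anchortext, reflecting that the wording is chosen by the author of the linking page rather than by the owner of the linked site.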
A challenge for search engines is to identify the resources most relevant to the query and to place them first among all the results returned. This ordering of the result set by relevance is known as “ranking.” Ranking based solely on the content of the documents is only partially effective on such a large scale. Other factors, in particular anchortext, are necessary. One source of difficulty in locating the most relevant documents is the lack of an effective system and method for determining the relevance of indexed documents based on terms used by persons linking to a Web page. In addition to the textual content of the individual pages, the link structure of such collections contains information that can be tapped when identifying the most authoritative sources. The text associated with a link, called “anchortext,” also provides information useful for identifying relevant and important documents.
Yanhong Li discusses a system called Hyperlink Vector Voting, or HVV, which uses the content of hyperlinks to a document to rank its relevance to the query terms. See Li, Toward a Qualitative Search Engine, IEEE Internet Computing, July-August 1998. HVV assigns importance to pages by analyzing the inbound links to a particular Web site. Authors of Web pages in effect vote for, or endorse, a Web page to which they include hyperlinks. Li provides an example of a page that uses the word “attorney” throughout but does not use the word “lawyer” at all. Nevertheless, the page may still have much content relevant to those searching for lawyers. A conventional search system that only seeks documents that use at least some of the query terms would not identify the document using the word “attorney” when responding to queries such as “best lawyers” or “best divorce lawyers.” Li also discusses some detriments of HVV, such as the fact that the ranking of the documents does not depend on the words appearing in the documents that satisfy a given query. Thus, although a Web page may be very popular and hence be the object of many hyperlinks, its content may not be the most relevant to the received query. Moreover, it is also possible to intentionally mislead users of HVV-based search engines by creating a Web page that includes a large number of hyperlinks all pointing to the same page in order to artificially inflate the connectivity-based ranking of the referenced page. Such techniques are commonly known as spam. Nevertheless, if one can defeat such spamming schemes and overcome the other drawbacks, it appears clear that the number of inbound links to a given Web page provides a useful measure of its popularity and perhaps its quality.
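The general idea of ranking by the terms used in inbound links can be sketched as follows. This is a simplified illustration and not Li's HVV implementation: it assumes each link is available as a (source page, target page, anchortext) triple, and it counts each source page at most once per term and target, which is an assumption made here to blunt the single-page repeated-link scheme described above. The names build_anchor_index and rank and all URLs are hypothetical.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Builds term -> target -> set of source pages whose anchortext
    for a link to that target contains the term.

    links is an iterable of (source_url, target_url, anchortext) triples.
    """
    index = defaultdict(lambda: defaultdict(set))
    for source, target, anchortext in links:
        for term in anchortext.lower().split():
            # A set records each source page once, however many
            # times that page repeats the same link.
            index[term][target].add(source)
    return index


def rank(index, query):
    """Scores each target by the number of distinct source pages
    'voting' for it with each query term, highest score first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for target, sources in index.get(term, {}).items():
            scores[target] += len(sources)
    # Sort by descending score, then by URL for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link data. The firm's own page might use only the word
# "attorney", yet pages linking to it use the word "lawyers".
links = [
    ("http://a.example/", "http://firm.example/", "best divorce lawyers"),
    ("http://b.example/", "http://firm.example/", "lawyers"),
    ("http://spam.example/", "http://junk.example/", "lawyers"),
    ("http://spam.example/", "http://junk.example/", "lawyers"),
    ("http://spam.example/", "http://junk.example/", "best lawyers"),
]
index = build_anchor_index(links)
print(rank(index, "best lawyers"))
```

Counting distinct source pages does not address links spread across many cooperating pages, and, like the approach discussed above, this sketch ignores the words appearing in the target documents themselves.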
As discussed above, the language used in queries by users of search engines is often not the most precise expression of the desired concept. A need thus exists for a connectivity-based indexing system that better uses the information provided by hyperlinks.