The present invention relates generally to computerized information retrieval systems and methods, and more specifically to a system and method of obtaining an importance ranking for a hierarchical collection of objects, such as documents or pages in a linked database such as the world wide web.
In recent years, users of host computers connected to the Internet have increasingly employed application programs such as web browsers and search engines to search for information contained in documents or “pages” on the world wide web (WWW or “web”). In a typical search for one or more “web pages” of interest, a user of a host computer (“host”) composes a search query containing one or more specified keywords, and submits the query to a search engine, e.g., the Google™, AltaVista™, or Excite™ search engine, via a web browser, e.g., the Microsoft Internet Explorer™ or Netscape Navigator™ web browser. In response to the user's query, the search engine typically searches a “snapshot” (i.e., a cached version) of the web for pages containing the specified keywords. Such a snapshot of the web generally stores a list of web pages that is indexed based upon certain words and/or phrases that may be found in the contents of the pages. The indexed list of web pages may be generated using one or more web crawler or “spider” programs that operate to fetch pages from the web. Because the world wide web may be regarded as a hyperlinked database, each web page fetched by a spider program may contain one or more hyperlinks to one or more other web pages. A spider program can employ these hyperlinks to fetch additional web pages. The search engine searches the indexed list of web pages, identifies as many web pages as possible containing the keywords specified in the query, and generates a result set including a list of the identified pages. Because the result set generated by the search engine may list hundreds if not thousands or millions of web pages, the search engine generally ranks the web pages based upon their relevance to the user's query and their “importance” relative to one another, thereby assuring that relevant and important pages appear at or near the top of the list.
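By way of illustration only, the crawl, index, and keyword lookup steps described above may be sketched as follows. This is a minimal sketch under stated assumptions and does not represent the implementation of any particular search engine: the in-memory `TOY_WEB` dictionary, its URLs, and the function names `crawl`, `build_index`, and `search` are all hypothetical, and the rule that a page must contain every keyword of the query is likewise an assumption made for the example.

```python
from collections import deque

# Hypothetical toy "web": URL -> (page text, outgoing hyperlinks).
TOY_WEB = {
    "a.example/": ("search engines rank pages", ["a.example/x", "b.example/"]),
    "a.example/x": ("spiders fetch pages", ["a.example/"]),
    "b.example/": ("rank pages by importance", []),
    "c.example/": ("unreachable page about rank", []),
}


def crawl(seed, web):
    """Fetch every page reachable from `seed` by following hyperlinks."""
    seen, queue = {seed}, deque([seed])
    snapshot = {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        snapshot[url] = text  # store the cached copy of the page
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return snapshot


def build_index(snapshot):
    """Map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in snapshot.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the URLs containing every keyword in `query` (assumed rule)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


snapshot = crawl("a.example/", TOY_WEB)
index = build_index(snapshot)
results = search(index, "rank pages")
```

In this sketch the page `c.example/` never enters the snapshot because no fetched page links to it, which illustrates why a result set depends on what the spider programs were able to reach. The result set is unordered; ranking by relevance and importance is a separate step.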
One known technique for obtaining an importance ranking of web pages is the PageRank™ technique, which employs hyperlinks from one web page to another as indicators of a web page's importance. According to the PageRank™ technique, a hyperlink from a first web page to a second web page effectively operates as a “vote” by the first page for the second page. Such a hyperlink may indicate that the author of the first web page thinks highly of the content of the second web page. As the number of votes for the second web page increases, the importance of that page increases. The PageRank™ technique not only considers the number of votes cast for a particular web page, but also takes into account the importance of the web pages casting the votes. For example, the PageRank™ technique may give more weight to votes cast by important web pages than to votes cast by pages that are deemed to be either unimportant or of lesser importance. The PageRank™ technique is typically employed in conjunction with one or more keyword matching techniques to identify web pages that are both important and relevant to a user's search query.
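For illustration, the voting scheme described above may be sketched as a simple iterative computation in which each page divides its current importance among the pages to which it links. This is a minimal sketch under stated assumptions, not a definitive statement of the PageRank™ technique: the function name `pagerank`, the damping value of 0.85, the fixed iteration count, and the even redistribution of importance from pages having no outgoing hyperlinks are all choices made for the example and are not taken from the description above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from hyperlinks.

    `links` maps each page to the list of pages it links to.
    `damping` and `iterations` are assumed values for this sketch.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with importance spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline amount of importance.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page's vote is weighted by its own importance and
                # split evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads
                # its importance evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page example: "a" and "b" both vote for "c",
# and "c" votes for "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this example page "c" receives two votes and ranks highest, while page "a" outranks page "b" because its single vote comes from the important page "c", whereas "b" receives no votes at all.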
One drawback of the PageRank™ technique is that, when generating an importance ranking, it generally fails to consider any structure inherent in the world wide web or how the web pages to be ranked fall within that structure. For example, the web may be regarded as having a hierarchical structure based upon the domains and sub-domains of the web, the categories and sub-categories of the subject matter contained in the web pages, or any other suitable construct for hierarchically classifying documents or pages on the web. Although the PageRank™ technique may take into account the number of votes cast by web pages and the importance of the pages casting the votes, the PageRank™ technique does not generally consider the importance of the web pages in relation to their proximity to one another within any structure of the web. As a result, search engines that employ the PageRank™ technique may fail to retrieve the web pages that are the most important and relevant to a user's query.
It would therefore be desirable to have a system and method of obtaining an importance ranking for objects in a linked database that takes into account any structure of the database.