1. Field of Use
This invention relates generally to a method for cataloging, filtering and ranking information, as for example, World Wide Web pages of the Internet; and more particularly, to a method, preferably implemented in computer software, for interactively creating an information database including preferred information elements such as preferred-authority World Wide Web pages. The method includes steps for enabling a user to interactively create a frame-based, hierarchical organizational structure for the information elements, and steps for identifying and automatically filtering and ranking by relevance information elements, such as World Wide Web pages, for populating the structure to form, for example, a searchable World Wide Web page database. The method features steps for enabling a user to interactively define a frame-based, hierarchical information structure for cataloging information; identify a preliminary population of information elements for a particular hierarchical category arranged as a frame, based upon the respective frame attributes; thereafter expand the information population to include related information; subsequently, automatically filter and rank the information based upon relevance; and then populate the hierarchical structure with a definable portion of the filtered, upper-ranked information elements. In the case of World Wide Web pages, the method features steps for enabling a user to interactively establish a hierarchical database structure having frames defined as categories of information of user interest; searching for and collecting a preliminary population of Web pages of interest based upon the respective frame attributes of the hierarchy; subsequently expanding the population based upon links, either actual or virtual, associated with the pages; followed by filtering and ranking the pages based upon the relevance of the pages derived from the authority of the links; and thereafter, limiting the population to a desired number of upper-ranked pages.
2. Related Art
The computer revolution has spawned so much information that the amount available on most subjects is now typically so large as to create a new, associated problem: going through that wealth of information and selecting from it the specific pieces most relevant to the question at hand.
For example, in the case of the Internet's World Wide Web, if one were looking for information concerning something as straightforward as the restoration of an old car, there would likely be hundreds, if not thousands, of potential Web sites having as many if not more pages of information related to the subject. Accordingly, one faced with the problem of developing information on the subject of automobile restoration would potentially be required to locate and go through literally hundreds of Web pages in an attempt to find those few most suited to his needs.
In the past, the World Wide Web's approach to this problem has been to provide so-called search facilities, such as Yahoo!® and others, to assist Web users in finding the information, i.e., Web pages, they might be looking for. However, search facilities such as Yahoo! typically only provide general organizations of Web subject matter and associated Web pages, those organizations being arranged as categories of Web subject matter that are based on the subjective points of view of the individuals who compile the information for the respective search facilities, or the points of view of the respective providers of the search facilities, or the points of view of the Web information providers, or some combination of all of these points of view. As a result, such Web subject matter organizations are susceptible to over-inclusion and under-inclusion of information, which affects the accuracy and ease of use of the respective search facilities.
Still further, such search facilities typically are unable to group the information elements they return, i.e., pages, by their respective "authoritativeness," that is, the degree to which others have referred to the respective elements, i.e., pages, as sources of information on the subject matter in question. Pages that have many references pointing to them are termed herein "authorities." On the other hand, pages that themselves point to many authorities can be referred to as "hubs."
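The notions of "authority" and "hub" can be illustrated, by way of a minimal sketch only, with raw link counts. The sketch assumes that links are available as (source, target) pairs; the function name `link_counts` and the page names are hypothetical and are introduced solely for illustration. Raw counts are only a crude first approximation, since a hub is characterized by pointing to authorities rather than merely to many pages.

```python
from collections import Counter

def link_counts(links):
    """Count inbound and outbound references for each page.

    `links` is an iterable of (source_page, target_page) pairs.
    A high inbound count suggests an authority; a high outbound
    count suggests a candidate hub.
    """
    inbound = Counter()
    outbound = Counter()
    for source, target in links:
        outbound[source] += 1
        inbound[target] += 1
    return inbound, outbound

# Hypothetical link data: three pages refer to page "a".
links = [
    ("hub1", "a"), ("hub1", "b"), ("hub1", "c"),
    ("hub2", "a"), ("hub2", "b"),
    ("x", "a"),
]
inbound, outbound = link_counts(links)
```

In this hypothetical data, page "a" receives the most references and "hub1" points to the most pages.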
While some workers in the field of information retrieval have noted the importance of "links" between hub and authority information elements such as Web pages, and computation of their respective authoritativeness weights, none have proposed systems or methods for enabling a user to interactively create an information database of preferred-authority data elements such as Web pages, or procedures for removing spurious factors that arise during computation of the authoritativeness weights for the respective pages.
With regard to the accuracy of authoritativeness computation, workers in the field have found that the computational accuracy is adversely affected by such factors as "self-promotion", "related-page promotion", "hub redundancy", and "false authority." Particularly, it has been found that during authoritativeness computations, pages with links to other pages of the same Web site can improperly confer authority upon themselves, thus giving rise to false promotion, i.e., "self-promotion," which adversely affects authoritativeness computation accuracy. Further, it has been found that in addition to "self-promotion", related pages from the same Web site, as for example, a home page and several sub-pages of the home page, can improperly accumulate authority weights, giving rise to false promotion in the form of "related-page promotion", which again adversely affects authoritativeness computation accuracy. Still further, workers have found that the value of a hub page resides in the links that it possesses, and not, typically, in the content of the page. Accordingly, where all the links of a hub page can be found in "better" hub pages, i.e., hub pages having a greater number of relevant links, inclusion of the first hub page gives rise to "hub redundancy", which unnecessarily burdens computation. And, still further, it has been found that certain pages pertaining to a number of unrelated topics, e.g., pages of resource compilations, typically refer to, i.e., are linked to, a number of other pages, and accordingly appear as if they are "good hubs" even though many of the associated links point to pages of unrelated subject matter. This in turn causes the relevant links from the same page to become "false authorities", which, once again, adversely affects the accuracy of authoritativeness computation.
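By way of illustration only, one plausible way to discount "self-promotion" is to discard links whose source and target belong to the same Web site before any weights are computed. The sketch below is not taken from the works discussed herein; it assumes that "same Web site" can be approximated by an identical hostname, and the function name `drop_same_site_links` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def drop_same_site_links(links):
    """Return only the links that cross from one host to another.

    `links` is an iterable of (source_url, target_url) pairs.
    Assumption: two pages belong to the same Web site when their
    URLs share a hostname.
    """
    kept = []
    for source, target in links:
        if urlsplit(source).hostname != urlsplit(target).hostname:
            kept.append((source, target))
    return kept

# Hypothetical link data: the first link stays within one site.
links = [
    ("http://cars.example/index.html", "http://cars.example/parts.html"),
    ("http://cars.example/index.html", "http://restore.example/guide.html"),
    ("http://clubs.example/links.html", "http://cars.example/index.html"),
]
external = drop_same_site_links(links)
```

Here the link from the home page of cars.example to its own parts page is dropped, so that page cannot confer authority upon its own site.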
For example, J. Kleinberg, in his U.S. patent application entitled "Method and System for Identifying Authoritative Information Resources in an Environment with Content-based Links Between Information Resources", U.S. Ser. No. 08/813,749, filed Mar. 7, 1997, now U.S. Pat. No. 6,112,202, and assigned to the assignee of the current application, describes a method for automatically identifying the most authoritative Web pages from a large set of hyperlinked Web pages. More specifically, Kleinberg explains that his method applies to the case where, for example, one has a page whose content is of interest, and desires to find other pages which are authoritative with respect to the content of the page of interest. However, while Kleinberg notes that his method includes steps for conducting a search based upon a query composed from the content of the page of interest; steps for, thereafter, expanding the group of pages initially retrieved with pages that are linked to the pages initially retrieved; and finally, steps for iteratively computing the authoritativeness of the pages retrieved based upon the "weights" for the respective page link structures, his method fails to consider the interactive creation by a user of a database structure for the information, or optimization of the authoritativeness computation by removal of spurious factors which adversely affect accuracy.
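The iterative computation of authoritativeness from link weights can be sketched as follows. This is a minimal illustration of the commonly published mutually reinforcing hub and authority update, and it may differ in detail from what the cited application specifies. The function names `hits` and `normalize`, the fixed iteration count, and the link data are assumptions made for illustration.

```python
import math

def normalize(scores):
    """Scale scores so that the sum of their squares equals 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=20):
    """Iteratively compute authority and hub weights.

    `links` is a list of (source_page, target_page) pairs.
    A page's authority weight is the sum of the hub weights of the
    pages pointing to it; a page's hub weight is the sum of the
    authority weights of the pages it points to.
    """
    pages = {page for link in links for page in link}
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_auth = {page: 0.0 for page in pages}
        for source, target in links:
            new_auth[target] += hub[source]
        new_hub = {page: 0.0 for page in pages}
        for source, target in links:
            new_hub[source] += new_auth[target]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical link data for a small set of retrieved pages.
links = [
    ("hub1", "a"), ("hub1", "b"), ("hub1", "c"),
    ("hub2", "a"), ("hub2", "b"),
    ("x", "a"),
]
auth, hub = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

The sketch applies no correction for the spurious factors discussed above; every link contributes equally to the weights.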
Likewise, S. Chakrabarti et al., in their pending U.S. patent application entitled "Method and System for Filtering of Information Entities", U.S. Ser. No. 08/947,221, filed Oct. 8, 1997, also assigned to the assignee of the current application, describe a method for determining the "affinity" of information elements, the method including steps for first obtaining an initial set of information elements, thereafter, steps for expanding the initial set with "related" information elements, and subsequently, iteratively computing the relative affinity for the respective information elements. However, as in the case of Kleinberg, Chakrabarti et al. fail to consider or describe facilities for enabling a user to interactively create a database structure for the information, or optimization of the "affinity" computation by removing spurious factors which adversely affect accuracy.