1. Field of Use
This invention relates generally to a method for cataloging, filtering and ranking information, as for example, World Wide Web pages of the Internet; and more particularly, to a method, preferably implemented in computer software, for interactively creating an information database including preferred information elements such as preferred-authority World Wide Web pages. The method includes steps for enabling a user to interactively create a frame-based, hierarchical organizational structure for the information elements, and steps for identifying and automatically filtering and ranking by relevance information elements, such as World Wide Web pages, for populating the structure to form, for example, a searchable World Wide Web page database. The method features steps for enabling a user to interactively define a frame-based, hierarchical information structure for cataloging information; identify a preliminary population of information elements for a particular hierarchical category arranged as a frame, based upon the respective frame attributes; thereafter expand the information population to include related information; subsequently, automatically filter and rank the information based upon relevance; and then populate the hierarchical structure with a definable portion of the filtered, upper-ranked information elements. In the case of World Wide Web pages, the method features steps for enabling a user to interactively establish a hierarchical database structure having frames defined as categories of information of user interest; searching for and collecting a preliminary population of Web pages of interest based upon the respective frame attributes of the hierarchy; subsequently expanding the population based upon links, either actual or virtual, associated with the pages; filtering and ranking the pages based upon the relevance of the pages derived from the authority of the links; and thereafter limiting the population to a desired number of upper-ranked pages.
2. Related Art
The computer revolution has spawned so much information that the amount available on most subjects is now typically so large as to create a new, associated problem: going through that wealth of information and selecting from it the specific pieces most relevant to the question at hand.
For example, in the case of the Internet’s World Wide Web, if one were looking for information concerning something as straightforward as the restoration of an old car, there would likely be hundreds, if not thousands, of potential Web sites having as many if not more pages of information related to the subject. Accordingly, one faced with the problem of developing information on the subject of automobile restoration would potentially be required to locate and go through literally hundreds of Web pages in an attempt to find those few most suited to his needs.
In the past, the World Wide Web’s approach to this problem has been to provide so-called search facilities, such as Yahoo! and others, to assist Web users in finding the information, i.e., Web pages, they might be looking for. However, search facilities such as Yahoo! typically provide only general organizations of Web subject matter and associated Web pages, those organizations being arranged as categories of Web subject matter that are based on the subjective points of view of the individuals who compile the information for the respective search facilities, or the points of view of the respective providers of the search facilities, or the points of view of the Web information providers, or some combination of all of these points of view. As a result, such Web subject matter organizations are susceptible to over-inclusion and under-inclusion of information, which affects the accuracy and ease of use of the respective search facilities.
Still further, such search facilities typically are unable to group the information elements they return, i.e., pages, by their respective “authoritativeness,” that is, the degree to which others have referred to the respective elements, i.e., pages, as sources of information on the subject matter in question. Pages that have many references pointing to them are termed herein “authorities.” On the other hand, pages that themselves point to many authorities can be referred to as “hubs.”
While some workers in the field of information retrieval have noted the importance of “links” between hub and authority information elements such as Web pages, and of the computation of their respective authoritativeness weights, none have proposed systems or methods for enabling a user to interactively create an information database of preferred-authority data elements such as Web pages, or procedures for removing spurious factors that arise during computation of the authoritativeness weights for the respective pages.
With regard to the accuracy of authoritativeness computation, workers in the field have found that the computational accuracy is adversely affected by such factors as “self-promotion,” “related-page promotion,” “hub redundancy,” and “false authority.” Particularly, it has been found that during authoritativeness computations, pages with links to other pages of the same Web site can improperly confer authority upon themselves, thus giving rise to false promotion, i.e., “self-promotion,” and adversely affecting authoritativeness computation accuracy. Further, it has been found that in addition to “self-promotion,” related pages from the same Web site, as for example, a home page and several sub-pages of the home page, can improperly accumulate authority weights, giving rise to false promotion in the form of “related-page promotion,” which again adversely affects authoritativeness computation accuracy. Still further, workers have found that the value of a hub page resides in the links that it possesses, and not, typically, in the content of the page. Accordingly, where all the links of a hub page can be found in “better” hub pages, i.e., hub pages having a greater number of relevant links, inclusion of the first hub page gives rise to “hub redundancy,” which unnecessarily burdens computation. And, still further, it has been found that certain pages pertaining to a number of unrelated topics, e.g., pages of resource compilations, typically refer to, i.e., are linked to, a number of other pages, and accordingly appear as if they are “good hubs” even though many of the associated links point to pages of unrelated subject matter. This in turn causes the relevant links from the same page to become “false authorities,” which, once again, adversely affects the accuracy of authoritativeness computation.
For example, J. Kleinberg, in his U.S. patent application entitled “Method and System for Identifying Authoritative Information Resources in an Environment with Content-based Links Between Information Resources,” Ser. No. 08/813,749, filed Mar. 7, 1997, now U.S. Pat. No. 6,112,202, and assigned to the assignee of the current application, describes a method for automatically identifying the most authoritative Web pages from a large set of hyperlinked Web pages. More specifically, Kleinberg explains that his method applies to the case where, for example, one has a page whose content is of interest, and desires to find other pages which are authoritative with respect to the content of the page of interest. However, while Kleinberg notes that his method includes steps for conducting a search based upon a query composed from the content of the page of interest; steps for thereafter expanding the group of pages initially retrieved with pages that are linked to the pages initially retrieved; and finally, steps for iteratively computing the authoritativeness of the pages retrieved based upon the “weights” for the respective page link structures, his method fails to consider the interactive creation by a user of a database structure for the information, or optimization of the authoritativeness computation by removal of spurious factors which adversely affect accuracy.
Likewise, S. Chakrabarti et al., in their U.S. patent application entitled “Method and System for Filtering of Information Entities,” Ser. No. 08/947,221, filed Oct. 8, 1997, now pending, also assigned to the assignee of the current application, describe a method for determining the “affinity” of information elements, the method including steps for first obtaining an initial set of information elements, thereafter, steps for expanding the initial set with “related” information elements, and subsequently, iteratively computing the relative affinity for the respective information elements. However, as in the case of Kleinberg, Chakrabarti et al. fail to consider or describe facilities for enabling a user to interactively create a database structure for the information, or optimization of the “affinity” computation by removing spurious factors which adversely affect accuracy.
Accordingly, it is an object of the present invention to provide a method for cataloging and ranking information.
Additionally, it is an object of the present invention to provide a method for interactively creating and/or modifying an information database including preferred information elements such as preferred-authority, World Wide Web pages.
Further, it is an object of the present invention to provide a method for improving the determination of authoritativeness amongst related information elements such as hyperlinked, World Wide Web pages.
Yet further, it is an object of the present invention to provide a method for improving the determination of authoritativeness amongst related information elements such as Web pages by filtering spurious factors which adversely affect accuracy.
Still further, it is an object of the present invention to provide a method for enabling a user to interactively develop a personalized database structure for information organized in accordance with the user preferences, which may be subsequently populated with preferred-authority information elements such as hyperlinked, World Wide Web pages collected by the user.
Yet additionally, it is also an object of the present invention to provide a method for enabling a user to interactively develop a database of preferred-authority information elements, which database may be subsequently searched conveniently and efficiently to identify information elements such as World Wide Web pages of preferred-authority.
Briefly, to achieve at least one of the above and other objects and advantages, the method of the present invention includes steps for enabling a user to interactively create and/or modify an information database featuring a hierarchical, frame-based, organizational structure of the user’s selection for receiving information elements, such as World Wide Web pages, also of the user’s selection. Further, the method features steps for enabling the identification of information elements, such as Web pages, having preferred-authority as determined by improved, automated computation of the link structure between information elements.
In the interest of simplicity, and to assist understanding, in the following discussion and throughout the specification, usage of the more specific terms “page(s)” and “Web site(s)” will be employed to include, and understood to embrace, respectively, the more general terms “information element(s)” and “information source(s)” unless otherwise expressly stated.
With the above thought in mind, it is to be noted that in preferred form, the method of the present invention is implemented in computer software suitable to be run on a conventional personal computer having a central processing unit, associated RAM, ROM and disk storage memory, and accompanying input-output devices, such as keyboard, pointing device, display monitor and printer. In preferred form the method includes program steps for facilitating generation of a display at, for example, the computer monitor, the display featuring an interface for enabling a user to interactively compose and/or modify an adjustable, frame-based, hierarchical organizational structure representing an arrangement of topics of the user’s design. In accordance with the invention, the user formulates the frame-based organizational structure to receive information elements, such as World Wide Web pages, which the user may subsequently select to populate the various frames of the organizational structure based on the respective frame attributes, i.e., descriptive features. In preferred form, the interface includes one or more screens respectively having multiple partitions for presenting: a graphical representation of the frame-based, hierarchical information structure of the user’s creation; the Web pages contained in the category frames of the structure; and the components employed in selecting the Web pages for populating the frames. More particularly, the interface features graphical presentation of the frame-based hierarchical information structure, together with associated tools for freely navigating and modifying the structure; for example, by adding, deleting or moving frames within the structure to represent the tastes and preferences of the user. Additionally, the interface includes partitions for displaying the Web pages associated with a user-selected frame of the organizational structure, together with tools for manipulating and managing the pages included at the frame.
And, still further, in preferred form, the interface includes partitions and associated tools for enabling the user to view respective Web page content, such as page links, associated with selected frames, and the frame attributes used as query terms for initiating automated generation of preferred-authority Web pages for populating the frames of the organizational structure.
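By way of illustration only, the frame-based, hierarchical organizational structure described above might be represented as a simple tree of frames, each carrying attributes and pages. The following is a minimal sketch under stated assumptions: the class name `Frame`, its fields, and its methods are hypothetical and are not taken from the specification, and the derivation of a query by joining the frame attributes is merely one plausible choice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison, so list.remove() targets the exact frame
class Frame:
    """Hypothetical category frame: a name, descriptive attributes, pages, sub-frames."""
    name: str
    attributes: list[str] = field(default_factory=list)
    pages: list[str] = field(default_factory=list)
    children: list["Frame"] = field(default_factory=list)
    parent: Optional["Frame"] = field(default=None, repr=False)

    def add_child(self, child: "Frame") -> "Frame":
        """Attach a sub-frame beneath this frame."""
        child.parent = self
        self.children.append(child)
        return child

    def remove_child(self, child: "Frame") -> None:
        """Detach a sub-frame from this frame."""
        self.children.remove(child)
        child.parent = None

    def move_to(self, new_parent: "Frame") -> None:
        """Move this frame (and its subtree) under a different parent."""
        if self.parent is not None:
            self.parent.remove_child(self)
        new_parent.add_child(self)

    def query_terms(self) -> str:
        """Assumed convention: the frame attributes, joined, serve as the query."""
        return " ".join(self.attributes)


root = Frame("Automobiles")
restoration = root.add_child(
    Frame("Restoration", attributes=["classic", "car", "restoration"])
)
parts = root.add_child(Frame("Parts"))
restoration.move_to(parts)  # the user reorganizes the hierarchy
```

Adding, deleting and moving frames here correspond to the navigation and modification tools of the interface; the graphical presentation itself is outside the scope of the sketch.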
Further, in preferred form, computation of Web page authoritativeness is undertaken by defining Web page and associated link structure as including hub pages and authority pages, wherein a hub page “points to,” i.e., links to, one or more authority pages, and an authority page is “pointed to,” i.e., linked to, by one or more hub pages. Further, the method includes steps for constructing a root set of Web pages likely to be relevant to a topic selected by the user. The root set is developed by first generating an initial set of Web pages with the use of a conventional query derived from the attributes of the category frame of the database hierarchical organizational structure the user is interested in populating, the query so derived thereafter being applied in conventional fashion against the World Wide Web. Further, the method includes steps for subsequently expanding the initial set of Web pages returned responsive to the query to include page elements directly linked to the Web pages of the initial set, thus forming the root set.
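The expansion of the initial set into the root set may be sketched as follows. This is an illustrative sketch only: the function name `build_root_set` is hypothetical, the initial set is supplied directly rather than obtained from a search facility, and “directly linked” is assumed here to cover links in either direction (pages pointed to by, and pages pointing to, a page of the initial set).

```python
def build_root_set(initial_pages, links):
    """Expand an initial page set by one link step to form the root set.

    initial_pages: iterable of page identifiers returned by the query.
    links: iterable of (source, destination) pairs describing known links.
    """
    initial = set(initial_pages)
    root = set(initial)
    for source, destination in links:
        if source in initial:
            root.add(destination)  # page pointed to by an initial page
        if destination in initial:
            root.add(source)       # page pointing to an initial page
    return root


example_links = [("p1", "x"), ("y", "p2"), ("x", "z")]
root_set = build_root_set({"p1", "p2"}, example_links)
# "z" is two steps from the initial set and is therefore not included.
```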
Following creation of the root set, the method includes steps for associating a hub-weight parameter and an authority-weight parameter with each Web page, and iteratively calculating the authoritativeness of the respective pages of the root set based on the resulting, respective, hub-weight and authority-weight values for each page.
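One way such an iterative calculation might proceed is sketched below. The update rule shown (a page’s authority weight is the sum of the hub weights of pages pointing to it, and a page’s hub weight is the sum of the authority weights of pages it points to) is the mutually reinforcing scheme commonly associated with hub and authority analysis; the specification does not fix the rule in this passage, so the rule, the normalization, and the fixed iteration count are all assumptions, and the function names are hypothetical.

```python
import math


def _normalize(weights):
    """Scale weights to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: (v / norm if norm else 0.0) for p, v in weights.items()}


def hubs_and_authorities(pages, links, iterations=20):
    """Iteratively compute hub and authority weights over a root set.

    pages: iterable of page identifiers.
    links: list of (source, destination) pairs, both drawn from pages.
    """
    pages = list(pages)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing in.
        new_auth = {p: 0.0 for p in pages}
        for source, destination in links:
            new_auth[destination] += hub[source]
        # Hub weight: sum of authority weights of the pages pointed to.
        new_hub = {p: 0.0 for p in pages}
        for source, destination in links:
            new_hub[source] += new_auth[destination]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


hub_w, auth_w = hubs_and_authorities(
    ["h1", "h2", "a1", "a2"],
    [("h1", "a1"), ("h2", "a1"), ("h1", "a2")],
)
```

In this small example, page `a1` is pointed to by both hubs and so receives the larger authority weight, while `h1` points to both authorities and so receives the larger hub weight.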
In accordance with the invention, the method additionally features steps for improving computational accuracy of the authoritativeness for the Web pages. Specifically, the method features steps executed during the computation of authoritativeness for filtering spurious computational factors such as “self-promotion,” “related-page promotion,” “hub redundancy,” and “false authority.” In preferred form, the method includes steps for filtering “self-promotion” from the computation, the steps including the discarding of links between pages from the same Web site. Further, the method includes steps for filtering “related-page promotion” from the computation, which steps include “re-packing” the Web pages for any Web site having multiple pages showing non-zero authority, during which re-packing all authorities other than the largest authority are set to zero.
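The two filters just described may be sketched as follows. The sketch assumes, for illustration only, that pages are identified by URL and that two pages belong to the same Web site when their URLs share a host name; the specification does not state how a site is determined. The function names are hypothetical, and ties between equal authority weights are resolved in favor of the page encountered first.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Assumed notion of a Web site: the lower-cased host portion of the URL."""
    return urlsplit(url).netloc.lower()


def drop_same_site_links(links):
    """Filter self-promotion: discard links between pages of the same site."""
    return [(s, d) for s, d in links if site_of(s) != site_of(d)]


def repack_by_site(auth):
    """Filter related-page promotion: keep only the largest authority per site."""
    best = {}  # site -> page currently holding the largest authority
    for page, weight in auth.items():
        site = site_of(page)
        if site not in best or weight > auth[best[site]]:
            best[site] = page
    return {p: (w if best[site_of(p)] == p else 0.0) for p, w in auth.items()}


links = [
    ("http://a.example/", "http://a.example/sub"),  # same site: discarded
    ("http://a.example/", "http://b.example/"),     # cross-site: kept
]
filtered = drop_same_site_links(links)
repacked = repack_by_site(
    {"http://a.example/": 0.6, "http://a.example/sub": 0.2, "http://b.example/": 0.3}
)
```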
Still further, the method in preferred form also includes steps for filtering “hub redundancy,” the steps including identifying the highest weight, i.e., “best,” hub during computation, zeroing the authority values of all pages pointed to by that hub, re-computing hub values, and, subsequently, recalculating authoritativeness. And, yet additionally, the method in preferred form includes steps for filtering “false authority,” the steps including: allowing each link in a Web page to have its own hub value; incrementing the authority value of the destination page with the hub value of the link when authority values are calculated; and re-computing the hub values of the original hub page with the authority value of the destination page, and accordingly, by a spreading function, the hub values of neighboring links. As will be appreciated, the final hub value of the page is therefore the integral of the hub values of its links.
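A single pass of the hub-redundancy filter might be sketched as below. This reflects only one possible reading of the steps recited above: the function name `suppress_best_hub` is hypothetical, hub values are assumed to be re-computed as the sum of the remaining authority values of the pages each hub points to, and the subsequent recalculation of authoritativeness, as well as the false-authority filter with its spreading function, are not shown.

```python
def suppress_best_hub(links, hub, auth):
    """One pass of hub-redundancy filtering (illustrative reading only).

    links: list of (source, destination) pairs.
    hub, auth: dicts mapping page -> current hub / authority weight.
    Returns (best_hub, new_auth, new_hub).
    """
    # Identify the highest-weight ("best") hub.
    best = max(hub, key=hub.get)
    # Zero the authority values of all pages pointed to by that hub.
    covered = {d for s, d in links if s == best}
    new_auth = {p: (0.0 if p in covered else w) for p, w in auth.items()}
    # Re-compute hub values from the remaining authority values.
    new_hub = {p: 0.0 for p in hub}
    for source, destination in links:
        new_hub[source] += new_auth[destination]
    return best, new_auth, new_hub


links = [("A", "x"), ("A", "y"), ("A", "w"),  # A covers x, y and w
         ("B", "x"), ("B", "y"),              # B's links all appear in A
         ("C", "z")]                          # C contributes a distinct page
auth = {"A": 0.0, "B": 0.0, "C": 0.0, "x": 0.5, "y": 0.4, "w": 0.1, "z": 0.3}
hub = {"A": 1.0, "B": 0.9, "C": 0.3, "x": 0.0, "y": 0.0, "w": 0.0, "z": 0.0}
best, auth_after, hub_after = suppress_best_hub(links, hub, auth)
```

Under these assumptions, hub `B`, whose links are all found in the better hub `A`, drops to a hub value of zero after the pass, while hub `C` retains its value.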
Still further, in preferred form, the method in accordance with the invention includes steps for ranking the pages of the root set based on authoritativeness following computation of page hub and authority weights, and thereafter truncating the root set to a number of highest-ranking pages prescribed by the user.
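The ranking and truncation step may be sketched as a sort on authority weight followed by a cut at the user-prescribed count. The function name `top_authorities` is hypothetical, and breaking ties alphabetically by page identifier is an assumption made only so that the result is deterministic.

```python
def top_authorities(auth, n):
    """Rank pages by authority weight (descending) and keep the top n."""
    ranked = sorted(auth.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:n]]


weights = {"p1": 0.10, "p2": 0.70, "p3": 0.45, "p4": 0.45}
kept = top_authorities(weights, 3)
```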