This invention relates generally to a method for identifying, filtering, ranking and cataloging information elements, as for example, Internet World Wide Web pages, considered in whole, in part, or in combination; and more particularly, to a method, preferably implemented in computer software, for interactively creating an information database including preferred information elements, the method including steps for enabling a user to interactively create a frame-based, hierarchical organizational structure for the information elements, and steps, thereafter, for identifying by iteration and automatically filtering and ranking by degree of relevance information elements for populating the frames of the structure to form, for example, a searchable World Wide Web page database. In further detail, the method features steps for enabling a user to interactively define a frame-based, hierarchical information structure for cataloging information, and steps for identifying information elements to populate respective frames of the structure by iteration, the iteration including steps for: identifying a preliminary population of information elements with the use of a search query based on respective frame attributes, the frame attributes selectively including classification designations, example pages, stop pages and/or control parameters used by conventional search engines, as required; supplementing the preliminary population based on usage of example pages and/or stop pages; expanding the supplemented preliminary population to include related information; automatically filtering and computing information element ranking based on degree of relevance to the respective frame; and, thereafter, refining the identification with successive iterations of the steps described until identification is deemed complete, whereupon the hierarchical structure is populated with a user-defined portion of the preferred information elements identified.
As a yet further problem, and potentially an even more perplexing one, not only has the computer revolution created a greater need for information, but, undeniably, it has created an abundance, indeed, an overabundance of information to meet that need. In fact, the computer revolution has spawned so much information that the amount available on most subjects is now typically so large as to create the new and associated problems of going through that wealth of information and selecting from it the items most relevant to the question at hand.
For example, in the case of the Internet's World Wide Web, if one were looking for information concerning something as straightforward as the restoration of an old car, there likely would be hundreds, if not thousands, of potential Web sites having as many if not more pages of information relating to the subject of old cars, and the parts, services and techniques for their restoration. Accordingly, one faced with the problem of developing information on the subject of automobile restoration would potentially be required to locate and go through literally hundreds of Web pages in an attempt to find those few most suited to his needs.
In the past, the World Wide Web's approach to this problem has been to provide search facilities such as Yahoo! and others, to assist Web users in finding the information; i.e., Web pages, they might be looking for. However, search facilities such as Yahoo! typically provide only generalized organizations of Web subject matter, those organizations being arranged as categories of Web pages, the categories and the things included in them being based on the nature of the Web sites, the subjective points of view of numerous staff classifiers working for the search facility, and the classification criteria they established. In accordance with this approach, organization of the information is, therefore, influenced by the respective points of view of the various classifiers, the providers of the search facilities, and the Web site providers. As a result, such Web subject matter organizations tend to be subjective and suffer from over-inclusion and under-inclusion of information, which, in turn, affects their relevance, accuracy and ease of use.
Moreover, and of yet greater concern, is the fact that formulating and maintaining organizations of Web subject matter in the fashion noted requires expenditure of substantial amounts of human time and effort and, accordingly, money. Particularly, continuous growth and change in Web makeup requires such organizations of Web information to be repeatedly supplemented and the existing framework revised to accommodate the introduction of new and changing information. Accordingly, such approaches are manpower intensive, leading to higher costs for creation and maintenance, and because of the extensive human involvement, are, as well, subject to error.
Still further, such search facilities, typically, are unable to group the information elements they return; e.g., Web pages, by their respective "relevance", that is, the degree to which others have referred to; i.e., pointed to, the respective elements; e.g., pages, as sources of information on the subject matter in question. Pages that have many references pointing to them are termed herein "authorities". In this scheme, and in the context of Web pages, "relevance" is a function of the number and quality of links to an authority page from various hub pages, referred to as the "authority weight" for the respective authority page, or, the number and quality of links from a hub page to various authority pages, referred to as the "hub weight" for the respective hub page. Moreover, and as will be appreciated, pages of higher relevance; i.e., higher authority weight or higher hub weight, are "preferred" where one is seeking information concerning particular subject matter. Accordingly, "preferred" information elements; e.g., Web pages, are considered to have higher relevance to some specific subject matter where the information elements; e.g., Web pages, have either higher authority weight or higher hub weight with respect to the particular subject matter. And, as will also be appreciated, since information elements; e.g., Web pages, may both point to authority pages; i.e., function as a hub, and also be pointed to as an authority; i.e., function as an authority, such pages may be relevant either as a hub page or as an authority page, or as both.
No prior reference has proposed systems or methods for enabling a user to interactively create an information database of "preferred" data elements such as "preferred" Web pages; i.e., pages of either higher authority weight or higher hub weight; i.e., "relevance", or procedures for removing spurious factors that arise during computation of hub and authority weights for the respective pages.
With regard to relevance; i.e., weight, computation, workers in the field have found that the computational accuracy is adversely affected by such factors as "self-promotion", "related-page promotion", "hub redundancy", "copied pages", and "false authority." Particularly, it has been found that during relevance computations pages with links to other pages of the same Web site can improperly confer authority upon themselves, thus giving rise to false promotion; i.e., "self-promotion," and adversely affecting relevance computation accuracy. Further, it has been found that in addition to self-promotion, related pages from the same Web site, as for example, a home page and several sub-pages of the home page, can improperly accumulate authority weights, giving rise to false promotion in the form of "related-page promotion", which again adversely affects relevance computation accuracy.
Further still, workers have found that a page may have value only because of the hub links it contains; that is, its content may be otherwise irrelevant. In that case, if the hub links for such a page can be found in other pages, the hub links of such a page are redundant and may not be suitable for inclusion. It is to be noted that often, the value of a hub page resides in the links that it possesses, and not the content of the page. Accordingly, where all the links of a hub page can be found in "better" hub pages; i.e., hub pages having greater numbers of relevant links, and where the content of the hub page is otherwise not of interest, inclusion of the first hub page gives rise to "hub redundancy" which reduces the effectiveness of the computation.
Continuing, spurious results have also been found to be introduced into relevance computations by the now common practice of Web site providers including in their sites material copied from other Web sites. Because of the economic and creative pressures on Web site providers to produce "content", providers often copy pages or page parts from others rather than generate new and original material for their sites. Though this approach may violate rights of the originator in the work, since little effort or cost is required, Web site providers find this a particularly fast and convenient way of generating site content, and are especially inclined to take this approach where the subject matter copied has become popular.
Regrettably, however, the existence of multiple copies of hub and/or authority pages adversely affects relevance computations. For example, multiple copies of hub pages erroneously increase the authority weight of pages pointed to, the same material being pointed to each time a hub is copied. Likewise, multiple copies of authority pages also produce problems. Particularly, copies of the same authority page split; i.e., divide, the number of links pointing to the same subject matter; i.e., the hub links pointing to the authority subject matter are dispersed over the copies. As will be appreciated, if there were only one copy of the authority, all hub links for the authority would point to that one copy, thereby consolidating the effect of the links. However, if the hub links instead point to different ones of the multiple authority copies, the total number of links that would otherwise be available is dissipated over the multiple copies. Accordingly, and as is apparent, the occurrence of "copied pages" adversely affects accuracy of the relevance computation.
And, still further, it has been found that certain pages pertaining to a number of unrelated topics; e.g., pages of resource compilations, typically refer to; i.e., are linked to, a number of other pages, and accordingly appear as if they are "good hubs," even though many of the associated links point to pages of unrelated subject matter, which in turn causes the relevant links from the same page to become "false authorities", which, once again, adversely affects the accuracy of relevance computation.
In addition, not only have previously proposed methods concerning links and computation of hub and authority weights failed to suggest or disclose interactive creation of information databases for preferred-authority data elements such as Web pages, or procedures for removing spurious factors that arise during computation of the relevance weights, but further, prior approaches have failed to appreciate the importance and benefit derived from including "example" pages which may be "seeded" into the computation so as to drive computation in a desired direction; i.e., identify pages considered relevant to the subject matter of interest. Likewise, prior methods concerning hub and authority weight computations have also failed to consider express exclusion from computation of pages found not desirable, such non-desirable pages serving to bias the computation in unwanted directions; i.e., identify pages considered irrelevant to the subject matter of interest.
With respect to previously proposed methods concerning computation of hub and authority weights, J. Kleinberg, for example, in his U.S. patent application entitled "Method and System for Identifying Authoritative Information Resources in an Environment with Content-based Links Between Information Resources", Ser. No. 08/813,749, filed Mar. 7, 1997 and now U.S. Pat. No. 6,112,202 and assigned to the assignee of the current application, describes a method for automatically identifying the most authoritative Web pages from a large set of hyperlinked Web pages. More specifically, Kleinberg explains that his method applies to cases where, for example, one has a page whose content is of interest, and desires to find other pages which are authoritative with respect to the content of the page of interest. However, while Kleinberg notes his method includes: steps for conducting a search based upon a query composed from the content of the page of interest; steps for, thereafter, expanding the group of pages initially retrieved with pages that are linked to the pages initially retrieved; and finally, steps for iteratively computing the relevance of the pages retrieved based upon the "weights" for the respective page link structures, his method fails to consider the interactive creation by a user of a database structure for the information, or optimization of the relevance computation by removal of spurious factors which adversely affect accuracy. Still further, Kleinberg fails to consider inclusion and/or exclusion, respectively, of desirable and undesirable information elements to influence the results of computation.
Likewise, S. Chakrabarti et al., in their pending U.S. patent application entitled "Method and System for Filtering of Information Entities", Ser. No. 08/947,221 filed Oct. 8, 1997, also assigned to the assignee of the current application, describe a method for determining the "affinity" of information elements, the method including steps for first obtaining an initial set of information elements, thereafter, steps for expanding the initial set with "related" information elements, and subsequently, iteratively computing the relative affinity for the respective information elements. However, as in the case of Kleinberg, Chakrabarti et al. fail to consider or describe facilities for enabling a user to interactively create a database structure for the information, or optimization of the "affinity" computation by removing spurious factors which adversely affect accuracy. Yet further, Chakrabarti et al., like Kleinberg, fail to disclose or suggest procedures for aiding computation by the inclusion of steps for introducing example information elements; e.g., example Web pages, into the process in order to direct the computation in a desired direction, or excluding undesired information elements; e.g., undesired Web pages, from the process in order to avoid the computation being taken in undesired directions.
Accordingly, it is an object of the present invention to provide a method for identifying, ranking and cataloging information.
Additionally, it is an object of the present invention to provide a method for interactively creating and/or modifying an information database including preferred information elements such as preferred World Wide Web pages, considered in whole, in part, or in combination.
Further, it is an object of the present invention to provide a method for improving the determination of relevance amongst related information elements such as hyperlinked, Web pages, considered in whole, in part or in combination.
Yet further, it is an object of the present invention to provide a method for improving the determination of relevance amongst related information elements such as Web pages, considered in whole, in part, or in combination, by filtering to reduce the effects of spurious factors which adversely affect accuracy.
Still further, it is an object of the present invention to provide a method for enabling a user to interactively develop a personalized database structure for information organized in accordance with the user preferences, which may be subsequently populated with preferred information elements such as hyperlinked, World Wide Web pages collected by the user.
Yet further, it is also an object of the present invention to provide a method for improving the determination of relevance amongst related information elements such as Web pages by introducing example information elements such as example Web pages into the process to direct the determination in a desired direction.
As well, it is an object of the present invention to provide a method for improving the determination of relevance amongst related information elements such as Web pages by excluding undesired information elements such as undesired Web pages from the process to avoid the determination being taken in undesired directions.
Yet additionally, it is also an object of the present invention to provide a method for enabling users to interactively develop databases of preferred information elements, which databases may be subsequently searched conveniently and efficiently to identify information elements such as World Wide Web pages, in whole, in part or in combination, having relevance to subject matter of interest.
Briefly, to achieve at least one of the above and other objects and advantages, the method of the present invention features steps for enabling a user to interactively create and/or modify an information database having a hierarchical, frame-based organizational structure of the user's selection, the frames of the structure for receiving automatically retrieved, preferred information elements, such as World Wide Web pages, taken in whole, in part, or in combination, the pages being preferred based on relevance to respective frames, the preferred pages being identified by information queries submitted by the user for search, and subsequent computation and filtering of page relevance undertaken by iteration.
More specifically, in accordance with the method, the information elements are defined as one or more statements of authority that form a unit of reference, such as part or all of a Web page, or a number of Web pages in combination, that are found to have relevance to subject matter of interest as determined by improved, automated computation of weights for links between information elements; e.g., weights for hyperlinks between Web pages. Additionally, the method features procedures for filtering the information elements to diminish spurious effects which adversely affect computation of relevance. Still further, the invention in preferred form includes steps for introducing into the process example information elements; e.g., example Web pages, found to be desirable so as to bias the computations in a desired direction, and steps for excluding undesired information elements; e.g., Web pages, so as to suppress biasing of the computation in unwanted directions.
In the interests of simplicity, and to assist understanding, in the following discussion and throughout the specification, usage of the more specific terms "page(s)" and "Web site(s)" will be employed to exemplify, and should be understood to embrace, respectively, the more general terms "information element(s)" and "information source(s)" unless otherwise expressly stated. Further, and as noted, an information element will be considered as including one or more statements of authority, as for example, one or more Web page hyperlinks, contained in a Web page, part of a page, or a number of Web pages, which form a unit of reference.
With the above in mind, it is to be noted that in preferred form, the method of the present invention is implemented in computer software suitable to be run on a conventional personal computer having a central processing unit, associated RAM, ROM and disk storage memory, and accompanying input-output devices, such as keyboard, pointing device, display monitor and printer. In preferred form the method includes program steps for facilitating generation of a display at, for example, the computer monitor, the display featuring an interface for enabling a user to interactively compose and/or modify an adjustable, frame-based, hierarchical organizational structure representing an arrangement of topics of the user's design. In accordance with the invention, the user formulates the frame-based organizational structure to receive information elements, such as Web pages, in whole, in part or in combination, which may be subsequently automatically collected with the method employing further input from the user to populate the various frames of the organizational structure based on the respective frame attributes, which attributes may include classification designations, example pages, stop pages and/or control parameters used by conventional search engines, as required.
In preferred form, the interface includes one or more screens respectively having multiple partitions for presenting: a graphical representation of the frame-based, hierarchical information structure of the user's creation; the Web pages contained in the category frames of the structure; and the components employed in selecting the Web pages for populating the frames. More particularly, the interface features a graphical presentation of the frame-based hierarchical information structure, together with associated tools for freely navigating and modifying the structure, as for example, by adding, deleting or moving frames within the structure to represent the tastes and preferences of the user. Additionally, the interface includes partitions for displaying the Web pages associated with a user-selected frame of the organizational structure, together with tools for manipulating and managing the pages included at the frame. And, still further in preferred form, the interface includes partitions and associated tools for enabling the user to view respective Web page content, such as pages and page links, associated with selected frames, and the frame attributes.
Based on this interface presentation, the user may create search queries for identifying pages which, following iterative processing, may be employed to populate the frames of the organizational structure. In this regard, classification designations, example pages, stop pages and control parameters may be selectively and alternatively combined as required to form query terms employed in the iterative identification process.
Also in this regard, it is to be understood that frame attributes may function as contributors to query terms, and that various query terms may be used for multiple purposes. For example, frame attributes may contribute query terms appropriate for use in generating an initial set of Web pages for consideration, and additionally be employed for determining link weights during computation. More specifically, while frame attributes may define the subject matter categories of the organizational structure; i.e., function as classification designators, and, therefore, be suitable for initially retrieving pages relevant to those categories, the frame attributes as query terms may also be used to increase the weight afforded a link by virtue of the query term falling within a predetermined "window" of text from the link, thereby suggesting heightened relevance for the link by virtue of its proximity to the query term, as will be more fully described in connection with the detailed description of the preferred embodiment hereafter.
Further, frame attributes as query terms may also include, and, indeed, exclusively include identification of example hub pages and authority pages, the identities of which may be made part of a query to bias the relevance computation in desired directions. Additionally, and as noted, query terms may also include stop pages, i.e., identification of pages for avoidance which have been found to bias the relevance computation in undesired directions, as well as control parameters helpful for managing the extent and amount of CPU, memory and storage resources used during searching, as are well known in the art.
Also in preferred form, computation of Web page relevance is undertaken by defining a Web page and its associated links as embracing a hub page and/or an authority page, wherein a hub page "points to"; i.e., links to, one or more authority pages, and an authority page is "pointed to"; i.e., linked to, by one or more hub pages. In this regard, and as noted, usage of the term "Web page" applies to part of a page, a whole page, and a combination of pages which may, respectively, constitute one or more statements of authority that form a unit of reference.
Continuing, the method includes steps for constructing a "root set" of Web pages likely to be relevant to a topic selected by the user. The root set is developed by first generating an initial set of Web pages with the use of a conventional query derived from the local and inherited attributes of the category frame for the database hierarchical organizational structure the user is interested in populating, the query so derived, thereafter, being first applied in conventional fashion against the World Wide Web. As described, frame attributes may selectively include frame classification designations, example pages, stop pages, and/or control parameters, as required.
Following return of the initial set of pages responsive to the query, the initial set is supplemented based on whether example pages and/or stop pages were specified. Particularly, in the case where example hubs were specified, preferably, any page pointed to by an example hub is used to supplement the initial set; i.e., brought into the initial set. Further, in the case where example authority pages were specified, the initial set is preferably supplemented by including any page that points to at least two example authority pages. Additionally, to the extent that stop pages have been specified in the query, such stop pages are eliminated from the initial set. Further, once the initial set is supplemented as described, the supplemented initial set is then expanded by including pages directly linked to pages of the supplemented initial set; i.e., pages that are either pointed to by pages of the supplemented initial set, or pages that point to pages of the supplemented initial set, which, as will be appreciated, would include specified example hub pages and specified example authority pages. Finally, the specified stop pages would again be eliminated from the expanded, supplemented initial set; i.e., root set, to cover the possibility of stop pages having been drawn in during the expansion process.
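By way of illustration only, the supplementing, stop-page elimination and expansion steps described above might be sketched as follows. The function name `build_root_set`, its parameters, and the in-memory `out_links` mapping are hypothetical and are not taken from the specification; the sketch assumes the link structure of all candidate pages has already been fetched.

```python
def build_root_set(initial, out_links, example_hubs=(), example_auths=(), stop_pages=()):
    """Sketch of root-set construction.

    out_links is an assumed in-memory link graph mapping each page to the
    set of pages it points to; a real system would fetch links on demand.
    """
    stop = set(stop_pages)
    auths = set(example_auths)
    pages = set(initial)

    # Supplement: bring in every page pointed to by an example hub.
    for hub in example_hubs:
        pages |= out_links.get(hub, set())

    # Supplement: bring in every page pointing to at least two example authorities.
    for page, targets in out_links.items():
        if len(targets & auths) >= 2:
            pages.add(page)

    # Eliminate stop pages from the supplemented initial set.
    pages -= stop

    # Expand by one link in either direction.
    expanded = set(pages)
    for page, targets in out_links.items():
        if page in pages:
            expanded |= targets      # pages pointed to by the set
        if targets & pages:
            expanded.add(page)       # pages pointing into the set

    # Eliminate stop pages again, since expansion may have drawn them in.
    return expanded - stop
```

In this sketch a stop page that is reachable from the supplemented set is drawn in during expansion and then removed by the final subtraction, which mirrors the second elimination step described above.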
In this regard, the method thus includes steps for generating an initial set of pages based upon frame attributes as described, and then, through an iterative process of issuing queries and following links into and out of already fetched pages, supplementing and expanding the initial set as described to form the "root set" upon which later computation can be performed.
Following creation of the root set, the method includes steps for associating a hub-weight parameter and authority-weight parameter with each Web page, and iteratively calculating the relevance for the pages of the root set based on the resulting, respective, hub-weight and authority-weight values for each page.
In accordance with the method, the hub weights and authority weights of the respective pages are based on summations of respective authority weights and hub weights for the links of the pages. In this regard, and, as will be described hereafter, weights for respective links may be increased to reflect the significance of the link. In accordance with the method, the calculation produces a distribution of scores that represent the degree of relevance for the respective pages, which scores are, thereafter, ordered by numerical value to establish rankings of the pages. Specifically, the computation produces hub and authority weights for all pages, and then returns both a predetermined portion of the highest-ranking hub pages and highest-ranking authority pages.
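As a minimal sketch of the iterative computation just described, the following assumes each link carries a non-negative weight and that scores are rescaled to unit length after each pass. The rescaling, the fixed iteration count, and the names `hub_authority_weights` and `top_ranked` are assumptions made for illustration; the specification does not prescribe them.

```python
import math


def _normalize(scores):
    # Rescale in place to unit Euclidean length (an assumed convention).
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    for page in scores:
        scores[page] /= norm


def hub_authority_weights(link_weights, iterations=50):
    """link_weights maps (source, target) -> weight of that link."""
    pages = {p for edge in link_weights for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: weighted sum of hub weights of pages pointing in.
        auth = {p: 0.0 for p in pages}
        for (src, dst), w in link_weights.items():
            auth[dst] += w * hub[src]
        # Hub weight: weighted sum of authority weights of pages pointed to.
        new_hub = {p: 0.0 for p in pages}
        for (src, dst), w in link_weights.items():
            new_hub[src] += w * auth[dst]
        hub = new_hub
        _normalize(auth)
        _normalize(hub)
    return hub, auth


def top_ranked(scores, k):
    # Order by descending score, breaking ties by page name.
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]
```

Calling `top_ranked` once on the hub scores and once on the authority scores yields the two ranked portions that the method returns.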
In accordance with the invention, the method additionally features steps for improving computational accuracy of the relevance for the Web pages. Specifically, the method features steps executed during the computation of relevance for filtering spurious computational factors such as "self-promotion", "related-page promotion", "hub redundancy", "copied pages" and "false authority." In preferred form, the method includes steps for filtering "self-promotion" from the computation, the steps including the discarding of objectionable links between pages from the same Web site. Further, the method includes steps for filtering "related-page promotion" from the computation, which steps include "re-packing" the Web pages for any Web site having multiple pages showing non-zero authority, during which re-packing all authorities other than the largest are set to zero.
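The two site-based filters above can be illustrated with a short sketch. It assumes, purely for illustration, that a "Web site" can be identified with the host portion of a URL; the helper names `site_of`, `drop_same_site_links` and `repack_authorities` are hypothetical.

```python
from urllib.parse import urlparse


def site_of(url):
    # Assumption: pages belong to the same Web site when their hosts match.
    return urlparse(url).netloc.lower()


def drop_same_site_links(links):
    """Filter "self-promotion": discard links between pages of one site."""
    return [(src, dst) for src, dst in links if site_of(src) != site_of(dst)]


def repack_authorities(auth):
    """Filter "related-page promotion": per site, keep only the largest
    authority weight and set all others to zero."""
    best = {}
    for url, score in auth.items():
        site = site_of(url)
        if site not in best or score > auth[best[site]]:
            best[site] = url
    return {url: (score if best[site_of(url)] == url else 0.0)
            for url, score in auth.items()}
```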
Still further, the method in preferred form also includes steps for filtering "hub redundancy", the steps including identifying the highest weight; i.e., "best," hub during computation, zeroing the authority values of all pages pointed to by that hub, re-computing hub values; and thereafter, outputting the next best hub, zeroing authority values of pages it points to, and so forth.
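The greedy selection described above might be sketched as follows. The sketch assumes that "re-computing hub values" means re-summing the remaining authority values of the pages each hub points to in a single pass; the function name `select_nonredundant_hubs` and the stopping rule (halt once no hub has positive value) are assumptions.

```python
def select_nonredundant_hubs(out_links, auth, k):
    """out_links maps hub -> set of pages it points to; auth maps page -> weight."""
    auth = dict(auth)              # work on a copy so the caller's weights survive
    candidates = set(out_links)
    chosen = []
    while candidates and len(chosen) < k:
        # Re-compute hub values from the authority values that remain.
        hub_score = {h: sum(auth.get(t, 0.0) for t in out_links[h])
                     for h in candidates}
        # Sorting first makes tie-breaking deterministic (by hub name).
        best = max(sorted(candidates), key=lambda h: hub_score[h])
        if hub_score[best] <= 0.0:
            break                  # every remaining hub is redundant
        chosen.append(best)
        candidates.remove(best)
        # Zero the authority values of all pages the chosen hub points to.
        for target in out_links[best]:
            auth[target] = 0.0
    return chosen
```

A hub whose links are all contained in an already selected hub ends up with a value of zero and is therefore never output.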
Regarding "copied pages", the method in preferred form also features steps for diminishing the adverse effect on relevance computation caused by copied pages. Specifically, the method features steps prior to computation of relevance for determining whether two or more pages can be considered copies of one another by means of a "similarity" checking procedure, canceling all but one of the pages, the retained page being deemed the original, redirecting the links to the copies found to the page deemed the original, and increasing the weight of the links from the page deemed the original by adding a factor representing the significance of the multiple copies of the original page having been made. Particularly, in preferred form, the factor used to increase link weight for links of copied pages is made equal to the log of the number of copies found of the page.
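The copied-page handling can be illustrated with the sketch below, which rests on several assumptions not fixed by the text: similarity is measured as Jaccard overlap of three-word shingles with a 0.9 threshold; the first page in sorted URL order is arbitrarily deemed the original; the count of copies includes the retained page; the logarithm is natural; and the factor is added to a default link weight of 1.0 (the specification elsewhere also describes the factor as a multiplier). All function names are hypothetical.

```python
import math


def shingles(text, k=3):
    # Set of overlapping k-word sequences (an assumed similarity basis).
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def similarity(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def merge_copies(contents, links, threshold=0.9, default=1.0):
    """contents maps url -> text; links is a list of (src, dst) pairs.

    Returns a dict mapping redirected (src, dst) links to their weights.
    """
    canonical, originals = {}, []
    for url in sorted(contents):
        match = next((o for o in originals
                      if similarity(contents[url], contents[o]) >= threshold), None)
        if match is None:
            originals.append(url)      # first page seen is deemed the original
            canonical[url] = url
        else:
            canonical[url] = match     # later near-duplicates are cancelled
    copies = {o: 0 for o in originals}
    for original in canonical.values():
        copies[original] += 1

    weights = {}
    for src, dst in links:
        # Redirect links that touch a copy to the page deemed the original.
        src, dst = canonical.get(src, src), canonical.get(dst, dst)
        # Links from an original gain log(number of copies); log(1) == 0.
        weights[(src, dst)] = default + math.log(copies.get(src, 1))
    return weights
```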
And, yet additionally, the method in preferred form features steps for filtering "false authority", the steps including: allowing each link in a Web page to have its own hub value; incrementing the authority value of the destination page with the hub value of the link when authority values are calculated; and re-computing the hub values of the original link with the authority value of the destination page, and accordingly, by a spreading function, the hub values of neighboring links. Furthermore, the final hub value of the page is made the sum of the hub values of its links.
Further, and as noted, in connection with computation of page hub weight and authority weight, respective weights of links within a page may be increased beyond a default value to reflect relevance. For example, first, where a query term appears at a distance "d" within a window "W" of terms from the link, a factor is added to link weight which is made proportional to [W − d]. As will be appreciated, the physical proximity of a search term to a link implies relevance for the link to the search term and, accordingly, the query. Additionally, and thereafter, where copied pages have been found, and all but one deemed the original eliminated, to reflect the significance of the page having been copied, the weight of the links for the retained page is increased, particularly, and as noted, by a multiplication factor equal to the log of the number of copies applied to link weight. Subsequently, and still further, where example pages are used, because of the importance of respective example pages, the weights of the respective links within an example page are likewise increased. More specifically, the weights of all links within example hub pages are increased by a predetermined multiplication factor; and in the case of example authority pages, the weights of links within an authority page are increased by first identifying a page region, and thereafter, applying a multiplication factor to the weight of any link within the region depending on the number of example links found within a window of predetermined size located at such a subject link within the identified region.
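The proximity-based increase in link weight can be illustrated as follows. The sketch assumes the page has been tokenized into a list of terms with the link occupying one token position, that distance d is counted in tokens, that every query-term occurrence within the window contributes (rather than only the nearest), and that the constant of proportionality is a parameter `scale`; these choices and the function name `link_weight` are assumptions.

```python
def link_weight(tokens, link_index, query_terms, window, default=1.0, scale=1.0):
    """Default weight plus a bonus proportional to (W - d) for each query
    term found at distance d <= W from the link."""
    terms = {t.lower() for t in query_terms}
    weight = default
    for i, token in enumerate(tokens):
        d = abs(i - link_index)
        if d <= window and token.lower() in terms:
            weight += scale * (window - d)
    return weight
```

For instance, with a window of 5 terms, a query term two tokens away from the link adds 3 to the default weight of 1, giving a link weight of 4 under these assumptions.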
Still further, in preferred form, the method in accordance with the invention includes steps for ranking the pages of the root set based on relevance following computation of page hub and authority weights, and, thereafter, truncating the root set to a number of highest-ranking pages prescribed by the user.