1. Field of Invention
The present invention relates generally to the field of determining communities of hyperlinked documents. More specifically, the present invention is related to determining communities of hyperlinked documents based on the relationships of the links between the documents and the structure of the documents.
2. Discussion of Prior Art
As hyperlinked environments grow in size and complexity, it becomes increasingly difficult to locate documents relevant to a given query. One such environment, which is growing at a phenomenal rate, is the world wide web (WWW). As millions of on-line participants continually create hyperlinked content, there is no capability to impose a global structure; consequently, efficiently finding the most relevant documents for a broad-topic search through traditional search methods, e.g., text-based queries, becomes a much more difficult challenge. For example, a user searching for information about Harvard University on the WWW utilizing a text search would receive over 80,000 pages from the search. This number of returned pages is unmanageable for the user, and determining which ones are the most relevant would consume a considerable amount of the user's time. What the user requires is a way to locate the most central, or authoritative, pages on the topic "Harvard."
An algorithm for locating authoritative documents within a hyperlinked environment has been proposed by Jon Kleinberg in a recent paper, incorporated herein by reference, "Authoritative Sources in a Hyperlinked Environment," Proc. ACM-SIAM Symposium on Discrete Algorithms, May 1997 (also appears as IBM Research Report RJ 10076, May 1997 and is additionally available at http://www.cs.cornell.edu/home/kleinber/ on the world wide web). Kleinberg's algorithm is based on two premises. First, the implicit annotation provided by human creators of hyperlinks contains sufficient information to obtain a notion of authority. Second, sufficiently broad topics contain communities of hyperlinked pages. These communities comprise two sets of inter-related pages. One set comprises authorities (i.e., highly referenced pages) on the topic. The second set comprises pages which "point" to many of the authorities. This second set is referred to as hubs because the elements of the set represent strong central points which confer authority on the relevant pages. The two sets of pages exhibit a mutually reinforcing relationship, that is, a good hub points to many authorities while a good authority is pointed to by many hubs. This notion of hubs and authorities is utilized to determine the pages which are the most relevant on a broad topic by using an iterative algorithm to break the apparent circularity of hubs and authorities.
Increasingly, web pages are being viewed with devices other than regular desktops and standard browsers. Cell phones, palm-top computers with limited screen space and speech-based devices are a few of the alternative devices becoming prevalent. In addition, there are moves to ensure web page content is available for users with limited abilities (blind, dyslexic, illiterate, etc.). The World Wide Web Consortium Accessibility Initiative provides the documents "Web Content Accessibility Guidelines 1.0" and "Techniques for Web Content Accessibility Guidelines 1.0," both of which are incorporated herein by reference, which describe how to format pages in structured forms so that clients on the alternative devices can process the pages. The current recommendation and notes, respectively, are available from the W3C and, additionally, at http://www.w3.org/WAI/GL/WCAG10 for "Web Content Accessibility Guidelines 1.0" and http://www.w3.org/TR/1999/WAI-WEBCONTENT-TECHS-19990505/ for "Techniques for Web Content Accessibility Guidelines 1.0." To illustrate, one of the recommendations is the use of ALT text for images, which allows browsers or support programs sitting on the client side or on proxy servers to present the information contained in figures using visually displayed text, synthesized speech or braille. For client side programs to process a page, the most important aspect of the web page is that it should follow a more stringent structure format than that allowed for traditional browsers. Poorly formed pages, while they may contain useful information, are essentially useless for clients with limited capabilities because the transform engines that pre-process these pages for rendering cannot perform an adequate job. Kleinberg's algorithm determines authoritative pages irrespective of their structure. However, some of the authoritative pages are essentially useless to the individual who wishes to view them.
Therefore, there is a need to return the authoritative pages which are also the most usable, i.e., poorly formed pages need to be penalized because such pages may not be presentable (visually, audibly, tactilely, etc.) in a manner appropriate for the limited abilities of the browser or the user.
A method is presented for determining which documents of a hyperlinked environment are authorities on a given topic and most closely meet guidelines related to document structure. A relatively small base set of documents is obtained which is relevant to a given topic and contains many of the strongest authorities on the topic. Each document within the set is evaluated and given a structure score which reflects how well-formed the document is. Each document within the set also has corresponding hub and authority weights which are updated and maintained to determine the strongest authorities. The initial hub and authority weights of each document are set to the corresponding structure score of the document. An iterative algorithm is then utilized to determine the strongest authorities. For each round of the algorithm, the authority weight of a document is updated by summing the hub weights of each document pointing to the document, while the hub weight of a document is updated by summing the authority weights of each document which is pointed to by the document whose hub weight is being determined. After a series of iterations, the documents having the highest authority weights are identified as the strongest authorities on the query topic.
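The iterative update described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the dictionary-based link representation, the default of 20 rounds, and the per-round normalization to unit length are all assumptions introduced here for illustration (the normalization only keeps the weights bounded; the text above does not specify one). The sketch also assumes the hub update uses the authority weights just computed in the same round.

```python
from math import sqrt


def normalize(weights):
    """Scale weights to unit Euclidean length (an assumption, not from the text)."""
    norm = sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {doc: w / norm for doc, w in weights.items()}


def rank_authorities(links, structure_scores, iterations=20):
    """Rank documents of a base set by authority weight.

    links: dict mapping each document to the documents it points to.
    structure_scores: dict mapping each document to its structure score.
    Both names are hypothetical; only links inside the base set are used.
    """
    docs = list(structure_scores)
    # Initial hub and authority weights are the structure scores.
    hub = dict(structure_scores)
    auth = dict(structure_scores)
    for _ in range(iterations):
        # Authority weight: sum of hub weights of documents pointing here.
        new_auth = {doc: 0.0 for doc in docs}
        for src in docs:
            for dst in links.get(src, ()):
                if dst in new_auth:
                    new_auth[dst] += hub[src]
        # Hub weight: sum of authority weights of documents pointed to.
        new_hub = {
            doc: sum(new_auth[dst] for dst in links.get(doc, ()) if dst in new_auth)
            for doc in docs
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    ranking = sorted(docs, key=lambda doc: auth[doc], reverse=True)
    return ranking, auth, hub


if __name__ == "__main__":
    # Hypothetical four-document base set: two hubs and two authorities.
    example_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    example_scores = {"h1": 1.0, "h2": 1.0, "a1": 1.0, "a2": 1.0}
    ranking, auth, hub = rank_authorities(example_links, example_scores)
    print(ranking)
```

In this toy graph the document pointed to by both hubs ends up with the highest authority weight; the documents at the head of the returned ranking correspond to the strongest authorities.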
In a further embodiment, the base set of documents is obtained by obtaining a root set of documents and determining the base set from the root set. A root set is first obtained by taking a given number of the highest ranked documents returned from a text-based searching and ranking system. The base set is generated from the root set by including documents which are linked to documents within the root set.
In a further embodiment, the number of documents included within the base set is limited so as to maintain a relatively small base set. All documents outside of the root set which are pointed to by documents within the root set are included. However, only a limited number of documents outside of the root set which point to documents within the root set are included.
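The expansion from root set to base set described in the two preceding paragraphs might be sketched as follows. The function name, the `out_links` and `in_links` lookups, and the `max_in_per_doc` parameter are hypothetical; in particular, applying the limit separately to each root document is an assumption, since the text above states only that a limited number of in-pointing documents is included.

```python
def build_base_set(root, out_links, in_links, max_in_per_doc=50):
    """Expand a root set into a base set.

    root: iterable of root-set documents.
    out_links: dict mapping a document to the documents it points to.
    in_links: dict mapping a document to the documents pointing to it.
    max_in_per_doc: assumed cap on in-pointing documents per root document.
    """
    root_set = set(root)
    base = set(root_set)
    for doc in root_set:
        # Every document pointed to by a root document is included.
        base.update(out_links.get(doc, ()))
        # Only a limited number of outside documents pointing in are included.
        pointing = [d for d in in_links.get(doc, ()) if d not in root_set]
        base.update(pointing[:max_in_per_doc])
    return base


if __name__ == "__main__":
    # Hypothetical link data for a two-document root set.
    example_out = {"r1": ["x", "y"], "r2": ["r1"]}
    example_in = {"r1": ["p1", "p2", "p3", "r2"], "r2": []}
    print(sorted(build_base_set(["r1", "r2"], example_out, example_in, max_in_per_doc=2)))
```

Capping the in-pointing documents keeps a heavily referenced root document from inflating the base set, which is the stated purpose of the limit.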
In a further embodiment, the structure score is determined by evaluating each document within the set according to a set of parameters. For each parameter, the document is assigned a parameter score. These parameter scores are then weighted and summed to obtain the document's structure score.
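The weighted sum described above can be written in a few lines of Python. This is an illustrative sketch only: the parameter names (`alt_text`, `valid_markup`) and the weight values are hypothetical and are not taken from the text, which does not enumerate the parameters at this point.

```python
def structure_score(parameter_scores, weights):
    """Weighted sum of per-parameter scores for one document.

    parameter_scores: dict mapping a parameter name to the document's score.
    weights: dict mapping the same parameter names to their weights.
    """
    return sum(weights[name] * score for name, score in parameter_scores.items())


if __name__ == "__main__":
    # Hypothetical parameters and weights, for illustration only.
    doc_scores = {"alt_text": 1.0, "valid_markup": 0.5}
    doc_weights = {"alt_text": 0.6, "valid_markup": 0.4}
    print(structure_score(doc_scores, doc_weights))
```

The resulting value would serve as the initial hub and authority weight of the document in the iterative algorithm.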