The present invention is related to the field of analysis of linked collections of documents, and in particular to the clustering of collections of linked documents, e.g. World Wide Web Sites, to identify collections having similar content.
The ever-increasing universe of electronic information, for example as found on the World Wide Web (hereinafter referred to as the Web), competes for the effectively fixed and limited attention of people. Both consumers and producers of information want to understand what kinds of information are available, how desirable that information is, and how its content and use change through time.
Making sense of very large collections of linked documents, and foraging for information in such environments, is difficult without specialized aids. Collections of linked documents are often connected together using hypertext links. The basic structure of linked hypertext is designed to promote the process of browsing from one document to another along hypertext links, which is unfortunately very slow and inefficient when hypertext collections become very large and heterogeneous. Two sorts of aids have evolved in such situations. The first are structures or tools that abstract and cluster information in some form of classification system. Examples of such would be library card catalogs and the Yahoo! Web site (URL: http://www.yahoo.com). The second are systems that attempt to predict the information relevant to a user's needs and to order the presentation of information accordingly. Examples would include search engines such as Lycos (URL: http://www.lycos.com), which take a user's specification of an information need, in the form of words and phrases, and return ranked lists of documents that are predicted to be relevant to the user's need.
Another system which provides aids in searching for information on the Web is the "Recommend" feature provided on the Alexa Internet Web site (URL: http://www.alexa.com). The "Recommend" feature provides a list of related Web pages that a user may want to retrieve and view based on the Web page that they are currently viewing.
It has been determined that one way to facilitate information seeking is through automatic categorization of Web pages. One technique for categorization of Web pages is described by P. Pirolli, J. Pitkow and R. Rao in the publication entitled "Silk from a Sow's Ear: Extracting Usable Structures from the Web", Conference on Human Factors in Computing Systems (CHI 96), Vancouver, British Columbia, Canada, April 1996. Described therein is a categorization technique wherein each Web page is represented as a feature vector, with features extracted from information about text-content similarity, hypertext connections, and usage patterns. Web pages belonging to the same category may then be clustered together. Categorization is computed based on inter-document similarities among these feature vectors.
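By way of illustration only, the notion of inter-document similarity among feature vectors may be sketched as follows. The sketch assumes cosine similarity as the similarity measure and uses hypothetical, already-normalized feature values (a text-content score, a link score, and a usage score); neither the measure nor the page names and values are taken from the cited publication.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors):
    """Map each unordered pair of page names to the similarity of their vectors."""
    names = sorted(vectors)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical pages; each vector is [text-content score, link score, usage score].
pages = {
    "home.html": [0.9, 0.8, 0.7],
    "index.html": [0.8, 0.9, 0.6],
    "paper.html": [0.1, 0.0, 0.9],
}

similarities = pairwise_similarities(pages)
for pair, value in sorted(similarities.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

Pages whose vectors point in nearly the same direction receive a similarity close to 1 and would be candidates for the same category under this assumed measure.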
Another aid for making sense of such collections is clustering. One way to approach the automatic clustering of linked documents is to adapt the existing approaches of clustering standard text documents. Such an approach is described by Cutting et al., in the publication entitled "Scatter/gather: A cluster based approach to browsing large document Collections", The 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, August 1992. However, such existing text-clustering techniques are impractical in several respects. Text-based clustering typically involves computing inter-document similarities based on content-word frequency statistics. Not only is this often expensive, but, more importantly, the technique was developed and tuned for human-readable texts. It appears, though, that the proportion of human-readable source files for Web pages is decreasing with the infusion of dynamic and programmed pages.
Another option for performing clustering of document collections is to look at usage patterns. Unfortunately, any clustering based on usage patterns requires access to data that is not usually recorded in any easily accessible format. In the case of the Web, while a moderate amount of usage information is recorded for each requested document at a particular Web site, the log files for other sites are not publicly accessible. Thus, while the usage for a particular site can be ascertained, this information is not available for the other 500,000 Web sites that currently exist.
Other attempts at clustering hypertext typically utilize the hypertext link topology of the collection. Such techniques are described by R. A. Botafogo, E. Rivlin, and B. Shneiderman, Structural Analysis of Hypertexts: Identifying Hierarchies And Useful Metrics, ACM Transactions on Information Systems, 10(2):142-180, 1992. Such a basis for clustering makes intuitive sense, since the links of a particular document represent what the author felt was of interest to the reader of the document. These known clustering methods have been applied to collections with several hundred elements, and do not appear to scale gracefully to large heterogeneous collections like the Web, which has been estimated to currently contain over 70 million text-based documents.
While clustering or categorizing at the document level is important, it is also desirable to be able to cluster at the collection or web site level. Search engines such as Yahoo! can return results to queries separated into categories, web sites and web pages. While this may provide a categorization of sorts relative to the search terms, it does not take into account any relationships that the sites themselves may define. For example, a response to a query may return sites A and B. Suppose site A contains links to site X and site B contains a link to site Y. The set generated as a result of the query does not indicate the relationships between sites A and X or sites B and Y. This is important because the inconsistent use of terms as meta-data for the sites could result in some relevant sites being missed.
Other publications relevant to the invention of the present application:
Larson, Ray R., Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace, Proceedings of 59th ASIS Annual Meeting held in Baltimore, Md., edited by Steve Hardin, Vol. 33:71-78, Information Today Inc., 1996.
A method and apparatus for identifying collections of linked documents is disclosed. In the method, the links from a set of related documents are analyzed to identify a plurality of document collections. By analyzing only the link structure, a processing-intensive content analysis of the documents is avoided. A citation analysis technique, such as co-citation analysis, is performed on the set of documents to extract link information indicating links and link frequency between document collections. For co-citation analysis, that information would include, for each pair of document collections, the frequency with which both are linked to by another document collection. By using the link information, related document collections may then be identified using a suitable analysis technique, such as clustering or spreading activation.
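By way of illustration only, the co-citation frequency described above may be sketched as follows. The sketch assumes that the link information has already been reduced to a mapping from each linking collection to the collections it links to; the function name cocitation_counts and the collection names S1, S2, S3, A, B and C are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links):
    """Count, for each pair of collections, how many other collections link to both.

    `links` maps a linking collection to an iterable of the collections it
    links to. Self-links and duplicate links are ignored (an assumption of
    this sketch). Keys of the result are alphabetically ordered pairs.
    """
    counts = Counter()
    for source, targets in links.items():
        cited = sorted(set(targets) - {source})
        for a, b in combinations(cited, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical link data: S1 links to A, B and C; S2 to A and B; S3 to B and C.
links = {
    "S1": ["A", "B", "C"],
    "S2": ["A", "B"],
    "S3": ["B", "C"],
}

counts = cocitation_counts(links)
for pair, frequency in sorted(counts.items()):
    print(pair, frequency)
# ('A', 'B') 2
# ('A', 'C') 1
# ('B', 'C') 2
```

In this hypothetical data, collections A and B are co-cited twice (by S1 and by S2), which under co-citation analysis suggests a stronger relationship than that between A and C.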
The method of the present invention is preferably practiced on the documents and document collections found on the World Wide Web (web sites) and utilizes a form of citation analysis known as co-citation analysis. The method is generally comprised of the steps of: obtaining a list of web pages having some predetermined relationship; extracting a set of web sites from said list of web pages; creating a co-citation list, said co-citation list comprised of pairs of web sites that are linked to by the same web site and the frequency or number of occurrences of links from the same web site; and performing a suitable operation utilizing said co-citation list, such as clustering or spreading activation, to find related web sites.
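By way of illustration only, the step of extracting web sites from a list of web pages and the step of operating on the co-citation list may be sketched as follows. The sketch assumes that a web site is identified by the host name of a page URL, that the co-citation list is supplied as (site, site, frequency) triples, and that the "suitable operation" is a simple single-link grouping of sites whose co-citation frequency meets a threshold. The function names, the threshold and the example.* host names are hypothetical, and the grouping shown is one possible clustering, not necessarily the one used in the preferred embodiment.

```python
from urllib.parse import urlsplit


def site_of(page_url):
    """Reduce a page URL to its lower-cased host name (assumed site identifier)."""
    return urlsplit(page_url).hostname or ""


def extract_sites(page_urls):
    """Return the distinct sites for a list of page URLs, in first-seen order."""
    seen = {}
    for url in page_urls:
        seen.setdefault(site_of(url), None)
    return list(seen)


def cluster_sites(cocitation_list, min_frequency=2):
    """Group sites joined by co-citation pairs with frequency >= min_frequency."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, frequency in cocitation_list:
        root_a, root_b = find(a), find(b)  # registers both sites
        if frequency >= min_frequency:
            parent[root_a] = root_b

    groups = {}
    for site in parent:
        groups.setdefault(find(site), set()).add(site)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical page list and co-citation list.
pages = [
    "http://a.example/x.html",
    "http://a.example/y.html",
    "http://B.example/",
]
sites = extract_sites(pages)

cocitation_list = [
    ("a.example", "b.example", 3),
    ("b.example", "c.example", 2),
    ("c.example", "d.example", 1),
]
clusters = cluster_sites(cocitation_list, min_frequency=2)
print(sites)     # ['a.example', 'b.example']
print(clusters)  # [['a.example', 'b.example', 'c.example'], ['d.example']]
```

Raising the hypothetical min_frequency threshold yields smaller, more tightly related groups of sites, while lowering it merges more sites into each group.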