The ever-increasing universe of electronic information, for example as found on the World Wide Web (hereinafter referred to as the Web), competes for the effectively fixed and limited attention of people. Both consumers and producers of information want to understand what kinds of information are available, how desirable that information is, and how its content and use change through time.
Making sense of very large collections of linked documents and foraging for information in such environments is difficult without specialized aids. The documents in such collections are often connected using hypertext links. The basic structure of linked hypertext is designed to promote browsing from one document to another along hypertext links, which unfortunately becomes very slow and inefficient when hypertext collections grow very large and heterogeneous. Two sorts of aids have evolved in such situations. The first are structures or tools that abstract and cluster information in some form of classification system. Examples include library card catalogs and the Yahoo! Web site. The second are systems that attempt to predict the information relevant to a user's needs and to order the presentation of information accordingly. Examples include search engines such as Lycos, which take a user's specification of an information need, in the form of words and phrases, and return ranked lists of documents that are predicted to be relevant to that need.
Another system that aids in searching for information on the Web is the "Recommend" feature provided on the Alexa Internet Web site (URL: http://www.alexa.com). Based on the Web page that a user is currently viewing, the "Recommend" feature provides a list of related Web pages that the user may want to retrieve and view.
It has been determined that one way to facilitate information seeking is through automatic categorization of Web pages. One technique for categorization of Web pages is described by P. Pirolli, J. Pitkow and R. Rao in the publication entitled Silk from a Sow's Ear: Extracting Usable Structures from the Web, Conference on Human Factors in Computing Systems (CHI 96), Vancouver, British Columbia, Canada, April 1996. Described therein is a categorization technique wherein each Web page is represented as a feature vector, with features extracted from information about text-content similarity, hypertext connections, and usage patterns. Categorization is computed based on inter-document similarities among these feature vectors. Web pages belonging to the same category may then be clustered together.
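By way of illustration only, the comparison of feature vectors described above may be sketched as follows. The sketch assumes cosine similarity as the inter-document similarity measure, which is a choice made here for illustration and is not stated above to be the measure used in the cited publication. The page names, the four feature dimensions, and all feature values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors. Each position stands for one assumed feature:
# [text-content weight, is-dynamic flag, inbound link count, request count]
pages = {
    "index.html": [1.0, 0.0, 5.0, 2.0],
    "about.html": [0.9, 0.1, 4.0, 2.0],
    "search.cgi": [0.0, 1.0, 0.0, 9.0],
}


def most_similar(name, pages):
    """Return (other_page, similarity) for the page most similar to `name`."""
    target = pages[name]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in pages.items()
        if other != name
    ]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(most_similar("index.html", pages))
```

In this sketch, pages whose vectors point in nearly the same direction receive a similarity close to 1 and would be candidates for the same category. In practice the features would need to be scaled so that no single dimension dominates the comparison.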
Another aid for making sense of such collections is clustering. One way to approach the automatic clustering of linked documents is to adapt the existing approaches for clustering standard text documents. Such an approach is described by Cutting et al., in the publication entitled "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", The 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, August 1992. However, such existing text-clustering techniques are impractical in several respects. Text-based clustering typically involves computing inter-document similarities based on content-word frequency statistics. Not only is this often expensive, but, more importantly, such techniques were developed and tuned for human-readable texts. It appears, though, that the proportion of human-readable source files for Web pages is decreasing with the infusion of dynamic and programmed pages.
Another option for performing clustering of document collections is to look at usage patterns. Unfortunately, any clustering based on usage patterns requires access to data that is not usually recorded in any easily accessible format. In the case of the Web, while a moderate amount of usage information is recorded for each requested document at a particular Web site, the log files for other sites are not publicly accessible. Thus while the usage for a particular site can be ascertained, this information is not available for the other 500,000 Web sites that currently exist.
Other attempts at clustering hypertext typically utilize the hypertext link topology of the collection. Such techniques are described by R. A. Botafogo, E. Rivlin, and B. Shneiderman, Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics, ACM Transactions on Information Systems, 10(2):142-180, 1992. Such a basis for clustering makes intuitive sense, since the links of a particular document represent what the author felt was of interest to the reader of the document. However, these known clustering methods have been applied only to collections with several hundred elements, and do not seem suited to scale gracefully to large heterogeneous collections like the Web, which has been estimated to currently contain over 70 million text-based documents.
While clustering or categorizing at the document level is important, it is also desirable to be able to cluster at the collection or Web site level. Search engines such as Yahoo!® can return results to queries separated into categories, Web sites and Web pages. While this may provide a categorization of sorts relative to the search terms, it does not take into account any relationships that the sites themselves may define. For example, a response to a query may return sites A and B. Suppose site A contains a link to site X and site B contains a link to site Y. The set generated as a result of the query does not indicate the relationships between sites A and X or sites B and Y. This is important because the inconsistent use of terms as metadata for the sites could result in some relevant sites being missed.
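The example of sites A, B, X and Y may be made concrete with the following minimal sketch, which is offered for illustration only. It assumes that site-to-site links are already available as a simple mapping; the mapping, the function name and the site labels are hypothetical.

```python
# Hypothetical site-level link data: each site maps to the sites it links to.
site_links = {
    "A": {"X"},
    "B": {"Y"},
}


def linked_sites(query_results, site_links):
    """For each site returned by a query, list the sites it links to."""
    return {
        site: sorted(site_links.get(site, set()))
        for site in query_results
    }


if __name__ == "__main__":
    # A query that matched only sites A and B on their terms.
    print(linked_sites(["A", "B"], site_links))
```

Sites X and Y appear in the output only because of the links from A and B, not because they matched the query terms, which is the kind of relationship a term-based result set alone does not expose.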
Other publications relevant to the invention of the present application:
Larson, Ray R., Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace, Proceedings of the 59th ASIS Annual Meeting held in Baltimore, Md., edited by Steve Hardin, Vol. 33:71-78, Information Today Inc., 1996.