The World Wide Web (WWW) is a networked collection of hyperlinked documents available over the Internet. The WWW exemplifies the problem of meaningfully selecting information resources that pertain to a given topic. In particular, users need a way to manage information pertaining to a lasting interest in a broad topic, and a way to render the vast collections of resources pertaining to such a topic comprehensible. Search engines and indices such as Yahoo! are the most widespread attempts at a solution (see http://www.yahoo.com, visited Jan. 26, 1998).
A resource is an embodiment of information or an embodiment of a collection of information. Examples of resources include files, text documents, multimedia documents, a hyperlinked file (such as a web page), and a collection of hyperlinked files (such as a web site, or simply "site").
Known techniques that address the problem of selecting information resources from a large collection typically raise the level of abstraction at which users interact with the collection. Researchers have sought to define useful, higher-level structures that can be extracted from hypertext collections, such as "collections", "localities", "patches" or "books." See Pitkow, J. and Pirolli, P., "Life, Death, and Lawfulness on the Electronic Frontier," Mar. 22-27, 1997, pages 383-390; Pirolli, P., Pitkow, J., and Rao, R., "Silk From a Sow's Ear: Extracting Usable Structures from the Web," Apr. 13-18, 1996, pages 118-125; and Card, S. K., Robertson, G. G., and York, W., "The WebBook and the Web Forager: An Information Workspace for the World-Wide Web," Apr. 13-18, 1996, pages 111-117. This approach opens up four major avenues of innovation: definitions of new structures, algorithms to extract the structures, visualization techniques that enable users to comprehend the structures, and interface techniques that create a workspace in which it is easy to specify, modify, and experiment with the structures.
Pitkow and Pirolli report clustering algorithms based on co-citation analysis. See Pitkow, J. and Pirolli, P., "Life, Death, and Lawfulness on the Electronic Frontier," Mar. 22-27, 1997, pages 383-390. See also Garfield, E., Citation Indexing, ISI Press, 1979. The intuition is that if two documents, say A and B, are both cited by a third document, this is evidence that A and B are related; the more often a pair of documents is co-cited, the stronger the relationship. They applied two such algorithms to Georgia Tech's Graphics, Visualization, and Usability Center web site and were able to identify interesting clusters.
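The pair-counting step at the heart of co-citation analysis can be sketched as follows. This is a minimal illustration, not the cited algorithm; the function name and the input mapping (from each citing document to the set of documents it cites) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_counts(citations):
    """Count how often each pair of documents is cited together.

    `citations` maps a citing document to the set of documents it
    links to. The more often a pair is co-cited, the stronger the
    evidence that the two documents are related.
    """
    counts = defaultdict(int)
    for cited in citations.values():
        # Every pair cited by the same document gains one unit of evidence.
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return dict(counts)

# Example: A and B are co-cited twice, A and C only once.
counts = cocitation_counts({"X": {"A", "B"},
                            "Y": {"A", "B", "C"},
                            "Z": {"B", "C"}})
```

A clustering algorithm would then treat the resulting counts as edge weights in a similarity graph and group strongly connected documents.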
Card, Robertson, and York describe the WebBook, which uses a book metaphor to group a collection of related web pages for viewing and interaction, and the WebForager, an interface that lets users view and manage multiple WebBooks. See Card, S. K., Robertson, G. G., and York, W., "The WebBook and the Web Forager: An Information Workspace for the World-Wide Web," Apr. 13-18, 1996, pages 111-117. They also present a set of automatic methods for generating such collections of related pages, such as recursively following all relative links from a specified web page, following all (absolute) links from a page to a depth of one level, extracting "book-like" structures by following "next" and "previous" links, and grouping pages returned from a search query.
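The first of these methods, recursively following relative links, can be sketched as below. This is an assumption-laden illustration rather than the published method: `get_links` is a hypothetical callback standing in for fetching and parsing a page, and "relative" is approximated as "resolves to the same host as the start page".

```python
from urllib.parse import urljoin, urlparse

def gather_relative(start_url, get_links):
    """Collect pages reachable from `start_url` by following only
    links that stay on the starting page's host."""
    start_host = urlparse(start_url).netloc
    collected, stack = set(), [start_url]
    while stack:
        url = stack.pop()
        if url in collected:
            continue
        collected.add(url)
        for href in get_links(url):
            target = urljoin(url, href)
            # Keep only links that resolve to the same host.
            if urlparse(target).netloc == start_host:
                stack.append(target)
    return collected
```

With a link table in place of real HTTP fetches, a chapter chain `index.html -> ch1.html -> ch2.html` is gathered into one "book" while an absolute link to another host is ignored.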
Pirolli, Pitkow, and Rao defined a set of functional roles that web pages can play, such as "head" (roughly the "front door" of a group of related pages), "index", and "content". See Pirolli, P., Pitkow, J., and Rao, R., "Silk From a Sow's Ear: Extracting Usable Structures from the Web," Apr. 13-18, 1996, pages 118-125. They then developed an algorithm that used hyperlink structure, text similarity, and user access data to categorize pages into the various roles. They applied these algorithms to the Xerox web site and were able to categorize pages with good accuracy.
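A drastically simplified sketch of role categorization from such features is shown below. The thresholds and the feature set (in-link count, out-link count, text length) are illustrative assumptions only; the published algorithm combines hyperlink structure, text similarity, and user access data in a more principled way.

```python
def classify_page(in_links, out_links, text_length):
    """Assign one of the functional roles ("head", "index", "content")
    from simple link and text features. Thresholds are invented for
    illustration, not taken from the cited work."""
    if out_links >= 20 and text_length < 2000:
        return "index"      # many outgoing links, little body text
    if in_links >= 20 and out_links >= 5:
        return "head"       # heavily linked-to "front door" page
    return "content"        # default: a substantive content page
```

For instance, a page with 30 outgoing links and little text would be labeled an "index", while a long page with few links would fall through to "content".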
Mackinlay, Rao, and Card developed a novel interface for accessing articles from a citation database. See Mackinlay, J.D., Rao, R. and Card, S.K., "An Organic User Interface for Searching Citation Links," May 7-11, 1995, pages 67-73. The central user interface object is a "butterfly", which represents one article, its references, and its citers. The interface makes it easy for users to browse from one article to a related one, group articles, and generate queries to retrieve articles that stand in a particular relationship to the current article.
Mukherjea et al. (Mukherjea, S., Foley, J. D., and Hudson, S., "Visualizing Complex Hypermedia Networks Through Multiple Hierarchical Views," May 7-11, 1995, pages 331-337) and Botafogo et al. (Botafogo, R. A., Rivlin, E., and Shneiderman, B., "Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics," Apr. 1992, pages 142-180) report on algorithms for analyzing arbitrary networks, splitting them into structures (such as "pre-trees" or hierarchies) that are easier for users to visualize and navigate.
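One elementary way to impose a hierarchy on an arbitrary link network is to extract a breadth-first spanning tree rooted at a chosen page, so that each page's parent is the page through which it was first reached. The sketch below illustrates that idea only; it is not the specific algorithm of either cited paper.

```python
from collections import deque

def bfs_hierarchy(graph, root):
    """Reduce an arbitrary link graph (dict: node -> list of linked
    nodes) to a tree rooted at `root` via breadth-first search."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in parent:
                # First discovery determines the node's place in the tree.
                parent[child] = node
                queue.append(child)
    return parent
```

Cross-links and cycles in the original network are simply dropped, which is what makes the resulting structure easy to visualize as a tree.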
Other efforts propose novel ways to view and navigate information structures. The Navigational View Builder combines structural and content analysis to support four viewing strategies: binding, clustering, filtering, and hierarchization. See Mukherjea, S., Foley, J. D., and Hudson, S., "Visualizing Complex Hypermedia Networks Through Multiple Hierarchical Views," May 7-11, 1995, pages 331-337. Through the extensive use of single user operations that act on multiple windows at once, the Elastic Windows browser provides an efficient overview of, and a sense of current location within, information structures. See Kandogan, E. and Shneiderman, B., "Elastic Windows: A Hierarchical Multi-Window World-Wide Web Browser," Proceedings of UIST '97. Lamping et al. explored hyperbolic tree visualization of information structures. See Lamping, J., Rao, R., and Pirolli, P., "A Focus+Context Technique Based on Hyperbolic Geometry for Visualizing Large Hierarchies," May 7-11, 1995, pages 401-408. A product called "twURL" helps users view and manage collections of URLs, organizing them into outlines based on properties such as server, domain, and number of incoming links. See "What is twURL?" http://www.roir.com/whatis.htm. Furnas presents a theory of how to create structures that are easy for users to navigate. See Furnas, G. W., "Effective View Navigation," Mar. 22-27, 1997, pages 367-374.
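Grouping a flat list of URLs by a property such as server host, as twURL does for its outlines, is straightforward to sketch. The function below is a minimal illustration of that kind of property-based organization, not twURL's implementation.

```python
from collections import defaultdict
from urllib.parse import urlparse

def outline_by_host(urls):
    """Group a flat list of URLs into an outline keyed by server host."""
    outline = defaultdict(list)
    for url in urls:
        outline[urlparse(url).netloc].append(url)
    # Sort each host's entries for a stable outline display.
    return {host: sorted(entries) for host, entries in outline.items()}
```

Analogous groupings by domain suffix or by number of incoming links would only require swapping the key function.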
Somewhat less directly related are the SenseMaker and Scatter/Gather systems. See Pirolli, P., Schank, P., Hearst, M., and Diehl, C., "Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection," Apr. 13-18, 1996, pages 213-220. SenseMaker supports users in the contextual evolution of their interest in a topic; the focus is on making it easy for users to view and manage the results of a query and to create new queries based on the existing context. Scatter/Gather supports the browsing of large collections of text, allowing users to iteratively reveal topic structure and locate desirable documents.
The purely structural analysis performed by known techniques can be somewhat useful, but when applied to the WWW such techniques concentrate on links between individual pages while ignoring the fact that numerous pages are related to each other (e.g., they belong to the same group of pages called a "site"). This discards critical contextual and/or functional information about the web pages, information that can be valuable in helping the user interact intelligently with on-topic information on the WWW. The functional roles a web page can play should also be taken into account in solving this problem.
Most known WWW techniques either take a base collection of pages as given (often all the web pages rooted at a particular URL like www.xerox.com), or focus on methods for supporting users in creating base collections. A base collection can be thought of partly as a starting point for gathering information on a topic. For example, a user can identify several key web sites or articles that are highly pertinent to a topic of interest as the base collection. As used herein, the term "seed document" refers to a resource in a base collection. The base collection is referred to as the "seed set." Some automated techniques for creating WWW seed sets are known, but the basic unit out of which the sets are built is a single web page. Thus, the resulting sets are local, and do not generally take into account the larger collection of resources (e.g., the entire WWW) of which they are typically a part. Indeed, some known techniques are limited to more or less a single site. A more global solution that encompasses more far-flung information is needed. Another important problem is that large collections of hyperlinked resources such as the WWW often consist of many ecologies of dynamic, evolving documents. Any technique for selecting topic-relevant resources from such a collection should maintain an acceptable level of accuracy in such an environment.