1. Field of the Invention
The present invention relates to a method of processing a corpus of electronically stored documents, and in particular to a method of processing arbitrarily large document collections useful in document browsing.
2. Description of Related Art
Document browsing is a powerful tool for accessing large text collections. Browsing, which is distinguished from searching in that it is query-free, works well for information needs that are either too general or too vague to be usefully expressed as a query in some search language. For example, a user may be unfamiliar with vocabulary appropriate for describing a topic of interest, or may not wish to commit to a particular choice of words. Indeed, a user may not be looking for anything specific at all, but instead may wish to explore the general information content of the collection. Helpful in this context is an information access system including a navigable collection outline that both suggests the collection's contents and allows a user to focus attention on some topic-coherent subset of the contents.
One such browsing system is described in a paper entitled "Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections," Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pages 318-329, June 1992, by Cutting, Karger, Pedersen, and Tukey, which is incorporated herein by reference. This system and method are also disclosed in the above-incorporated application Ser. No. 07/790,316, now U.S. Pat. No. 5,442,778.
In the Scatter/Gather method, attention is always directed toward a focus set of documents potentially interesting to a user. Initially the focus set may be an entire document collection. The documents in the focus set are clustered into a small number of topic-coherent subsets, or clusters, of documents. The terms "clustering" and "scattering" are used synonymously; thus it may be said that the documents in the focus set are scattered into the clusters.
In Scatter/Gather, cluster summaries, a table of contents outlining the documents of the focus set, are developed and presented to the user, for example on a computer display screen. The user then identifies and selects clusters that appear most interesting. The selected clusters define a new, smaller focus set that is the union of the selected clusters. The process is repeated a desired number of times until the user wishes to access documents individually or employ a query-based search method.
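The iteration described above can be sketched as a short loop. This is only an illustrative sketch, not the implementation of the incorporated application: the function names (`scatter_gather`, `toy_scatter`, `toy_summarize`), the stopping threshold, and the toy rule that clusters documents by their first word are all assumptions introduced here.

```python
from typing import Callable, List, Sequence

Doc = str
Cluster = List[Doc]

def scatter_gather(
    collection: Sequence[Doc],
    scatter: Callable[[List[Doc]], List[Cluster]],
    summarize: Callable[[Cluster], str],
    select: Callable[[List[str]], List[int]],
    small_enough: int = 2,
) -> List[Doc]:
    """Repeat scatter -> summarize -> select -> gather until the focus set is small."""
    focus = list(collection)  # initially the entire collection
    while len(focus) > small_enough:
        clusters = scatter(focus)                      # scatter the focus set
        summaries = [summarize(c) for c in clusters]   # outline shown to the user
        chosen = select(summaries)                     # indices of selected clusters
        # Gather: the new focus set is the union of the selected clusters.
        new_focus = [doc for i in chosen for doc in clusters[i]]
        if not new_focus or len(new_focus) == len(focus):
            break  # nothing selected, or no further narrowing is possible
        focus = new_focus
    return focus

# Hypothetical stand-ins: cluster by first word, summarize by that word.
def toy_scatter(docs: List[Doc]) -> List[Cluster]:
    groups = {}
    for d in docs:
        groups.setdefault(d.split()[0], []).append(d)
    return list(groups.values())

def toy_summarize(cluster: Cluster) -> str:
    return cluster[0].split()[0]

docs = [
    "iraq invades kuwait",
    "iraq oil embargo",
    "germany reunification talks",
    "sports baseball scores",
    "iraq hostages kuwait",
]
# Simulated user who always selects the cluster summarized as "iraq".
result = scatter_gather(
    docs, toy_scatter, toy_summarize,
    select=lambda summaries: [i for i, s in enumerate(summaries) if s == "iraq"],
)
```

In an interactive system, `select` would be replaced by the user's choice from the displayed summaries, and `scatter` by an actual clustering procedure.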
Cluster summaries comprise suggestive text determined automatically from documents in each cluster. Each cluster summary includes two types of information: a list of topical words occurring most often in the documents of the cluster, and the titles of a few typical documents in the cluster. The summaries are based on cluster profiles, which reflect words appearing in documents in the cluster.
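A cluster summary of this kind might be computed along the following lines. The sketch assumes a cluster is a list of (title, text) pairs and that the profile is a plain word-count table. The stopword list and the rule for choosing "typical" documents (those whose words carry the most weight in the profile) are assumptions made here for illustration, not details taken from the Scatter/Gather system.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "as"}

def words(text: str) -> List[str]:
    return [w for w in text.lower().split() if w not in STOPWORDS]

def cluster_summary(
    cluster: List[Tuple[str, str]], n_words: int = 2, n_titles: int = 2
) -> Tuple[List[str], List[str]]:
    # Cluster profile: counts of words appearing in the cluster's documents.
    profile = Counter()
    for _, text in cluster:
        profile.update(words(text))

    # Topical words: those occurring most often in the cluster.
    topical = [w for w, _ in profile.most_common(n_words)]

    # Typical documents (assumed rule): highest total profile weight
    # over the distinct words each document contains.
    def weight(doc: Tuple[str, str]) -> int:
        return sum(profile[w] for w in set(words(doc[1])))

    typical = sorted(cluster, key=weight, reverse=True)[:n_titles]
    return topical, [title for title, _ in typical]

cluster = [
    ("Oil prices surge", "oil prices surge as oil markets react"),
    ("Embargo on oil", "oil embargo hits markets"),
    ("Local weather", "rain expected"),
]
topical_words, typical_titles = cluster_summary(cluster)
```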
Scatter/Gather is not necessarily a stand-alone information access tool, but can be used in tandem with search methods such as Boolean search or similarity search. Illustrative is an analogy to paper copies of reference books, which offer two access modes: a table of contents in the front for browsing, and an index in the back for more directed searches. Scatter/Gather is not necessarily used to find particular documents, but instead, by giving exposure to the vocabulary presented in cluster summaries, helps guide complementary search methods. For example, a cluster profile may be used as the query in a similarity search against the entire collection. Conversely, Scatter/Gather can be used to organize the results of word-based queries that retrieve too many documents.
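As a rough illustration of using a cluster profile as a similarity query, the sketch below ranks documents by cosine similarity between the profile and each document's word counts. Cosine similarity over raw counts is an assumption chosen here for simplicity; the text does not specify the similarity measure, and the names `cosine` and `rank_by_profile` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_profile(profile: Counter, collection: List[Tuple[str, str]]) -> List[str]:
    """Return titles of documents sharing words with the profile, best match first."""
    scored = [
        (cosine(profile, Counter(text.lower().split())), title)
        for title, text in collection
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [title for score, title in ranked if score > 0]

# Hypothetical profile and collection of (title, text) pairs.
profile = Counter({"oil": 3, "markets": 2})
collection = [
    ("A", "oil markets fall"),
    ("B", "baseball scores"),
    ("C", "oil"),
]
ranking = rank_by_profile(profile, collection)
```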
An example of Scatter/Gather will now be described. FIG. 4 represents a Scatter/Gather session over a text collection of about 5,000 articles posted to the New York Times News Service in August, 1990. Single-word labels instead of actual cluster summaries are presented, to simplify the figure.
Suppose a user's information need is to determine generally what happened in August 1990. It would be difficult to construct a word-based query effectively representing this information need, because no specific topic description exists. The user might consider general topics, such as "international events," but that topic description would not be effective because articles concerning international events typically do not use those words.
With Scatter/Gather, rather than being forced to provide certain terms, a user is presented with a set of cluster summaries--an outline of the collection. The user need select only those clusters that seem potentially relevant to the topic of interest. In FIG. 4, the major stories of the month are immediately obvious from the initial scattering: Iraq invades Kuwait, and Germany considers reunification. This leads a user to focus on international issues, selecting the "Iraq," "Germany," and "Oil" clusters. These three clusters are gathered together to form a smaller focus set.
This smaller focus set is then reclustered on the fly, or scattered, to produce eight new clusters covering the reduced collection. Because the reduced collection contains a subset of the articles, these new clusters reveal a finer level of detail than the original eight. The articles on the Iraqi invasion and some of the oil articles have now been separated into clusters discussing the U.S. military deployment, the effects of the invasion upon the oil market, and hostages in Kuwait.
Suppose the user adequately understands these major stories, but wishes to discover what happened in other parts of the world. The user therefore selects the "Pakistan" cluster, which also contains other foreign political stories, and a cluster containing articles about Africa. This reveals a number of specific international situations as well as a small collection of miscellaneous international articles. The user thus learns of a coup in Pakistan and about hostages being taken in Trinidad, stories otherwise lost among the more prominent stories of that month.
A further illustration of Scatter/Gather in operation appears in FIG. 3. Text collection (or focus set) 20 is an online version of Grolier's encyclopedia (roughly 64 Megabytes of ASCII text) with each of the twenty-seven thousand articles treated as a separate document. Suppose the user is interested in investigating the role of women in the exploration of space. Rather than attempting to express this information need as a formal query, the user instead selects a number of top-level clusters referenced as 22A-I that, from their description, seem relevant to the topic of interest. In this case, the user selects the clusters 22A, 22C and 22H labeled "military history," "science and industry," and "American society" to form a reduced corpus (or focus set) 24 of the indicated subset of articles from Grolier's. (Note that the cluster labels are idealized in this illustration; the actual implementation produces cluster descriptions that are longer than would fit conveniently in this figure. The given labels, however, are reasonable glosses of topics described by actual cluster summaries.)
The reduced corpus is then reclustered on the fly to produce a new set of clusters 26A-J covering the reduced corpus 24. Since the reduced corpus contains a subset of the articles in Grolier's, these new clusters are at a finer level of detail than the top-level clusters. The user again selects clusters of interest. In this case, these include clusters 26E, 26G and 26H labeled "aviation," "engineering," and "physics." Again, a further reduced corpus 28 is formed and reclustered. The final set of clusters 30A-F includes clusters labeled "military aviation," "Apollo program," "aerospace industry," "weather," "astronomy" and "civil aviation." At this stage the clusters are small enough for direct perusal via an exhaustive list of article titles. Assuming at least one article of interest is found, the user may find more articles of a similar nature in the same cluster, or may use a directed search method, based on the vocabulary of the located article or of the cluster description, to find additional articles.
Previous work in document clustering generally concentrated on procedures with running times that are quadratic relative to the collection size, for example, the classic SLINK single-linkage clustering procedure (see Sibson, R., "SLINK: An Optimally Efficient Algorithm for the Single Link Cluster Method," Computer Journal, 16:30-34, 1973). Quadratic running time is too slow for interactive manipulation of large collections containing thousands of documents--days or even months may be required to perform a single clustering.
Linear procedures, such as those described in application Ser. No. 07/790,316, now U.S. Pat. No. 5,442,778, reduce the time required to only a few minutes, fast enough for searching moderately large collections and the results of broad word-based queries. (A rate of approximately 3000 documents per minute may be achieved on a Sun Microsystems SPARCSTATION 2 using Scatter/Gather.) Even linear-time clustering, however, is too slow to support interactive browsing of very large document collections. This is particularly apparent when one considers applying Scatter/Gather to the TIPSTER collection, a DARPA standard for text retrieval evaluation containing about 750,000 documents (see Harmon, D., "The TIPSTER Evaluation Corpus," CDROM Disks of Computer Readable Text, 1992, available from the Linguistic Data Consortium). At 3000 documents per minute, a single scattering of this collection requires around four hours--far too long to be considered interactive.
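The timing figures above follow from simple arithmetic, sketched below. The rate of 3000 documents per minute and the collection sizes (about 750,000 documents for TIPSTER, about 5,000 articles for the news collection of FIG. 4) are taken from the text. The assumption that time scales exactly linearly with collection size, and the function name `scatter_minutes`, are simplifications introduced here.

```python
def scatter_minutes(n_docs: int, docs_per_minute: float = 3000.0) -> float:
    """Estimated time for one scattering, assuming strictly linear scaling."""
    return n_docs / docs_per_minute

tipster_minutes = scatter_minutes(750_000)   # 250.0 minutes
tipster_hours = tipster_minutes / 60         # a little over 4 hours
news_minutes = scatter_minutes(5_000)        # under 2 minutes

# Growth in collection size from the news example to TIPSTER.
growth = 750_000 / 5_000                     # 150.0
linear_factor = growth                       # linear procedure: 150x more time
quadratic_factor = growth ** 2               # quadratic procedure: 22500x more time
```

The same 150-fold increase in collection size would multiply the cost of a quadratic procedure by 22,500, which illustrates why quadratic methods are impractical at this scale and why even a linear method falls short of interactive speed.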