1. Field of the Invention
This invention relates to a method and apparatus for almost-constant-time clustering of electronic documents. In particular, this invention is directed to partitioning a large corpus of electronic documents into a much smaller set of clusters in nearly constant time.
2. Description of Related Art
Document browsing is a powerful tool used in accessing large text collections. Browsing, which can be distinguished from searching because browsing is query-free, works well for information needs either too general or too vague to be usefully expressed as a query in some search language. For example, a user may be unfamiliar with vocabulary appropriate for describing a topic of interest, or may not wish to commit to a particular choice of words. Indeed, a user may not be looking for anything specific at all, but instead may wish to explore the general information content of the collection.
An information access system is helpful in this context. The information access system includes a navigable collection outline that both suggests the collection's contents and allows a user to focus attention on some topic-coherent subset of the contents. Such browsing systems are described, for example, in U.S. Pat. Nos. 5,442,778 to Pedersen et al. (Scatter/Gather) and 5,483,650 to Pedersen et al., each incorporated herein by reference in its entirety.
In Scatter/Gather, attention is always directed toward a focus set of documents potentially interesting to a user. Initially, the focus set may be an entire document collection. The documents in the focus set are clustered into a small number of topic-coherent subsets, or clusters, of documents. The terms "clustering" and "scattering" are used synonymously; thus it may be said that the documents in the focus set are scattered into the clusters.
In Scatter/Gather, cluster summaries are developed and presented to the user. The cluster summaries are usually tables of contents outlining the documents of the focus set. Cluster summaries include suggestive text determined automatically from the documents in each cluster. Each cluster summary includes two types of information: a list of topical words occurring most often in the documents of the cluster; and the titles of a few typical documents in the cluster. The summaries are based on cluster profiles, which reflect words appearing in documents in the cluster.
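The two parts of a cluster summary described above can be illustrated with a minimal Python sketch. It is not the method of the referenced patents. The names `word_counts`, `cosine` and `summarize_cluster` are hypothetical, the cluster profile is assumed to be a simple sum of word counts, and a "typical" document is assumed to be one whose word counts are most similar (by cosine similarity) to that profile.

```python
from collections import Counter
import math

def word_counts(text):
    """Count lowercase, whitespace-separated words (an assumed, very crude tokenizer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def summarize_cluster(docs, num_words=3, num_titles=1):
    """docs is a list of (title, text) pairs belonging to one cluster.

    Returns (topical_words, typical_titles).
    """
    # Cluster profile: assumed here to be the summed word counts of the cluster.
    profile = Counter()
    for _, text in docs:
        profile.update(word_counts(text))
    # Topical words: the words occurring most often in the cluster.
    topical = [word for word, _ in profile.most_common(num_words)]
    # Typical documents: assumed to be those closest to the profile.
    ranked = sorted(docs,
                    key=lambda d: cosine(word_counts(d[1]), profile),
                    reverse=True)
    titles = [title for title, _ in ranked[:num_titles]]
    return topical, titles
```

A practical system would also filter out common function words before picking topical words; that step is omitted here for brevity.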
The user then identifies and selects the clusters that appear most interesting. The selected clusters are gathered together to define a new, smaller focus set. That is, the new focus set is the union of the documents in the selected clusters. The process is repeated a desired number of times until the user wishes to access documents individually or to employ a query-based search method.
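The scatter, select and gather cycle described above can be sketched as a simple loop. This is an illustrative outline only: the function names are hypothetical, the clustering procedure (`scatter`) and the user's selection (`choose`) are passed in as callables rather than implemented, and the `browse_size` threshold is an assumed stand-in for the point at which the user prefers to read documents individually.

```python
def gather(clusters, selected):
    """Return the union of the documents in the selected clusters.

    Order is preserved and duplicates are dropped; documents must be hashable.
    """
    focus, seen = [], set()
    for index in selected:
        for doc in clusters[index]:
            if doc not in seen:
                seen.add(doc)
                focus.append(doc)
    return focus

def scatter_gather(collection, scatter, choose, browse_size=5, max_rounds=10):
    """Repeat scatter and gather until the focus set is small enough to browse.

    scatter(focus) -> list of clusters (each a list of documents)
    choose(clusters) -> indices of the clusters the user selects
    """
    focus = list(collection)  # initially, the focus set is the whole collection
    for _ in range(max_rounds):
        if len(focus) <= browse_size:
            break  # small enough for direct perusal
        clusters = scatter(focus)
        selected = choose(clusters)
        if not selected:
            break  # the user stops browsing
        focus = gather(clusters, selected)
    return focus
```

For a quick trial, `scatter` can be a toy function that splits the focus set in half and `choose` a function that always picks the first cluster; a real system would substitute a document clustering procedure and an interactive selection step.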
Scatter/Gather is not necessarily a stand-alone information access tool. Rather, Scatter/Gather can be used in tandem with search methods such as boolean searching or similarity searching. An illustrative analogy is reference books, which offer two access modes: a table of contents in the front for browsing, and an index in the back for more directed searches. Scatter/Gather is not necessarily used to find particular documents. Rather, by giving exposure to the vocabulary presented in cluster summaries, Scatter/Gather helps guide complementary search methods. For example, a cluster profile may be used in a similarity search as a query against the entire collection. Conversely, Scatter/Gather can be used to organize the results of word-based queries that retrieve too many documents.
FIG. 9 represents a Scatter/Gather process applied to a text collection of about 5,000 articles posted to the New York Times News Service in August 1990. In FIG. 9, single-word labels are presented instead of actual cluster summaries to illustrate the Scatter/Gather process more simply.
In the example shown in FIG. 9, a user's information need is to determine generally what happened in August 1990. It would be difficult to construct a word-based query effectively representing this information need, because no specific topic description exists. The user might consider general topics, such as "international events," but that topic description would not be effective because articles concerning international events typically do not use those words.
With Scatter/Gather, rather than being forced to provide certain terms, a user is presented with a set of cluster summaries--an outline of the collection. The user need only select those clusters that seem potentially relevant to the topic of interest. In the Scatter/Gather process shown in FIG. 9, the major stories of the month are immediately obvious from the initial scattering: Iraq invades Kuwait, and Germany considers reunification. This leads a user to focus on international issues, selecting the "Iraq," "Germany," and "Oil" clusters. These three clusters are gathered together to form a smaller focus set.
This smaller focus set is then clustered or scattered to produce eight new clusters covering the reduced collection. Because the reduced collection contains only a subset of the articles, these new clusters reveal a finer level of detail than the original eight clusters. The articles on the Iraqi invasion and some of the oil articles have now been separated into clusters discussing the U.S. military deployment, the effects of the invasion upon the oil market, and hostages in Kuwait.
If the user adequately understands these major stories, but wishes to discover what happened in other parts of the world, the user can, for example, select the "Pakistan" cluster--which also contains other foreign political stories--and a cluster containing articles about Africa. Scattering these clusters reveals a number of specific international situations as well as a small collection of miscellaneous international articles. The user thus learns of a coup in Pakistan and about hostages being taken in Trinidad, stories otherwise lost among the more major stories of that month.
FIG. 10 shows a further illustration of Scatter/Gather in operation. In the example shown in FIG. 10, the text collection (or focus set) 20 is an online version of Grolier's encyclopedia. Each of the twenty-seven thousand articles in the focus set is treated as a separate document. In the example shown in FIG. 10, the user is interested in investigating the role of women in the exploration of space. Rather than attempting to express this information need as a formal query, the user is instead presented with a number of top-level clusters 22A-22I. The user then selects those clusters that, from their descriptions, seem relevant to the topic of interest: the MILITARY HISTORY cluster 22A, the SCIENCE AND INDUSTRY cluster 22C and the AMERICAN SOCIETY cluster 22H. Together these form a reduced corpus (or focus set) 24 of the indicated subset of articles from Grolier's.
The reduced corpus is then reclustered on the fly to produce a new set of clusters 26A-26J covering the reduced corpus 24. Since the reduced corpus contains a subset of the articles in Grolier's, these new clusters are at a finer level of detail than the top-level clusters 22A-22I. The user again selects clusters of interest. In this case, these include the AVIATION cluster 26E, the ENGINEERING cluster 26G and the PHYSICS cluster 26H. Again, a further reduced corpus 28 is formed and reclustered. The final set of clusters 30A-30F includes a MILITARY AVIATION cluster 30A, an APOLLO PROGRAM cluster 30B, an AEROSPACE INDUSTRY cluster 30C, a WEATHER cluster 30D, an ASTRONOMY cluster 30E and a CIVIL AVIATION cluster 30F. At this stage, the clusters are small enough for direct perusal via an exhaustive list of article titles. Assuming at least one article of interest is found, the user may find more articles of a similar nature in the same cluster or may use a directed search method, possibly based on the vocabulary of the located article or of the cluster description, to find additional articles.
Previous work in document clustering includes linear-time procedures, such as those described in Scatter/Gather and the '650 patent, which reduce the time required for clustering to only a few minutes. This is fast enough to search moderately large collections using broad word-based queries. For example, a rate of approximately 3000 documents per minute may be achieved on a Sun Microsystems SPARCSTATION 2 using Scatter/Gather. Even linear-time clustering, however, is too slow to support interactive browsing of very large document collections. This is particularly apparent when one considers applying Scatter/Gather to the TIPSTER collection, a DARPA standard for text retrieval evaluation containing about 750,000 documents. At 3000 documents per minute, scattering this collection would take 250 minutes, or over four hours, which is far too long to be considered interactive. Thus, faster and more efficient ways to cluster documents must be found.
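The timing estimate above follows directly from the linear-time assumption, as the short calculation below shows. The function name `scatter_minutes` is hypothetical; the collection sizes and the clustering rate are the figures quoted in the text.

```python
def scatter_minutes(num_docs, docs_per_minute):
    """Time for one scattering pass, assuming time grows linearly with collection size."""
    return num_docs / docs_per_minute

RATE = 3000  # documents per minute, the rate quoted for Scatter/Gather

# New York Times example: about 5,000 articles take under two minutes.
nyt_minutes = scatter_minutes(5_000, RATE)

# TIPSTER: about 750,000 documents take 250 minutes.
tipster_minutes = scatter_minutes(750_000, RATE)
hours, minutes = divmod(tipster_minutes, 60)  # 4 hours and 10 minutes

print(f"NYT: {nyt_minutes:.1f} min; TIPSTER: {int(hours)} h {int(minutes)} min")
```

Because the cost is paid on every scatter step, a linear-time procedure that is acceptable for a few thousand documents becomes impractical for interactive use at this scale.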