Methods of searching or browsing a corpus of documents that involve repeated choice between a number of alternatives--each set of possible choices contained within that alternative selected at the previous stage of choice--suffer from a common difficulty. Once an incorrect choice is made, there is no way to recover. Wrong choices will be most frequent when the subset of documents being sought lies close to a boundary between one choice and another. The appropriate remedy to this problem is to arrange the alternatives so that a choice made near such a boundary need not be incorrect.
Since the choice is among bundles of documents, it is both convenient and suitable to refer to those bundles as clusters. Each cluster in the first stage of choices will be comprised of the documents belonging to the set of second stage clusters that correspond to it. And each second stage cluster will comprise corresponding third stage clusters, the subdivision continuing until the nth stage clusters are small enough to allow attention to individual documents.
Such procedures of stagewise choice have been used most frequently when access is based, interactively, on the user's judgment. Limitations of display methods and/or the user's short-term memory make it infeasible to go at once to the many last-stage clusters. The difficulty arising from mistaken choices when what is sought falls near a division between clusters is often addressed by allowing the user to choose two or more clusters in indecisive situations. This leads to the proliferation of paths unless, as illustrated by the scatter-gather method taught in U.S. Pat. No. 5,442,778 to Pedersen et al., the clustering is always done "on the fly" at each stage of choice. This ameliorates the difficulty near the margins, but forces an increase in the number of stages because of repeated doublings. The present invention attacks the previously noted difficulty more efficiently by planning for overlap at the margins--so that every cluster is moderately larger than a cluster from a corresponding set of disjoint (i.e., non-overlapping) clusters would be.
Stage-by-stage choice has not been commonly used in search methods that rely on a noninteractive specification of a query which is compared with whatever clusters are relevant. The costs due to taking account of the marginality problem have outweighed the reduced computational load that would be associated with a stage-by-stage approach. As a result, query-based systems usually rely on comparisons of the query with either every smallest cluster or even, most extremely, with each document. Either approach clearly avoids the marginality problem, but at the cost of much more extensive computation. Here again overlapping clusters, where marginal cases belong to two or more clusters at each specific stage, can reduce the marginality problem while preserving most of the computational savings.
Accordingly, the present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapping) clustering operations. Document clustering has been extensively investigated for improving document search and retrieval methods. In general, clustering relies on the fact that mutually similar documents will tend to be relevant to the same queries; hence, automatic determination of clusters (sets) of such documents can improve recall by effectively broadening a search request. Typically a fixed corpus of documents is clustered either into an exhaustive partition, i.e., a set of disjoint clusters, or into a hierarchical tree structure. In the case of a partition, queries are matched against clusters, and the contents of some number of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result.
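By way of illustration only, the two conventional search modes just described may be sketched in Python as follows. The Node structure, the toy score function (a count of shared terms), and the max_docs stopping condition are assumptions made for this example; they are not taken from any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    terms: set   # hypothetical cluster profile: terms summarizing the cluster
    docs: list   # identifiers of the documents covered by this cluster
    children: list = field(default_factory=list)

def score(query, node):
    """Toy score (an assumption): number of query terms in the cluster profile."""
    return len(query & node.terms)

def search_partition(query, clusters, k=1):
    """Partition case: return the contents of the k best-scoring clusters, best first."""
    ranked = sorted(clusters, key=lambda c: score(query, c), reverse=True)
    return [doc for c in ranked[:k] for doc in c.docs]

def search_hierarchy(query, root, max_docs=2):
    """Hierarchy case: descend, always taking the highest-scoring branch,
    until the subtree is a leaf or holds at most max_docs documents."""
    node = root
    while node.children and len(node.docs) > max_docs:
        node = max(node.children, key=lambda c: score(query, c))
    return node.docs
```

Because search_hierarchy never backtracks, a single wrong branch taken near a cluster boundary cannot be recovered from, which is the marginality problem discussed above.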
Hybrid strategies are also available, which are essentially variations of near-neighbor searching, where nearness is defined in terms of the pairwise document similarity measure used for clustering. Indeed, cluster search techniques are typically compared to similarity search, a direct near-neighbor search, and are evaluated in terms of precision and recall, as described by G. Salton and M. J. McGill in "Introduction to Modern Information Retrieval," McGraw-Hill, 1983. Also noted is G. Salton's "Automatic Text Processing," Addison-Wesley, 1989.
In order to cluster documents, it is necessary to first establish a pairwise measure of document similarity and then define a method for using that measure to form sets of similar documents, or clusters. Numerous document similarity measures have been proposed, all of which consider the degree of word overlap between the two documents of interest, described as sets of words, often with frequency information. These sets are typically represented as sparse vectors of length equal to the number of unique words (or types) in the corpus. If a word occurs in a document, its location in this vector is occupied by some positive value (one if only presence/absence information is considered, or some function of its frequency within that document if frequency is considered). If a word does not occur in a document, its location in this vector is occupied by zero. A popular similarity measure, the cosine measure, determines the cosine of the angle between two sparse vectors. If both document vectors are normalized to unit length, this is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. It is suggested that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering procedure. Accordingly, the present invention focuses on the method by which clusters are generated and does not rely on a particular similarity measure. Words are often replaced by terms, in which gentle stemming has combined words differing only by simple suffixes, and words on a stop list are omitted.
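By way of illustration only, the similarity measures named above may be sketched in Python as follows. The function names and the whitespace tokenization are assumptions made for this example, and stemming and stop-list removal are omitted for brevity.

```python
import math
from collections import Counter

def term_vector(text):
    """Sparse term-frequency vector: term -> count; absent terms are implicitly zero."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def dice(u, v):
    """Dice coefficient on presence/absence information: 2|A & B| / (|A| + |B|)."""
    a, b = set(u), set(v)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def jaccard(u, v):
    """Jaccard coefficient on presence/absence information: |A & B| / |A | B|."""
    a, b = set(u), set(v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, term_vector("cluster search cluster") and term_vector("cluster browse") share one of three distinct terms, so their Jaccard coefficient is 1/3 and their Dice coefficient is 1/2, while the cosine measure also reflects the repeated occurrence of "cluster".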
Standard hierarchical document clustering techniques employ a document similarity measure and consider the similarities of all pairs of documents in a given corpus. Typically, the most similar pair is fused and the process iterated, after suitably extending the similarity measure to operate on agglomerations of documents as well as individual documents. The final output is a binary tree structure that records the nested sequence of pairwise joins. Traditionally, the resulting trees have been used to improve the efficiency of standard Boolean or relevance searches by grouping together similar documents for rapid access. The resulting trees have also led to the notion of cluster search, in which a query is matched directly against nodes in the cluster tree and the best matching subtree is returned. Counting all pairs, the cost of constructing the cluster trees can be no less than proportional to N², where N is the number of documents in the corpus. Although cluster searching has shown some promising results, the method tends to favor the most computationally expensive similarity measures and seldom yields greatly increased performance over other standard methods.
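By way of illustration only, the fuse-and-iterate procedure described above may be sketched in Python as follows. The use of the Jaccard coefficient, and the extension of the measure to agglomerations by pooling their terms, are assumptions made for this example; the input is assumed to be non-empty.

```python
from itertools import combinations

def jaccard(a, b):
    """Normalized term overlap between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(docs):
    """docs: non-empty dict mapping a document name to its set of terms.
    Returns nested tuples recording the sequence of pairwise joins."""
    # Each active cluster is (tree, pooled_terms). Pooling terms is one simple
    # (assumed) way to extend the similarity measure to agglomerations.
    clusters = [(name, set(terms)) for name, terms in docs.items()]
    while len(clusters) > 1:
        # Examine every pair of active clusters and pick the most similar one.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]))
        (tree_i, terms_i), (tree_j, terms_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), terms_i | terms_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

Even the first pass of this naive version examines all N(N-1)/2 pairs, which is the source of the quadratic lower bound noted above; rescanning every pair after each join makes the sketch costlier still.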
One stage methods are intrinsically quadratic in the number of documents to be clustered, because all pairs of similarities must be considered. This sharply limits their usefulness, even given procedures that attain this theoretical upper bound on performance. Partitional strategies (those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions) by contrast are typically rectangular in the size of the partition and the number of documents to be clustered; that is, their cost is proportional to the product of the two. Generally, these procedures proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement, the procedure can be iterated with, at each stage, a hopefully improved selection of cluster seeds. However, to be useful for cluster search the partition must be fairly fine, since it is desirable for each set to contain only a few documents. For example, a partition can be generated whose size is related to the number of unique words in the document collection. Accordingly, the potential benefits of a partitional strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason partitional strategies have not been aggressively pursued by the information retrieval community.
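By way of illustration only, a single assignment pass of such a partitional procedure may be sketched in Python as follows. The function name assign_to_seeds, the choice of existing documents as seeds, and the use of the Jaccard coefficient are assumptions made for this example; the iterative refinement of the seeds is not shown.

```python
def jaccard(a, b):
    """Normalized term overlap between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_to_seeds(docs, seeds):
    """docs: dict mapping a document name to its set of terms.
    seeds: names of the documents chosen as seeds, one per desired cluster.
    Each document is assigned to exactly one seed, yielding a disjoint partition."""
    partition = {s: [] for s in seeds}
    for name, terms in docs.items():
        # k similarity computations per document, hence k * N in total.
        best = max(seeds, key=lambda s: jaccard(terms, docs[s]))
        partition[best].append(name)
    return partition
```

With k seeds and N documents the pass performs k * N comparisons, which is the "rectangular" cost referred to above. The saving over the quadratic case shrinks as k grows toward N, as it must when each set is to contain only a few documents.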
The standard cluster search presumes a query, the user's expression of an information need. The task is then to search the collection for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query, or where the results of the query are voluminous. One merely has to consider an exemplary search on the Internet, and the potential for voluminous results, to gain an immediate appreciation for clustering-browsing functionality. As another example, the user may not be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to gain an appreciation for the general information content of the collection. It seems appropriate to describe this as browsing, since it is at one extreme of a spectrum of possible information access situations, including open-ended questions with a variety of possible answers.
In proposing an alternative application for clustering in information access, the present invention draws on the access methods typically provided with a conventional textbook. If one has a specific question in mind, and specific terms which define that question, one consults an index, which directs one to passages of interest, keyed by search words. However, if one is simply interested in gaining an overview, one can turn to the table of contents, which lays out the logical structure of the text for perusal. The table of contents gives one a sense of the types of questions that might be answered if a more intensive examination of the text were attempted, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents and searching the index or, more importantly, use an iterative combination of both.
Heretofore, publications have disclosed clustering techniques, the relevant portions of which may be briefly summarized as follows:
U.S. Pat. No. 5,442,778 to Pedersen et al., issued Aug. 15, 1995, for a "Scatter-Gather: A Cluster-Based Method and Apparatus for Browsing Large Document Collections," hereby incorporated by reference for its teachings, discloses a document clustering-based browsing procedure for a corpus of documents. The methods described for partitional clustering include a Buckshot method and a Fractionation method, both of which may be employed to produce input for a cluster digest method for determining a summary of the ordering of a corpus of documents in the Scatter-Gather technique.
"Recent trends in hierarchic document clustering: A critical review" by Peter Willett, Information Processing & Management, Vol. 24, No. 5, pages 577-97 (1988--printed in Great Britain), describes the calculation of interdocument similarities and clustering methods that are appropriate for document clustering.
"Understanding Multi-Articled Documents" by Tsujimoto et al., presented in June 1990 in Atlantic City, N.J. at the 10th International Conference on Pattern Recognition, describes an attempt to build a method to understand document layouts without the assistance of character recognition results, i.e., the meaning of contents.
P. Willett, in "Document Clustering Using an Inverted File Approach," Journal of Information Science, Vol. 2 (1980), pp. 223-31, teaches a method for generating overlapping document clusters.
As will be appreciated, various information access techniques use subdivision of the initial corpus, or one of its subcorpora, into clusters--often with the purpose of seeking the user's aid in selecting one or more clusters to serve as a subcorpus for a subsequent iterative stage. Conventionally, these clusters are selected so that (a) their union covers the whole of the initial corpus, and (b) the individual clusters are disjoint (non-overlapping). Unfortunately, disjoint clusters have practical disadvantages when the document that is sought falls near, or even across, a cluster boundary, so that at least two parallel clusters must be selected to avoid the loss of the document. The present invention, however, avoids the need for such parallelism and allows the user access to clusters that overlap so as to make choosing a single cluster both natural and efficient.
In accordance with the present invention, there is provided a method, operating in a digital computer, for searching a corpus of documents, comprising the steps of: preparing an initial structuring of the corpus into a plurality of overlapping clusters, wherein at least two of the plurality of overlapping clusters contain a single document; and determining a summary of the plurality of clusters prepared by said initial structuring of the corpus.
In accordance with another aspect of the present invention, there is provided a document browsing system for use with a corpus of documents stored in a computer system, the document browsing system comprising: program memory for storing executable program code therein; a processor, operating in response to the executable program stored in said program memory, for automatically preparing a structuring of the corpus of documents into a plurality of document clusters, wherein at least two of the plurality of document clusters overlap and contain at least one common document therebetween; data memory for storing data identifying the documents associated with each of the plurality of document clusters; processor summarizing the plurality of document clusters and generating summary data for said document clusters; and a user interface for displaying the summary data.
To provide the flexibility required to deal with a user's nonspecific requirements, a browsing system usually requires means for broadening the working corpus as well as narrowing it. This invention preferably concerns the narrowing aspect, and its description tacitly assumes the existence of broadening operations.
In accordance with yet another aspect of the present invention, there is provided a document search and retrieval method, operating in a digital computer, for searching a corpus of documents, comprising the steps of: identifying, in response to at least one search term, a sub-corpus of documents containing the at least one user specified search term; preparing an initial structuring of the sub-corpus into a plurality of overlapping clusters, wherein at least two of the plurality of overlapping clusters contain a single document; and determining a summary of the plurality of overlapping clusters prepared by said initial structuring of the sub-corpus.
In accordance with a further aspect of the present invention, there is provided a document searching system for use with a corpus of documents stored in a computer system, the document searching system comprising: program memory for storing executable program code therein; a processor, operating in response to the executable program stored in said program memory, for automatically preparing a structuring of the corpus of documents into a plurality of document clusters, wherein at least two of the plurality of document clusters overlap and contain at least one common document therebetween; data memory for storing data identifying the documents associated with each of the plurality of document clusters; memory access means for accessing the data memory and said processor summarizing the plurality of document clusters and generating summary data for said document clusters; and a user interface for displaying the summary data.
In accordance with yet another aspect of the present invention, there is provided a method, operating in a digital computer, for searching a corpus of documents, comprising the steps of: subdividing a corpus of documents into a hierarchical structure containing a plurality of levels of clusters, wherein at least two of the clusters on a particular level are overlapping clusters containing at least a single document in common; selecting, from the hierarchical structure, a plurality of clusters to form a subcorpus, wherein the subcorpus contains fewer documents than the corpus; and identifying, in response to a search query, those documents in the subcorpus providing a positive response to the search query.
One aspect of the invention is based on the observation of problems with conventional document search and retrieval techniques--disjoint clustering--where a user can select only one cluster in order to obtain a particular document.
This aspect is based on the discovery of a technique that alleviates these problems by allowing documents within the corpus to be associated with a plurality of clusters, where such a technique would be characterized as having overlapping clusters. This technique can be implemented, for example, by clustering related documents into non-disjoint clusters. Here documents only moderately related to a particular attractor, or cluster vector, will also be moderately related to another attractor and will, therefore, be associated with both attractors (overlap). Thus, it is believed that this aspect of the invention not only favors recall, but may ultimately favor precision as well. Precision is favored because the present invention allows the user to initially review a broader range of documents and to subsequently focus on documents belonging only to a single inner cluster and to no other clusters.
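By way of illustration only, one possible way of forming such non-disjoint clusters may be sketched in Python as follows. The function name overlapping_assign, the representation of each attractor as a set of terms, the Jaccard coefficient, and the ratio threshold are all assumptions made for this example and are not asserted to be the procedure of the invention.

```python
def jaccard(a, b):
    """Normalized term overlap between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlapping_assign(docs, attractors, ratio=0.8):
    """docs, attractors: dicts mapping a name to a set of terms.
    A document joins its best-matching attractor and also every other
    attractor whose similarity is at least ratio times the best one.
    Documents with no similarity to any attractor are left unassigned here."""
    clusters = {name: [] for name in attractors}
    for doc, terms in docs.items():
        scores = {name: jaccard(terms, a) for name, a in attractors.items()}
        best = max(scores.values())
        for name, s in scores.items():
            if best > 0 and s >= ratio * best:
                clusters[name].append(doc)
    return clusters
```

A document that is about equally related to two attractors is placed in both clusters, so that choosing either cluster alone still reaches it. Setting the hypothetical ratio parameter above 1.0 permits no overlap, leaving only exact ties for the best score shared between clusters.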
A processor or computing machine implementing the invention can include a monitor or display to assist the user in the visualization of the clustering operation so as to allow "browsing" of the corpus in an orderly fashion. Such a display preferably shows the results of a query in a clustered format to enable the user to iteratively review documents within a corpus that relate to a desired topic.