The present invention relates to a document-clustering-based browsing procedure for a corpus of documents, which is applicable to all natural languages for which a lexical analysis capability exists.
Document clustering has been extensively investigated as a methodology for improving document search and retrieval. The general assumption is that mutually similar documents will tend to be relevant to the same queries and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request. Typically, a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure. In the case of a partition, queries are matched against clusters, and the contents of some number of the best-scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest-scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result.
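The two cluster search styles described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the dictionary layout (`profile`, `docs`, `children`), the word-overlap scoring function, and the `max_docs` stopping condition are all hypothetical choices made for the example, not details taken from any particular prior-art system.

```python
def overlap(query, profile):
    """Hypothetical score: number of words shared by the query and a cluster profile."""
    return len(set(query) & set(profile))

def search_partition(query, clusters, top=1):
    """Match the query against every cluster and return the contents of the
    `top` best-scoring clusters, ordered by descending score."""
    ranked = sorted(clusters, key=lambda c: overlap(query, c["profile"]), reverse=True)
    return [doc for c in ranked[:top] for doc in c["docs"]]

def search_hierarchy(query, node, max_docs=2):
    """Descend the tree, always taking the highest-scoring branch, until the
    current subtree is small enough (assumed stopping condition) or is a leaf."""
    while node.get("children") and len(node["docs"]) > max_docs:
        node = max(node["children"], key=lambda c: overlap(query, c["profile"]))
    return node["docs"]

# Toy data (invented for illustration only).
pets = {"profile": {"cat", "dog", "pet"}, "docs": ["d1", "d2"], "children": []}
finance = {"profile": {"stock", "bond", "market"}, "docs": ["d3"], "children": []}
root = {
    "profile": pets["profile"] | finance["profile"],
    "docs": ["d1", "d2", "d3"],
    "children": [pets, finance],
}

partition_hits = search_partition({"stock", "market"}, [pets, finance], top=1)
hierarchy_hits = search_hierarchy({"dog"}, root, max_docs=2)
```

In both functions the query is used only to pick clusters; the documents inside a chosen cluster are returned without being scored individually, which is how such a search broadens the request.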
Hybrid strategies are also available, which are essentially variations of near-neighbor search, where nearness is defined in terms of the pairwise document similarity measure used to generate the clustering. Indeed, cluster search techniques are typically compared to similarity search, a direct near-neighbor search, and are evaluated in terms of precision and recall. Various studies have indicated that cluster search strategies are not markedly superior to similarity search and, in some situations, can be inferior. It is therefore not surprising that cluster search, given its indifferent performance and the high computational cost of clustering large corpora, has not gained wide popularity.
Document clustering has also been studied as a method for accelerating similarity search, but the development of fast procedures for near-neighbor searching has decreased interest in that possibility.
In order to cluster documents, one must first establish a pairwise measure of document similarity and then define a method for using that measure to form sets of similar documents, or clusters. Numerous document similarity measures have been proposed, all of which consider the degree of word overlap between the two documents of interest, described as sets of words, often with frequency information. These sets are typically represented as sparse vectors of length equal to the number of unique words (or types) in the corpus. If a word occurs in a document, its location in this vector is occupied by some positive value (one if only presence/absence information is considered, or some function of its frequency within that document if frequency is considered). If a word does not occur in a document, its location in this vector is occupied by zero. A popular similarity measure, the cosine measure, determines the cosine of the angle between these two sparse vectors. If both document vectors are normalized to unit length, this is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. It has also been suggested that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering procedure.
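As a minimal sketch of the measures named above, the following represents each document as a sparse word-to-frequency mapping and computes the cosine measure together with the set-based Dice and Jaccard coefficients. The whitespace tokenizer and the function names are assumptions made for the example; a real system would use a proper lexical analyzer.

```python
import math
from collections import Counter

def term_vector(text):
    """Sparse term-frequency vector: word -> count. Absent words are implicitly zero.
    Whitespace tokenization is an assumption made to keep the sketch short."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dice(a, b):
    """Dice coefficient on presence/absence: 2|A ∩ B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def jaccard(a, b):
    """Jaccard coefficient on presence/absence: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

d1 = term_vector("the cat sat")
d2 = term_vector("the cat ran")
```

For these two toy documents, which share two of their three words, the cosine and Dice values are both 2/3 and the Jaccard value is 1/2.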
A wide range of clustering procedures has been applied to documents including, most prominently, single-linkage hierarchical clustering. Hierarchical clustering procedures proceed by iteratively considering all pairs of similarities, and fusing the pair which exhibits the greatest similarity. They differ in the procedure used to determine similarity when one member of the pair is a document group, i.e., the product of a previous fusion. Single-linkage clustering defines the similarity as the maximum similarity between any two individuals, one from each half of the pair. Alternative methods consider the minimum similarity (complete linkage), the average similarity (group average linkage), as well as other aggregate measures. Although single-linkage clustering is known to have an unfortunate chaining behavior, typically forming elongated, straggly clusters, it continues to be popular due to its simplicity and the availability of an optimal space/time procedure for its determination.
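The fusion loop and the difference between linkage rules can be illustrated with a deliberately naive sketch. It is not the optimal space/time single-linkage procedure referred to above; it recomputes every pairwise similarity on each pass. The function name `agglomerate`, the nested-tuple tree representation, and the one-dimensional toy data are assumptions made for the example.

```python
def agglomerate(items, sim, linkage=max):
    """Naive agglomerative clustering.

    linkage=max gives single linkage (most similar cross pair);
    linkage=min gives complete linkage (least similar cross pair).
    Returns a binary tree of item indices as nested tuples.
    """
    members = [(i,) for i in range(len(items))]  # item indices in each cluster
    trees = list(range(len(items)))              # merge history of each cluster
    while len(members) > 1:
        best = None
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                s = linkage(sim(items[a], items[b])
                            for a in members[i] for b in members[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # Fuse cluster j into cluster i (j > i, so deleting j keeps i valid).
        members[i] = members[i] + members[j]
        trees[i] = (trees[i], trees[j])
        del members[j]
        del trees[j]
    return trees[0]

# Toy one-dimensional "documents"; similarity is negative distance.
points = [0, 10, 21, 33, 100]
closeness = lambda a, b: -abs(a - b)

single_tree = agglomerate(points, closeness, linkage=max)
complete_tree = agglomerate(points, closeness, linkage=min)
```

On this toy data single linkage absorbs one point at a time into a growing chain, `((((0, 1), 2), 3), 4)`, while complete linkage first pairs neighbors and yields `(((0, 1), (2, 3)), 4)`, a small instance of the chaining behavior mentioned above.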
Standard hierarchical document clustering techniques employ a document similarity measure and consider the similarities of all pairs of documents in a given corpus. Typically, the most similar pair is fused and the process iterated, after suitably extending the similarity measure to operate on agglomerations of documents as well as individual documents. The final output is a binary tree structure that records the nested sequence of pairwise joins. Traditionally, the resulting trees have been used to improve the efficiency of standard boolean or relevance searches by grouping together similar documents for rapid access. The resulting trees have also led to the notion of cluster search, in which a query is matched directly against nodes in the cluster tree and the best matching subtree is returned. Counting all pairs, the cost of constructing the cluster trees can be no less than proportional to N^2, where N is the number of documents in the corpus. Although clustering experiments have been conducted on corpora with documents numbering in the low tens of thousands, the intrinsic quadratic order of these clustering procedures makes them poorly suited to corpora that are expected to continue to increase in size. Similarly, although cluster searching has shown some promising results, the method tends to favor the most computationally expensive similarity measures and seldom yields greatly increased performance over other standard methods.
Hierarchical methods are intrinsically quadratic in the number of documents to be clustered, because all pairs of similarities must be considered. This sharply limits their usefulness, even given procedures that attain this theoretical quadratic bound. Partitional strategies (those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions) are, by contrast, typically rectangular, i.e., their cost is proportional to the product of the size of the partition and the number of documents to be clustered. Generally, these procedures proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement, the procedure can be iterated with, at each stage, a hopefully improved selection of cluster seeds. However, to be useful for cluster search the partition must be fairly fine, since it is desirable for each set to contain only a few documents. For example, a partition can be generated whose size is related to the number of unique words in the document collection. From this perspective, the potential computational benefits of a partitional strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason, partitional strategies have not been aggressively pursued by the information retrieval community.
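The seed-based procedure can be sketched as below. This is an illustrative sketch only: the function names, the use of a centroid as the "improved" seed, the fixed number of refinement rounds, and the numeric toy data are assumptions, since the text does not prescribe how seeds are chosen or improved.

```python
def assign_to_seeds(docs, seeds, sim):
    """One assignment pass: each document goes to its most similar seed.
    Cost is len(docs) * len(seeds) similarity evaluations (the rectangular cost)."""
    partition = [[] for _ in seeds]
    for d in docs:
        best = max(range(len(seeds)), key=lambda k: sim(d, seeds[k]))
        partition[best].append(d)
    return partition

def refine(docs, seeds, sim, centroid, rounds=3):
    """Iterate assignment, replacing each seed by the centroid of its current set.
    An empty set keeps its previous seed."""
    seeds = list(seeds)
    partition = assign_to_seeds(docs, seeds, sim)
    for _ in range(rounds):
        seeds = [centroid(c) if c else s for c, s in zip(partition, seeds)]
        partition = assign_to_seeds(docs, seeds, sim)
    return partition, seeds

# Toy one-dimensional "documents" with deliberately poor initial seeds.
docs = [1, 2, 3, 10, 11, 12]
closeness = lambda a, b: -abs(a - b)
mean = lambda c: sum(c) / len(c)

first_pass = assign_to_seeds(docs, [1, 2], closeness)
partition, seeds = refine(docs, [1, 2], closeness, mean, rounds=3)
```

With the poor initial seeds 1 and 2, the first pass puts almost everything in the second set; after refinement the partition settles into the two natural groups. Each pass costs one similarity evaluation per document per seed, so a very fine partition makes the cost approach that of the quadratic methods.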
The standard formulation of cluster search presumes a query, the user's expression of an information need. The task is then to search the collection for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query. For example, the user may not be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit himself to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to gain an appreciation for the general information content of the collection. It seems appropriate to describe this as browsing rather than search, since it is at one extreme of a spectrum of possible information access situations, ranging from requests for specific documents to broad, open-ended questions with a variety of possible answers. Standard information access techniques tend to emphasize search. This is seen especially clearly in cluster search, where a technology capable of topic extraction, i.e., clustering, is submerged from view and used only as an assist for near-neighbor searching.
In proposing an alternative application for clustering in information access, we take our inspiration from the access methods typically provided with a conventional textbook. If one has a specific question in mind, and specific terms which define that question, one consults an index, which directs one to passages of interest, keyed by search words. However, if one is simply interested in gaining an overview, one can turn to the table of contents, which lays out the logical structure of the text for perusal. The table of contents gives one a sense of the types of questions that might be answered if a more intensive examination of the text were attempted, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents and searching the index.
By direct analogy, an information access system is proposed herein, which can have, for example, two components: a browsing tool which uses a cluster-based, dynamic table-of-contents metaphor for navigating a collection of documents; and one or more word-based, directed text search tools, such as similarity search, or the search technique described in U.S. patent application Ser. No. 07/745,794 to Jan O. Pedersen et al., filed Aug. 16, 1991, and entitled "An Iterative Technique For Phrase Query Formation and an Information Retrieval System Employing Same". The browsing tool describes groups of similar documents, one or more of which can be selected for further refinement. This selection/refinement process can be iterated until the user is directly viewing individual documents. Based on documents found in this process, or on terms used to describe document groups, the user may at any time switch to a more focused search method. In particular, it is anticipated that the browsing tool will not necessarily be used to find particular documents, but may instead assist the user in formulating a search request, which will then be evaluated by some other means.
U.S. Pat. No. 4,956,774 to Shibamiya et al. discloses a method for selecting an access path in a relational database management system having at least one index. The first step is to select a number of most frequently occurring values of at least part of a key of the index. The number is greater than zero and less than the total number of such values. Statistics on the frequency of occurrence of the selected values are collected. An estimate of the time required to use the index as the access path is made, based at least in part on the index's most frequently occurring values statistics. The estimate is used as the basis at least in part for selecting an access path for the query. The database optimizer described is hierarchically organized in order of word frequency.
"Recent trends in hierarchic document clustering: A critical review" by Peter Willett, Information Processing & Management, Vol. 24, No. 5, pages 577-97 (1988, printed in Great Britain) describes the calculation of interdocument similarities and clustering methods that are appropriate for document clustering. The article further discusses procedures that can be used to allow the implementation of the aforementioned methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies, and the article presents results from a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering methods. The article suggests that a complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
"Understanding Multi-Articled Documents" by Tsujimoto et al., presented in June 1990 in Atlantic City, N.J. at the 10th International Conference on Pattern Recognition, describes an attempt to build a method to understand document layouts without the assistance of character recognition results, i.e., without the meaning of the contents. It is shown that documents have an obvious hierarchical structure in their geometry, which is represented by a tree. A small number of rules are introduced to transform the geometric structure into the logical structure which represents the semantics carried by the documents. A virtual field separator technique is employed to utilize information carried by special constituents of documents, such as field separators and frames, keeping the number of transformation rules small.