1. Field of the Invention
This invention relates to document analysis, and in particular to a system and method for automatic and unsupervised clustering of unformatted text documents.
2. Related Art
In the realm of data management, “clustering” is the grouping of documents into related and similar buckets (“clusters”) for ease of processing and reviewing. Unsupervised clustering (i.e., clustering without user control or intervention) is becoming a required feature when processing search results, for example in the legal compliance and discovery environment.
There are many conventional text clustering algorithms.
For example, http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html describes a K-Means algorithm which requires that the number of clusters be defined a priori. For this reason, the K-Means algorithm is not very useful when trying to discover unknown patterns, which by definition will require an unknown number of clusters.
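The dependence on a predefined cluster count can be seen in a minimal sketch of the standard K-Means iteration (Lloyd's algorithm). This is an illustration only, not code from the cited reference; the function name `kmeans`, its parameters, and the sample points are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Illustrative K-Means on a list of equal-length numeric tuples.

    The cluster count k is a required argument: it must be chosen
    before any structure in the data has been discovered.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

The algorithm returns exactly `k` centroids whatever the data looks like, so a caller who does not know how many groups exist has to guess the value or run the algorithm repeatedly with different values.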
Another example of a clustering algorithm is described at http://ksi.cpsc.ucalgary.ca/AIKM97/ohta/yuiko.html, which presents a clustering algorithm that builds a concept hierarchy based on multiple inheritance. This multiple-inheritance structure can result in an exponentially increasing number of required document comparisons, which can make the technique computationally very expensive.
At http://www.statsoft.com/textbook/stfacan.html, other methods of reducing the large vector space, using techniques such as Principal Component Analysis and Latent Semantic Analysis, are identified. These techniques are based on reducing the number of “terms” in the vector by, for example, reducing the words to their root (or stem), or by finding the principal components of a vector by statistical factor analysis. However, the techniques do nothing to reduce the actual number of vector comparisons when clustering, and therefore remain computationally expensive.
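The point can be illustrated with a rough count, assuming (for the sake of the example) that every pair of document vectors is compared once and that each comparison touches every term. The function names below are hypothetical and are not taken from the cited reference.

```python
def pairwise_comparisons(num_documents: int) -> int:
    """Number of distinct document pairs: N * (N - 1) / 2."""
    return num_documents * (num_documents - 1) // 2

def comparison_cost(num_documents: int, num_terms: int) -> int:
    """Rough per-term operation count for comparing all pairs once."""
    return pairwise_comparisons(num_documents) * num_terms

# Shrinking each vector from 5,000 terms to 50 terms makes every
# comparison cheaper, but the number of comparisons is unchanged.
pairs = pairwise_comparisons(1000)          # 499500
cost_full = comparison_cost(1000, 5000)     # 2497500000
cost_reduced = comparison_cost(1000, 50)    # 24975000
```

Under these assumptions, dimensionality reduction lowers the cost of each comparison while the pair count still grows quadratically with the number of documents.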
Papers at http://www.resample.com/xlminer/help/HClst/HClst_intro.htm and at http://ercolino.isti.cnr.it/mirco/papers/pakdd2005-Nanni-LNAI-version.pdf describe the standard and optimized HAC (hierarchical agglomerative clustering) algorithms. The complexity of the standard HAC algorithm is O(N² log N), with improvements of N² to N log N. The second of these papers suggests using approximated values for calculating document similarities, based on the principle that if A is similar to B and B is similar to C, then A is also similar to C, in which case A is given a similarity value to C based on its similarity to B. This approach may not be particularly accurate, since it assumes that similarity between vectors is a transitive property of a domain (which may not be the case).
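A small counterexample shows why transitivity cannot be assumed in general. The sketch below uses cosine similarity as the measure, which is an assumption made for illustration (the cited papers may use a different measure), and the three two-term vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical documents over a two-term vocabulary.
doc_a = (1.0, 0.0)  # contains only the first term
doc_b = (1.0, 1.0)  # contains both terms
doc_c = (0.0, 1.0)  # contains only the second term

sim_ab = cosine_similarity(doc_a, doc_b)  # about 0.707
sim_bc = cosine_similarity(doc_b, doc_c)  # about 0.707
sim_ac = cosine_similarity(doc_a, doc_c)  # 0.0, no terms in common
```

Here A and C are each fairly similar to B, yet they share no terms with each other, so a similarity value for A and C inferred from B would overstate the true value.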
At http://www.ics.uci.edu/~eppstein/projects/pairs/Talks/ClusterGroup.pdf, different methods of speeding up clustering algorithms are described.
For many of the reasons described above, conventional unsupervised clustering can typically only process up to a few hundred documents before requiring large amounts of memory and/or processor time, or requiring multiple parallel computers. It would therefore be desirable to provide an approach to unsupervised clustering that can scale up to hundreds of thousands of search result documents for use in, for example, legal compliance and discovery applications.