1. Field of the Invention
This invention relates to document clustering, and particularly to document clustering based on cohesion terms.
2. Description of Background
Before our invention, businesses had systematically increased the leverage gained from enterprise data through technologies such as relational database management systems and techniques such as data warehousing. It is conjectured, however, that the amount of knowledge encoded in electronic text far surpasses that available in data alone, and the ability to take advantage of this wealth of knowledge is only beginning to meet the challenge. One important step toward realizing this potential has been to structure the inherently unstructured information in meaningful ways. A well-established first step in gaining understanding is to segment examples into meaningful categories.
Previous attempts to automatically create categorizations in unstructured data have relied on algorithms created for structured data sets. Such approaches convert text examples into numeric feature vectors, sometimes using latent semantic indexing and principal component analysis to reduce dimensionality, and then cluster the data using well-established clustering techniques such as k-means or Expectation Maximization (EM). These approaches attempt to maximize intra-cluster similarity while minimizing inter-cluster similarity.
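By way of illustration only, the conventional approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular prior system: the function names (vectorize, kmeans), the whitespace tokenizer, the length-normalized term-frequency features, the choice of the first k documents as initial centroids, and the sample documents are all hypothetical. The optional dimensionality-reduction step (latent semantic indexing or principal component analysis) is omitted for brevity.

```python
import math
from collections import Counter


def vectorize(docs):
    """Convert each text into a unit-length term-frequency vector
    over a vocabulary shared by all documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [float(counts[term]) for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=20):
    """Plain k-means. Assumption: the first k vectors seed the centroids,
    which keeps the sketch deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical sample documents, for illustration only.
docs = [
    "printer driver install error",
    "password reset login failure",
    "printer paper jam error",
    "login password expired",
]
vocab, vectors = vectorize(docs)
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

In this sketch the two printer-related texts fall into one cluster and the two login-related texts into the other. The output is only a list of cluster indices and numeric centroids; nothing in the procedure itself yields a human-readable description of what each cluster represents.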
The problem with approaches of this kind is that they often produce categories that humans cannot readily interpret. The fact that a group of documents shares a degree of similarity across an artificial feature space does not ensure that the documents in that category, taken together, form an easily understood concept. This has led to the problem of cluster naming, to which no practical solution has been found.