Businesses routinely encounter large collections of documents. For example, companies regularly receive feedback, suggestions, and grievances from customers through survey responses and similar channels. There is value in understanding the important issues raised in such document collections. For example, a business may wish to quickly identify the important issues raised in customer feedback comments in order to improve the business.
Given a large collection of documents, for example a collection of email documents, clustering enables a high-level understanding of the significant concepts, issues, or topics mentioned in the documents. Most clustering approaches cluster unigrams (a unigram is a single word) according to each unigram's context, which is in turn formed by the other unigrams occurring around it in the documents. Clustering based on unigrams, however, has significant limitations, such as low interpretability.
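To make the conventional approach concrete, the following is a minimal sketch of context-based unigram clustering. It is an illustration only, not a description of any particular prior system: the function names (`context_vectors`, `cosine`, `cluster_unigrams`), the window size, the similarity threshold, and the greedy grouping strategy are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
import math


def context_vectors(documents, window=2):
    """Map each unigram to counts of the unigrams seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenization (assumption)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_unigrams(vectors, threshold=0.5):
    """Greedy single pass: a unigram joins the first cluster whose seed
    (first member) has a sufficiently similar context, else starts a new one."""
    clusters = []
    for word in sorted(vectors):
        for cluster in clusters:
            if cosine(vectors[word], vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


docs = ["the delivery was late", "the shipment was late"]
vectors = context_vectors(docs)
clusters = cluster_unigrams(vectors, threshold=0.99)
print(clusters)
```

In this toy input, "delivery" and "shipment" occur in identical contexts and therefore fall into the same cluster. Each resulting cluster is only a set of isolated words, which gives a sense of why such output can be hard to interpret as an issue or topic.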