Most academics and numerous others routinely attempt to discover useful information by reading large quantities of unstructured text. The corpus of text under study may be literature to review, news stories to understand, medical information to decipher, blog posts, comments, product reviews, or emails to sort, or audio-to-text summaries of speeches to comprehend. In each case, the purpose is to discover useful information from the unstructured text. This is a time-consuming task, and the volume of information is increasing at a very fast rate: a quantity of text equivalent to that in the Library of Congress is produced in emails alone every ten minutes.
An essential part of information discovery from unstructured text involves some type of classification. However, classifying documents in an optimal way is an extremely challenging computational task that no human being can come close to optimizing by hand. The task involves choosing the “best” (by some definition) among all possible ways of partitioning a set of n objects (the number of such partitions is known mathematically as the Bell number). The task may sound simple, but merely enumerating the possibilities is essentially impossible for even moderate numbers of documents. For example, the number of partitions of a set of merely 100 documents is 4.76e+115, which is considerably larger than the estimated number of elementary particles in the universe. Even if the number of categories is limited, the number of partitions is still far beyond human abilities; for example, the number of ways of classifying 100 documents into two categories is 6.33e+29.
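The counts quoted above can be checked with exact integer arithmetic. The following is a minimal sketch, not part of any method described here: the function names `bell_number` and `stirling2` are illustrative, and the two-category figure is assumed to count partitions into exactly two non-empty, unlabeled categories (a Stirling number of the second kind).

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n objects (Bell triangle method)."""
    row = [1]  # row 0 of the Bell triangle; its first entry is B_0
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        next_row = [row[-1]]
        for value in row:
            # Each further entry adds the entry to its left and the one above-left.
            next_row.append(next_row[-1] + value)
        row = next_row
    return row[0]


def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n objects into exactly k non-empty groups."""
    # table[j] holds S(i, j) for the current i; start from S(0, 0) = 1.
    table = [1] + [0] * k
    for i in range(1, n + 1):
        # Update right to left so table[j - 1] still holds S(i - 1, j - 1).
        for j in range(min(i, k), 0, -1):
            table[j] = j * table[j] + table[j - 1]
        table[0] = 0  # S(i, 0) = 0 for i >= 1
    return table[k]


if __name__ == "__main__":
    print(f"{bell_number(100):.2e}")  # partitions of 100 documents
    print(stirling2(100, 2))          # splits of 100 documents into two groups
```

Under these assumptions, the two-category count equals 2^99 − 1, roughly 6.3e+29, which matches the order of magnitude stated above.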
In addition, the task of optimal classification involves more than enumeration. Classification typically involves assessing the degree of similarity between each pair of documents, and then creating a set of clusters called a “clustering” by simultaneously maximizing the similarity of documents within each cluster and minimizing the similarity of documents across clusters. For 100 documents, $\binom{100}{2} = 4{,}950$ similarities need to be remembered while sorting documents into categories and simultaneously optimizing across the enormous number of possible clusterings.
This contrasts with the roughly 4 to 7 items (or somewhat more, if ordered hierarchically) that a human being can keep in short-term working memory. Various algorithms to simplify this process are still extremely onerous and are likely to lead to satisficing rather than optimizing. In addition, this process assumes that humans can reliably assess the similarity between documents, which is probably unrealistically optimistic given that the ordering of the categories, the ordering of the documents, and variations in human coder training typically prime human coders to respond in different ways. In practice, inter-coder reliability, even for well-trained human coders classifying documents into given categories, is rarely very high.
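To make the pairwise bookkeeping described above concrete, the sketch below counts the document pairs and scores a candidate clustering as mean within-cluster similarity minus mean between-cluster similarity. This particular score, the function name `clustering_score`, and the toy similarity matrix are assumptions chosen for illustration; the text does not prescribe a specific objective.

```python
from itertools import combinations
from math import comb


def clustering_score(similarity, labels):
    """Mean within-cluster similarity minus mean between-cluster similarity.

    similarity: symmetric matrix (list of lists) of pairwise similarities.
    labels: cluster label for each document, in the same order.
    """
    within, between = [], []
    # Visit every unordered pair of documents exactly once.
    for i, j in combinations(range(len(labels)), 2):
        target = within if labels[i] == labels[j] else between
        target.append(similarity[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within) - mean(between)


if __name__ == "__main__":
    print(comb(100, 2))  # number of document pairs for 100 documents

    # Hypothetical similarities for four documents (illustrative values only).
    sim = [
        [1.0, 0.9, 0.1, 0.2],
        [0.9, 1.0, 0.2, 0.1],
        [0.1, 0.2, 1.0, 0.8],
        [0.2, 0.1, 0.8, 1.0],
    ]
    print(clustering_score(sim, [0, 0, 1, 1]))  # groups the similar documents
    print(clustering_score(sim, [0, 1, 0, 1]))  # splits the similar documents
```

Even in this four-document toy case, comparing clusterings requires consulting all six pairwise similarities; with 100 documents the same comparison draws on 4,950 values for every candidate clustering.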
Unfortunately, even fast computers cannot classify optimally, at least not without substantial prior knowledge about the substance of the problem to which a particular method is applied. That is, the implicit goal of the prior art, namely developing a cluster analysis method that works well across applications, is actually known to be impossible due to two theorems. A theorem called the “ugly duckling theorem” holds that, without assumptions, every pair of documents is equally similar and, as a result, every partition of documents is equally similar. Another theorem called the “no free lunch theorem” holds that every possible clustering method performs equally well on average over all possible substantive applications. Thus, any single cluster analysis method can only be optimal with respect to some specific set of substantive problems and type of data set.
Although application-independent clustering is impossible, very little is known about the substantive problems for which existing cluster analysis methods work best. Each of the numerous known cluster analysis methods is justified from a statistical, computational, data analysis, machine learning, or other perspective, but very few are justified in a way that makes it possible to know beforehand the data sets with which any one would work well. For example, for a corpus of all blog posts about all candidates during the 2008 U.S. presidential primary season, there are many clustering methods that might work, including model-based approaches, subspace clustering methods, spectral approaches, grid-based methods, graph-based methods, fuzzy k-modes, affinity propagation, self-organizing maps, and many others. All these methods and many other clustering algorithms are clearly described in the literature, and most have been implemented in available computer code, but very few hints have been given or are known about exactly when any of these methods would work best, well, or better than other methods with this particular data set.
Consider, for example, the finite normal mixture clustering model, which is a particularly “principled statistical approach”. This model is easy to understand, has a well-defined likelihood, can be interpreted from a frequentist or Bayesian perspective, and has been extended in a variety of ways. However, the “ugly duckling” and “no free lunch” theorems predict that no one approach, including this one, is universally applicable or optimal across applications. Yet a search of the prior art literature produces no suggestion as to whether a particular corpus, composed of documents of particular substantive topics, structures, or patterns, is likely to reveal its secrets best when analyzed with this method. The method has been applied to various data sets, but it is seemingly impossible to know when it will work before looking at the results in any application. Moreover, finite normal mixtures are among the simplest and, from a statistical perspective, most transparent cluster analysis approaches available; knowing when most other approaches will work is likely to be even more difficult.
Developing intuition for when specific cluster analysis methods work best might be possible in some special cases, but doing so for most of the rich diversity of available methods seems infeasible. Indeed, this problem occurs in unsupervised learning problems almost by definition, since the goal of the analysis is to discover unknown facts. If something as specific as the model from which the data were generated (up to some unknown parameters) were known beforehand, then the analysis would likely not be at an early discovery stage.