Most academics and numerous others routinely attempt to discover useful information by reading large quantities of unstructured text. The corpus of text under study may be literature to review, news stories to understand, medical information to decipher, blog posts, comments, product reviews, or emails to sort, or audio-to-text summaries of speeches to comprehend. This is a time-consuming task, and the volume of text is growing very quickly: a quantity of text equivalent to that held in the Library of Congress is produced in emails alone every ten minutes.
An essential part of information discovery from unstructured text involves some type of classification. However, classifying documents in an optimal way is an extremely challenging computational task that no human being can come close to optimizing by hand. The task involves choosing the “best” (by some definition) among all possible ways of partitioning a set of n objects (the number of such partitions is known mathematically as the Bell number). The task may sound simple, but merely enumerating the possibilities is essentially impossible for even moderate numbers of documents. For example, the number of partitions of a set of merely 100 documents is 4.76e+115, which is considerably larger than the estimated number of elementary particles in the universe. Even if the number of categories is restricted, the count is still far beyond human abilities; for example, the number of ways of classifying 100 documents into two categories is 6.33e+29.
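The two counts quoted above can be checked with exact integer arithmetic. The following is a minimal sketch, not part of any disclosed method: the function names `bell_number` and `stirling2` are illustrative, and the sketch assumes that "classifying into two categories" means partitioning into exactly two non-empty, unlabeled groups (a Stirling number of the second kind).

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n objects, via the Bell triangle."""
    if n == 0:
        return 1
    row = [1]  # first row of the Bell triangle; its last entry is B(1)
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbor.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n objects into exactly k non-empty groups."""
    # Recurrence: S(i, j) = j * S(i - 1, j) + S(i - 1, j - 1), with S(0, 0) = 1.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


if __name__ == "__main__":
    print(f"partitions of 100 documents: {float(bell_number(100)):.2e}")
    print(f"two-category partitions of 100 documents: {stirling2(100, 2)}")
```

Under that assumption the two-category count equals 2^99 − 1, roughly 6.338e+29, while the unrestricted count for 100 documents is on the order of 1e+115.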
In addition, the task of optimal classification involves more than enumeration. Classification typically involves assessing the degree of similarity between each pair of documents, and then creating a set of clusters called a “clustering” by simultaneously maximizing the similarity of documents within each cluster and minimizing the similarity of documents across clusters. For 100 documents, $\binom{100}{2} = 4{,}950$ similarities need to be remembered while sorting documents into categories and simultaneously optimizing across the enormous number of possible clusterings.
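To make the pair count and the within-versus-across trade-off concrete, here is a small sketch. The scoring rule (mean within-cluster similarity minus mean across-cluster similarity) is only one plausible objective chosen for illustration; the text does not specify one, and the name `clustering_score` and the toy similarity matrix are hypothetical.

```python
from itertools import combinations
from math import comb


def clustering_score(similarity, labels):
    """Mean within-cluster similarity minus mean across-cluster similarity.

    `similarity[i][j]` is the similarity of documents i and j, and
    `labels[i]` is the cluster assigned to document i. This objective is an
    illustrative assumption, not one prescribed by the text.
    """
    within, across = [], []
    for i, j in combinations(range(len(labels)), 2):
        bucket = within if labels[i] == labels[j] else across
        bucket.append(similarity[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within) - mean(across)


# Toy data: documents 0 and 1 resemble each other, as do documents 2 and 3.
toy_similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

if __name__ == "__main__":
    print("pairs among 100 documents:", comb(100, 2))
    print("natural grouping:", clustering_score(toy_similarity, [0, 0, 1, 1]))
    print("mismatched grouping:", clustering_score(toy_similarity, [0, 1, 0, 1]))
```

Even with only four documents the score depends on all six pairwise similarities at once, which is the bookkeeping burden the paragraph describes at the scale of 4,950 pairs.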
By contrast, a human being can keep only somewhere between 4 and 7 items (or somewhat more, if ordered hierarchically) in short-term working memory. Various algorithms to simplify this process are still extremely onerous and are likely to lead to satisficing rather than optimizing. In addition, this process assumes that humans can reliably assess the similarity between documents, which is probably unrealistically optimistic given that the ordering of the categories, the ordering of the documents, and variations in human coder training typically prime human coders to respond in different ways. In practice, inter-coder reliability, even for well-trained human coders classifying documents into given categories, is rarely very high.
Since a crucial component of human conceptualization involves classifying objects into smaller numbers of easier-to-comprehend categories, an expansive literature in biology, computer science, statistics, and the social sciences has arisen to respond to this challenge. The literature is focused on fully automatic clustering (FAC) algorithms designed to produce insightful partitions of input objects with minimal human input. At least 150 such FAC algorithms have been characterized in the literature. Each of these methods works well on some data sets, but predicting which method, if any, will work well for a given application is often difficult or impossible, and none works well across applications.
Other articles disclose computer-assisted clustering (CAC) methods designed to help a human user find an insightful or useful conceptualization from a choice of clusterings. The intended trade-off is that CAC methods require an investment of more user time relative to FAC methods in return for better, more insightful clusterings. However, CAC methods, in turn, require considerably less user time than completely unassisted human clustering. For example, in an article entitled “A General Purpose Computer-Assisted Document Clustering Methodology,” J. Grimmer and G. King, 2010, a disclosed CAC method applies a large set of FAC methods to a data set and scales the resulting clusterings so that each is represented by a point in two-dimensional space, with points closer together representing clusterings that are more similar. These points are then used as basis partitions to construct millions of new clusterings. A method is defined for identifying new clusterings in the two-dimensional space through the creation of local averages of the clusterings from the statistical model. In this way, every point in the space defines a clustering. This space is then graphically displayed, and a user can move a cursor around the space and (in an accompanying display window) watch one clustering morph into another. This CAC method was designed to help users quickly and efficiently choose clusterings that they, and others, found more insightful or useful than clusterings created by existing FAC methods or by following traditional approaches without computer assistance.
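Two ingredients of the CAC method described above can be sketched in a few lines: a distance between clusterings (so that similar clusterings can be placed close together) and a local average around a chosen point of the two-dimensional space. This is a rough illustration under stated assumptions, not the cited article's implementation. The text above names neither the distance nor the averaging rule, so variation of information, the Gaussian kernel, the `bandwidth` parameter, and the function names are all assumptions. The two-dimensional scaling step itself is omitted, and the coordinates passed to `local_average` are hypothetical.

```python
from collections import Counter
from math import exp, log


def variation_of_information(a, b):
    """VI(a, b) = H(a | b) + H(b | a), in nats; zero when the partitions match."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    vi = 0.0
    for (x, y), n_xy in joint.items():
        p_xy = n_xy / n
        vi -= p_xy * (log(p_xy / (count_a[x] / n)) + log(p_xy / (count_b[y] / n)))
    return vi


def distance_matrix(clusterings):
    """Pairwise distances between clusterings, the input to a 2-D scaling step."""
    return [[variation_of_information(a, b) for b in clusterings] for a in clusterings]


def local_average(clusterings, coords, point, bandwidth=1.0):
    """Kernel-weighted average of co-membership matrices near `point`.

    Entry [i][j] of the result is the weighted share of nearby clusterings
    that place documents i and j in the same cluster. Turning this soft
    matrix back into a hard clustering is a further step not shown here.
    """
    weights = [
        exp(-((x - point[0]) ** 2 + (y - point[1]) ** 2) / (2 * bandwidth ** 2))
        for x, y in coords
    ]
    total = sum(weights)
    n = len(clusterings[0])
    return [
        [
            sum(w * (c[i] == c[j]) for w, c in zip(weights, clusterings)) / total
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three toy clusterings of four documents, as label lists.
    basis = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
    print(distance_matrix(basis))
    # Hypothetical 2-D positions for the basis clusterings.
    positions = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
    print(local_average(basis, positions, point=(0.5, 0.0)))
```

Moving `point` continuously across the plane changes the weights smoothly, which is one way to picture how one clustering could morph into another as the cursor moves.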
However, in this CAC method, the clusterings produced by all existing FAC methods represent only a small portion of the possible clusterings. Since these clusterings are used to construct the clustering space that can be explored by the user, the aforementioned CAC method inherently limits the clustering space and omits many clusterings.