Word clouds exist to help a viewer comprehend extremely large and complex masses of information concerning the words they display. Examples abound of data sets so complex that they cannot be imagined in their entirety. Human intuition works very well when comparing two or three categories of data: for example, the relationship between height and weight in a population can be depicted with a two-dimensional Cartesian graph, and the relationship between height, weight, and longevity requires a three-dimensional graph, which may be hard to draw but is not particularly taxing to the imagination. However, it is quite common to encounter a database table that contains a hundred or more columns, each recording a different category of facts about the subject of study. To depict every relationship between all of the categories of data in a hundred-column table would require a hundred-dimensional graph, that is, a graph with not only an x-axis, a y-axis, and a z-axis, but 97 other axes; even if there were a way to draw such a graph in its entirety, the result would be incomprehensible to even the mind most adept at spatial reasoning.
One way to respond to this level of complexity is to project the many dimensions onto a two- or three-dimensional space in such a way as to preserve some aspect of the relationships among them. The resulting space can itself be depicted in two or three dimensions, and is thus amenable to human comprehension. By way of analogy, imagine a light shining on a cube so that its shadow is cast on a sheet of paper. When the light shines squarely on a single face of the cube, and the paper is parallel to the cube's opposite face, the shadow of the cube on the paper, that is, its two-dimensional projection, is a square. If the paper instead faces one corner of the cube and the light shines upon the opposite vertex, the projection might appear hexagonal. A person viewing a series of such projections, while understanding how they were produced, could use them to deduce the shape of the overall cube.
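The cube analogy can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes a parallel (orthographic) projection rather than a point light source, and the names `project`, `shadow`, and `CUBE` are hypothetical helpers introduced only for illustration.

```python
import math
from itertools import product

def project(point, u, v):
    """Parallel (orthographic) projection of a 3-D point onto the plane
    spanned by the orthonormal vectors u and v (an assumed simplification
    of the light-and-shadow analogy)."""
    x = sum(p * a for p, a in zip(point, u))
    y = sum(p * b for p, b in zip(point, v))
    return (x, y)

def shadow(vertices, u, v, ndigits=9):
    """Distinct 2-D points hit by the projected vertices, rounded so that
    floating-point noise does not split one point into several."""
    return sorted({tuple(round(c, ndigits) for c in project(p, u, v))
                   for p in vertices})

# The 8 vertices of a cube centered at the origin.
CUBE = list(product((-1.0, 1.0), repeat=3))

# Face-on view: look along the z-axis, so the paper is the x-y plane.
face_on = shadow(CUBE, (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

# Corner-on view: look along the body diagonal (1, 1, 1); u and v are
# orthonormal vectors spanning the plane perpendicular to that diagonal.
s2, s6 = math.sqrt(2.0), math.sqrt(6.0)
corner_on = shadow(CUBE, (1 / s2, -1 / s2, 0.0), (1 / s6, 1 / s6, -2 / s6))

# The two vertices on the viewing diagonal collapse onto the center;
# the remaining points form the outline of the shadow.
outline = [p for p in corner_on if math.hypot(*p) > 1e-6]

print(len(face_on), "distinct shadow points face-on")
print(len(corner_on), "distinct shadow points corner-on,",
      len(outline), "of them on the outline")
```

Under these assumptions the face-on view yields the four corners of a square, while the corner-on view yields six outline points equidistant from the center, which is the hexagonal shadow described above.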
Of course, to analyze a cube in this way would be unnecessary, but one way to explore a four-dimensional “cube,” defined as a four-dimensional polytope in which all edges are the same length and meet at right angles, is to view a series of three-dimensional figures that represent projections of the hypercube onto three dimensions. The same approach could be used to study a 100-dimensional “cube.” Likewise, while it is impossible to depict the relationships among 100 categories of data in a single comprehensible drawing, it is possible to draw the relationships between any two or three categories in the set. Similarly, one may use this approach to depict relationships between two or three combinations of categories. For instance, in a data set made up of physical and demographic data concerning a group of people, one could graph height-to-weight ratio against blood pressure, which would be a projection of the data set onto two dimensions, one of which is a combination of two dimensions in the data set. To explore the relationships between other categories in the data set would require a different projection. The foregoing example is extremely simple. The selection of a projection, and the number of relationships that the projection can reveal, can be far subtler and involve much more sophisticated mathematics.
Once the data has been projected onto a manageable set of dimensions, the challenge is to portray the information in that projection in a way that is intuitively meaningful to a person viewing the depiction. One efficient way to depict a three-dimensional set of relationships between categories of data, particularly data pertaining to texts, is with a “word cloud.” A word cloud is a kind of weighted list in which a set of words, often words taken from a particular text, is displayed as on a page, and in which at least the font size of each depicted word varies depending on some attribute of the word. For example, the positions of the words in the word cloud could be determined by alphabetical order, and the sizes of the words could depend on how frequently each word appears in a text. The positions of the words in the cloud could also be determined by the sizes of the words, by aesthetic considerations, or by further information about relationships between the words that the designer of the word cloud wishes to convey. Some word clouds use color as well as size to show something about the words; a bi-chromatic word cloud, for example, could allow the viewer to see which person in a dialogue uttered a given word. A more subtle use of coloring is exemplified by collocate clouds, which use shades of color to depict how often a given word appears exclusively with the word the user has provided, while the size of the word indicates how frequently it appears in a text within a given distance of the word provided by the user. A properly designed cloud that uses each of the attributes mentioned above to convey a different piece of information about a word can depict a surprising amount of data in an intuitively clear way.
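The simplest case described above, alphabetical position with font size driven by frequency, can be sketched as a weighted list. This is a minimal sketch under stated assumptions: the function name `word_cloud_entries`, the point-size range, the linear scaling, and the tokenizing rule are all illustrative choices, not a description of any particular word cloud program.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

def word_cloud_entries(text: str,
                       min_pt: int = 10,
                       max_pt: int = 48,
                       top_n: Optional[int] = None
                       ) -> List[Tuple[str, int, int]]:
    """Return (word, count, font_size) triples in alphabetical order.
    Font size is scaled linearly between min_pt and max_pt according
    to how often each word occurs (an assumed, simple weighting)."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    most_common = Counter(words).most_common(top_n)
    if not most_common:
        return []
    counts = [count for _, count in most_common]
    lo, hi = min(counts), max(counts)
    span = hi - lo
    entries = []
    for word, count in most_common:
        # If every word is equally frequent, draw them all at full size.
        scale = (count - lo) / span if span else 1.0
        size = round(min_pt + scale * (max_pt - min_pt))
        entries.append((word, count, size))
    # Position in the cloud is determined by alphabetical order.
    return sorted(entries)

sample = "the cat and the hat and the bat"
for word, count, size in word_cloud_entries(sample):
    print(f"{word:<4} count={count} size={size}pt")
```

Other attributes mentioned above, such as color for speaker or for collocation strength, could be added as further fields in each triple; the sketch leaves them out to keep the weighting step visible.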
It is hardly surprising that word clouds, which necessarily involve a great deal of searching, sorting, and computation over large volumes of data, are generally created by software and are most often encountered in Internet applications. Many software programs, available on the web or on stand-alone computers, generate various kinds of word clouds using the design parameters described above, among others. The extant word cloud generation programs have in common a tendency to produce a single view of a given word cloud, reflecting one particularly interesting way of analyzing the textual or other data the program is designed to display. While the results can be fascinating, the programs currently in existence do not permit the user to explore the underlying data sets more fully by customizing the word clouds and manipulating the data projections that form the clouds' internal basis. Thus, there remains a need for a word cloud generating program that fully exploits the potential of word clouds to bring complex data within the reach of human comprehension.