Data is more than the numbers, values, or predicates of which it is composed. Data resides in multi-dimensional spaces which harbor rich and variegated landscapes that are not only strange and convoluted, but are also not readily comprehensible by the human brain. The most complicated data arises from measurements or calculations that depend on many apparently independent variables. Data sets with hundreds of variables arise today in many contexts, including, for example: gene expression data for uncovering the link between the genome and the various proteins for which it codes; demographic and consumer profiling data for capturing underlying sociological and economic trends; sales and marketing data for huge numbers of products in vast and ever-changing marketplaces; and environmental measurements for understanding phenomena such as pollution, meteorological changes and resource impact issues. International research projects such as the Human Genome Project and the Sloan Digital Sky Survey are also forming massive scientific databases. Furthermore, corporations are creating large data warehouses of historical data on key aspects of their operations. Corporations are also using desktop applications to create many small databases for examining specific aspects of their business.
One challenge with any of these databases is the extraction of meaning from the data they contain: to discover structure, find patterns, and derive causal relationships. Often, the sheer size of these data sets complicates this task and means that interactive calculations that require visiting each record are not feasible. It may also be infeasible for an analyst to reason about or view the entire data set at its finest level of detail. Even when the data sets are small, however, their complexity often makes it difficult to glean meaning without aggregating the data or creating simplifying summaries.
Across the principal operations that may be carried out on data, such as regression, clustering, summarization, dependency modeling, and classification, the ability to see patterns rapidly is of paramount importance. Data comes in many forms, and the most appropriate way to display data in one form may not be the best for another. In the past, where it has been recognized that many methods of display are possible, it has been a painstaking exercise to select the most appropriate one. However, identifying the most telling methods of display can be intimately connected to identifying the underlying structure of the data itself.
Business intelligence is one rapidly growing area that benefits considerably from tools for interactive visualization of multi-dimensional databases. A number of approaches to visualizing such information are known in the art. However, although software programs that implement such approaches are useful, they are often unsatisfactory. Such programs have interfaces that require the user to select the most appropriate way to display the information.
Visualization is a powerful tool for exploring large data sets, both by itself and coupled with data mining algorithms. However, the task of effectively visualizing large databases imposes significant demands on the human-computer interface to the visualization system. The exploratory process is one of hypothesis, experiment, and discovery. The path of exploration is unpredictable, and analysts need to be able to easily change both the data being displayed and its visual representation. Furthermore, the analyst should be able to first reason about the data at a high level of abstraction, and then rapidly drill down to explore data of interest at a greater level of detail. Thus, a good interface both exposes the underlying hierarchical structure of the data and supports rapid refinement of the visualization.
Currently, Tableau's software and Microsoft's Excel are examples of visualization software that create views of datasets. Specifically, Tableau's Table Drop allows users to drag data fields onto a Tableau view to create graphical views. When the view is a text table, the behavior is similar to the drags supported by Excel Pivot Tables. For example, dragging a quantitative data type (Q) onto a text table (X=O Y=O T=Q, where “O” stands for ordinal data) extends the table to put the two measures next to each other (X=O Y=O,Om T=Qm, where “Om” stands for measure ordinal data and “Qm” stands for measure quantitative data). However, Tableau's Table Drop has functionality not found in Excel's Pivot Tables in that it may change the view type of a view when fields are dragged onto the view. For example, dragging a Q onto a bar chart (X=O Y=Q) creates a stacked bar chart (X=O Y=Qm C=Om). Or, if there is already a field with a color encoding (X=O Y=Q C=F) in the view, then the software transforms the Q data into Qm data and places the measure names on the Level of Detail encoding (X=O Y=Qm C=F L=Om). With scatter plots, the logic is similar, except that the transformation of Q to Qm and the placement of measure names on the Level of Detail encoding are triggered if an existing field already has a shape encoding.
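The text table and bar chart cases described above can be illustrated with a minimal sketch. It is not Tableau's implementation: the function name `drop_quantitative`, the dictionary representation of a view, and the rule ordering are assumptions made only for illustration. The encoding keys (X, Y, T, C, L) and type codes (O, Q, Om, Qm, F) follow the shorthand used in the preceding paragraph.

```python
def drop_quantitative(view):
    """Return the view that results from dragging a quantitative field (Q) onto it.

    `view` maps an encoding name ("X", "Y", "T", "C", "L") to a list of
    type codes. This representation is hypothetical.
    """
    # Copy so the caller's view is left unchanged.
    new = {name: list(types) for name, types in view.items()}

    # Text table (X=O Y=O T=Q): place the measures side by side,
    # giving X=O Y=O,Om T=Qm.
    if new.get("T") == ["Q"]:
        new["T"] = ["Qm"]
        new["Y"] = new.get("Y", []) + ["Om"]
        return new

    # Bar chart (X=O Y=Q): the quantitative axis becomes Qm.
    if new.get("X") == ["O"] and new.get("Y") == ["Q"]:
        new["Y"] = ["Qm"]
        if "C" in new:
            # Color is already in use, so the measure names go on Level of Detail.
            new["L"] = new.get("L", []) + ["Om"]
        else:
            # Color is free, so the measure names go on color (stacked bar chart).
            new["C"] = ["Om"]
        return new

    # Other view types, including scatter plots, are not modeled in this sketch.
    raise ValueError("view type not covered by this sketch")


print(drop_quantitative({"X": ["O"], "Y": ["Q"]}))
```

The scatter plot case is left out because the description above gives only its trigger condition (an existing shape encoding) and not the resulting view specification.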
In addition to various software programs, the known art further provides formal graphical presentations. Bertin's Semiology of Graphics, University of Wisconsin Press, Madison Wis. (1983), is an early attempt at formalizing graphic techniques. Bertin developed a vocabulary for describing data and techniques for encoding the data into a graphic. Bertin identified retinal variables (position, color, size, etc.) in which data can be encoded. Cleveland (The Elements of Graphing Data, Wadsworth Advanced Books and Software, (1985), Pacific Grove, Calif. and Visualizing Data, (1993), Hobart Press) used theoretical and experimental results to determine how well people can use these different retinal properties to compare quantitative variations.
Mackinlay's APT system (ACM Trans. Graphics, 5, 110-141, (1986)) was one of the first applications of formal graphical specifications to computer generated displays. APT uses a graphical language and a hierarchy of composition rules that are searched through in order to generate two-dimensional displays of relational data. The Sage system (Roth, et al., (1994), Proc. SIGCHI '94, 112-117) extends the concepts of APT, providing a richer set of data characterizations and forming a wider range of displays. The existing art also provides for the assignment of a mark based upon the innermost data column and row of a dataset (Hanrahan, et al., U.S. Pat. application Ser. No. 11/005,652, “Computer System and Methods for Visualizing Data with Generation of Marks”). Heuristically guided searches have also been used to generate visualizations of data (Agrawala, et al., U.S. Pat. No. 6,424,933, “System and Method for Non-Uniform Scaled Mapping”).
A drawback with the formal graphical specifications of the art is that they do not provide any guidance to a user as to useful and clear visual formats in which a set of data could be rendered. The data is rendered without any analysis of the resulting visualization for clarity or usefulness. Further, where heuristic (trial-and-error) searches are used, the searches can fail, leaving the user with the problem of finding clear or useful views. Heuristic algorithms can also have complex behavior that creates a poor user experience. When a user does not understand why a heuristic algorithm generates certain views, the algorithm becomes unpredictable to the user, and the user will not be inclined to use it.
Based on the background state of the art, as described herein, what is needed are improved methods and graphical interfaces wherein the initial visualization of data has been determined to be a clear and useful visualization, and this visualization is then automatically presented to the user.