Exploratory data analysis (EDA) is a process of examining multidimensional datasets by looking at the distributions and correlations of fields. Using computer-based data visualization systems and tools, a data analyst might quickly generate and analyze dozens or hundreds of data visualizations (e.g., charts and graphs) while seeking to understand the data. The process of moving through the multiple dimensions of data is typically iterative. A data analyst may begin with a broad question and create multiple views (i.e., visualizations of the dataset) that each address some part of the question. These views can inform a more specific question, and so the data analyst might create another view to address it. These increasingly specific questions may require the data analyst to change data representations, for instance, by zooming or filtering views, and to choose new fields to chart, graph and/or explore. Some of the views that a data analyst generates will contain or lead to interesting insights; others may lead to dead ends with less value. When the data analyst has sufficiently addressed the broad question and any follow-up questions, the analyst may continue exploring the dataset with a new broad question and a related series of specific follow-up questions.
Data visualization systems and tools, whether implemented with point-and-click or programmatic user interfaces, support this data exploration process by allowing data analysts to rapidly specify and refine queries, and then view the corresponding data visualizations. Each step in this process involves generating observations of the data. In the context of EDA, an observation is a single fact about the data; it is the unit of knowledge that allows the data analyst to move on to the next step of the analysis. For example, when examining a dataset of flight data, an observation might be, “Airline X is the airline with the most flights in the dataset.” An observation is a more modest unit than the insights that the data analyst might ultimately hope to infer as the outcome of the analysis process. For instance, an insight might combine external contextual information with multiple observations that have resulted from many queries. An example of an insight might be, “the biggest airlines have trouble with congestion near the holidays, while smaller airlines do not.”
For this process of generating observations that lead to interesting insights to be effective, the data visualization system or tool in use by the data analyst must be fast enough to enable rapid iteration. Studies have shown that data analysts lose effectiveness when a query result takes more than five hundred milliseconds to return, and that they are more likely to lose their flow of thought when a computer operation takes more than a second to complete. As such, effective data visualization systems or tools will allow the data analyst to work in what is sometimes referred to as interactive time. While no formal definition is recognized, the concept of interactive time simply means that the system provides a level of query responsiveness that allows the data analyst to maintain concentration and flow of thought.
With smaller datasets, this requirement for data visualization systems and tools to be responsive (that is, to process queries and generate data visualizations rapidly) may not pose any technical challenge. However, with the increasing desire and need to analyze and explore extremely large datasets containing many millions of records, designing a data visualization system or tool that provides the requisite level of responsiveness becomes a technically challenging problem. Specifically, when dataset sizes exceed even a few million records, data analysts run into two fundamental issues: visual scalability and data processing scalability.
In terms of visual scalability, with extremely large datasets, it is impractical to display every element of the dataset. For instance, the number of records returned from a query may far exceed the available pixels on a high-resolution display. As an example, drawing raw data in a scatterplot without aggregation may lead to over-plotting—drawing many points in the same place—and visual clutter. The data can be grouped on a dimension, however, and a single aggregate measure computed for each group. The simplest such aggregate visualization is a bar chart, in which each bar represents the aggregated value of a group. Other data visualizations involving the aggregation of data are also well known, and to a certain extent, provide a partial solution to the problem of visual scalability.
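The grouping-and-aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `aggregate_by` and the record fields `airline` and `delay_min` are hypothetical and do not come from any particular system.

```python
from collections import defaultdict


def aggregate_by(records, dimension, measure):
    """Group records on `dimension` and sum `measure` within each group.

    Each key of the result corresponds to one bar in a bar chart.
    """
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record[measure]
    return dict(totals)


# Hypothetical flight records; the field names are illustrative only.
flights = [
    {"airline": "X", "delay_min": 12.0},
    {"airline": "Y", "delay_min": 5.0},
    {"airline": "X", "delay_min": 3.0},
]
bars = aggregate_by(flights, "airline", "delay_min")
```

However many records the dataset holds, the chart only needs one mark per distinct value of the grouping dimension, which is why aggregation eases the visual scalability problem.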
The other fundamental issue that arises when working with extremely large datasets is data processing scalability. Specifically, the time it takes to execute a query against an extremely large dataset often exceeds the response time that a data analyst needs in order to explore data and derive observations efficiently. Developers of data visualization systems and tools have approached the issue of query responsiveness in a few different ways. One approach involves precomputing and storing partially aggregated data results, such that, at query time, the data visualization system can retrieve and assemble these partial answers quickly. However, this approach requires that the appropriate fields be selected for aggregation and optimization in advance, which means far more time and energy are expended in the planning stage. When the proper fields are not selected, the overall flexibility in how a data analyst goes about querying the data may be significantly reduced.
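As a rough sketch of the precomputation approach, and of its flexibility cost, the following Python builds partial aggregates over a fixed set of fields and answers group-by queries from them. The names `build_partial_aggregates` and `query_sum`, and the sample fields, are assumptions made for illustration and are not taken from any specific product.

```python
from collections import defaultdict


def build_partial_aggregates(records, fields, measure):
    """Precompute the sum and count of `measure` per combination of `fields`."""
    cells = defaultdict(lambda: [0.0, 0])
    for record in records:
        key = tuple(record[f] for f in fields)
        cells[key][0] += record[measure]
        cells[key][1] += 1
    return dict(cells)


def query_sum(cells, fields, group_field):
    """Assemble per-group sums at query time from the precomputed cells."""
    if group_field not in fields:
        # The field was not chosen in the planning stage, so the
        # precomputed cells cannot answer this query.
        raise KeyError(f"{group_field!r} was not precomputed")
    position = fields.index(group_field)
    result = defaultdict(float)
    for key, (total, _count) in cells.items():
        result[key[position]] += total
    return dict(result)


# Hypothetical records; only "airline" and "month" are precomputed.
records = [
    {"airline": "X", "month": 1, "origin": "SEA", "delay_min": 10.0},
    {"airline": "X", "month": 2, "origin": "SFO", "delay_min": 4.0},
    {"airline": "Y", "month": 1, "origin": "SEA", "delay_min": 6.0},
]
fields = ("airline", "month")
cells = build_partial_aggregates(records, fields, "delay_min")
by_airline = query_sum(cells, fields, "airline")
```

A query grouped on `origin` fails in this sketch because that field was not selected up front, which mirrors the reduced flexibility noted above.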
A second approach involves distributed computing. Specifically, certain data visualization systems and tools distribute a query across many network-connected computers, each of which processes the query against a subset of the large dataset. The final query result is then assembled from the partial results. However, this type of distributed system introduces network latencies, which can often run to seconds.
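The partition, process, and assemble pattern can be imitated on a single machine. The sketch below uses a local thread pool as a stand-in for network-connected computers, so it shows only the shape of the computation and none of the network latency; `distributed_sum` and `partial_sum` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(partition):
    """Work done by one worker against its subset of the dataset."""
    return sum(partition)


def distributed_sum(values, workers=4):
    """Split `values` into partitions, sum each one, then combine the results."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    # A local thread pool stands in for remote machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)


total = distributed_sum(list(range(1000)))
```

In a real deployment each call to `partial_sum` would cross the network, and the slowest worker would determine when the final result could be assembled.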
A third approach is generally referred to as Approximate Query Processing (AQP). AQP involves generating approximate data visualizations, as opposed to precise data visualizations, that are based on a representative subset (e.g., sample) of the dataset. AQP techniques trade accuracy or precision for speed or query responsiveness. As a simple example, with an AQP approach, the sum of a set of values might be approximated by computing the sum of a ten percent sample of the values and then estimating the true sum to be ten times the sample sum. This value is an estimate, and carries some uncertainty, which can be expressed as error bounds. Those bounds widen with the variance of the data, and narrow in inverse proportion to the square root of the sample size.
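The sum example above can be made concrete with a short Python sketch. It assumes simple random sampling without replacement and a normal approximation for the error bounds (z = 1.96 for roughly 95 percent confidence), and it omits the finite population correction; these choices, and the name `approximate_sum`, are illustrative assumptions rather than a description of any particular AQP system.

```python
import math
import random
import statistics


def approximate_sum(values, fraction=0.1, z=1.96, seed=0):
    """Estimate sum(values) from a random sample; return (estimate, margin).

    The margin is z * N * s / sqrt(n), where N is the dataset size,
    n is the sample size, and s is the sample standard deviation.
    """
    rng = random.Random(seed)
    n_total = len(values)
    n_sample = min(n_total, max(2, int(n_total * fraction)))
    sample = rng.sample(values, n_sample)
    # Scale the sample mean up to the full dataset size.
    estimate = n_total * statistics.fmean(sample)
    std_err = n_total * statistics.stdev(sample) / math.sqrt(n_sample)
    return estimate, z * std_err


values = [float(i) for i in range(10_000)]
estimate, margin = approximate_sum(values, fraction=0.1)
print(f"estimated sum: {estimate:.0f} +/- {margin:.0f}")
```

With `fraction=0.1` the estimate is ten times the sample sum, as in the text. Quadrupling the sample size roughly halves the margin, while data with a larger standard deviation widens it.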
Some AQP-based data visualization systems or tools create a sample of the data before the data analyst begins the analysis. In other systems, the sampling process might be integrated directly into the database management system. In general, a variety of different sampling and estimation techniques are known to work with AQP-based data visualization systems. These systems pick a sample and compute a result along with estimated error bounds. With some systems, the analyst may choose either a maximum amount of time that a query can execute, or desired error bounds. To ensure query responsiveness, AQP-based data visualization systems tend to use a time bound and return the best approximation that can be computed within it.
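A time-bounded, best-effort query might look like the following sketch, which processes randomly ordered batches until a time budget expires and then returns the running estimate. The function `time_bounded_mean`, its parameters, and the injectable `clock` are hypothetical and chosen only for illustration.

```python
import random
import time


def time_bounded_mean(values, time_budget_s, batch_size=1000, seed=0,
                      clock=time.perf_counter):
    """Return (mean_estimate, rows_used) computed within a time budget.

    At least one batch is always processed, so a tiny budget still
    yields an estimate.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random order makes any prefix a random sample
    deadline = clock() + time_budget_s
    total, count = 0.0, 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        total += sum(batch)
        count += len(batch)
        if clock() >= deadline:
            break  # budget spent: stop and report the best effort so far
    mean = total / count if count else float("nan")
    return mean, count


mean, rows_used = time_bounded_mean(range(1_000_000), time_budget_s=0.05)
```

Shuffling the whole dataset on every query would itself be too slow at scale, so a system following this pattern would more plausibly store the data in a pre-randomized order; the shuffle here only keeps the sketch self-contained.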