1. Field of the Invention
The present invention relates to methods and systems for the display of multi-dimensional data and, in particular, to methods and systems for dynamically determining and presenting appearance and spatial attribute values of entities of the multi-dimensional data over a sequence to assist in the recognition of patterns and trends within the data.
2. Background Information
One of the consequences of the increasing computerization and digitization of almost all human activities is the presence of vast quantities of complex data. While capturing multiple aspects of activities and phenomena is becoming easier, comprehending the data so acquired is increasingly challenging.
The process of comprehending data involves the reduction of the data by a human to a series of mental representations of the data, often fitting these representations into a pre-existing mental model. The mental model abstracting the data enables the data user to make decisions and take or avoid actions based on the model and the user's projection of its consequences. Such models necessarily simplify the data. The same data may support several different models based on differing presuppositions.
Abstract data is commonly used in business and technical pursuits. Such data consists of categories, rankings, and real-valued measurements gathered by people or machines. Standard methods have been developed for organizing, summarizing, and presenting such data, for example, tables, statistics, and graphics. Standard methods have also been developed for organizing the storage and retrieval of such data, such as hierarchical, relational, and object-oriented databases, as well as non-database methods such as “flat” files, spreadsheets, or other data structures.
Collecting, storing, and accessing data is only the beginning of the process of turning raw (abstract) data into valuable information. In any large-scale data collection process, it is inevitable that errors will occur. Data values may be corrupted or omitted. Entire records may be lost or duplicated. While some types of errors may be identified by routine data operations, other types of errors will only be found during analysis, necessitating correction of the database (or equivalent data structure) and reanalysis. Thus, managing the database that stores the raw data is an active process that requires continuing attention to maintain data quality.
Large databases are often considered as “high dimensional” data (or “highly dimensional” or “multi-dimensional” data). This terminology stems from the ability to plot real-valued data as scatter plots. Assigning one database variable to the abscissa (x-axis) of a graph and another database variable to the ordinate (y-axis) of the graph creates the simplest form of scatter plot. A dot is then drawn at the corresponding value pair for each database record. By extension, theoretically, each variable in a database could be assigned to a coordinate (dimension) of the graph and values plotted for all records. Such a representation is physically impossible to realize when the number of dimensions becomes large; however, it provides a mental framework for further operations on the database.
The terminology used for describing databases differs among analysts, but generally reflects the mental model of a “table” of data. Modern databases seldom consist of a single table of data values. However, in general, a view of the database contents can be created that appears to be a table of data values. The columns of the conceptual table of data may be called variables, fields, or attributes, generally reflecting different measurements or categories. The rows of the conceptual table may be called records, cases, or observations, generally reflecting instances of measurement or categorization.
Database management systems provide basic database operations, such as storage and retrieval of records based upon selection criteria or filters. Analysis software provides other more advanced types of database operations. Basic operations include entering, updating, deleting, and retrieving sets of data from the database. More advanced operations include creating new attributes by transforming original attributes or aggregating sets of attributes or records.
A combination of database operations and graphical display techniques is used to build an intuitive “feel” for data, examine how well putative models perform, identify database errors, and examine relationships among data subsets. Graphical representation tools are highly useful in maintaining and analyzing data. In existing systems, there are a number of graphical formats used to display data two attributes at a time. Three or more attributes are displayed as scatter plots, contour plots, and surface plots, among others. In these plots, time or another index attribute is displayed as one coordinate in a static display.
Even using the combination of complex database operations and graphical display techniques, it is difficult to gain an understanding of highly dimensional data because there are simply too many values to mentally track or plot. Thus, highly dimensional data is typically reduced using data transformations to a form that can be displayed with current graphical methods. For example, transforming data by aggregating across variables reduces data dimensionality by creating new variables that summarize several original variables. As a specific example, sales and expenses may be recorded in a corporate database, whereas the difference of the two representing a profit or loss may be more meaningful in a business model and can serve to reduce the amount of data being viewed.
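By way of illustration only, the sales and expenses example above can be sketched as follows. The record layout and the field names (`region`, `sales`, `expenses`, `profit`) are assumptions made for this sketch and are not drawn from any particular database.

```python
# Hypothetical records; the field names and values are illustrative only.
records = [
    {"region": "north", "sales": 120.0, "expenses": 95.0},
    {"region": "south", "sales": 80.0, "expenses": 90.0},
    {"region": "east", "sales": 150.0, "expenses": 110.0},
]


def aggregate_profit(rows):
    """Replace the two original variables with one derived variable.

    Each output row carries profit = sales - expenses, so two dimensions
    of the original data are summarized by one.
    """
    return [
        {"region": r["region"], "profit": r["sales"] - r["expenses"]}
        for r in rows
    ]


reduced = aggregate_profit(records)
for row in reduced:
    print(row["region"], row["profit"])
```

The reduced rows are easier to plot, but the individual sales and expense values can no longer be recovered from them.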
Also, in many data sets, a multitude of dependent variables can be transformed into fewer new independent variables that represent most of the information. Specifically, when each variable in a dataset conceptually corresponds to a dimension or coordinate of some highly dimensional space, it is implicitly assumed that all the variables are orthogonal (uncorrelated or independent). This is seldom true for most data sets. Statistical techniques such as “principal components” or “factor analysis” may be used to define new variables consisting of weighted linear aggregations of the original variables. These new variables are orthogonal, and relatively few of them are typically needed to capture most of the information in the data set.
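A minimal sketch of the principal components idea, restricted to two correlated variables so that the decomposition can be written in closed form, is given below. The data values and function names are assumptions made for illustration; a practical analysis would typically rely on a statistical package rather than this hand-written routine.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def principal_components_2d(xs, ys):
    """Return (variances, axes) of the two principal components, largest first."""
    a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    # Eigenvalues of the symmetric covariance matrix [[a, b], [b, c]].
    mean, half_gap = (a + c) / 2, math.hypot((a - c) / 2, b)
    variances = (mean + half_gap, mean - half_gap)
    # Angle of the first principal axis; the second axis is perpendicular.
    theta = 0.5 * math.atan2(2 * b, a - c)
    axes = (
        (math.cos(theta), math.sin(theta)),
        (-math.sin(theta), math.cos(theta)),
    )
    return variances, axes


def project(xs, ys, axis):
    """Weighted linear aggregation of the centered variables along one axis."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return [(x - mx) * axis[0] + (y - my) * axis[1] for x, y in zip(xs, ys)]


# Hypothetical, strongly correlated measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

variances, axes = principal_components_2d(xs, ys)
pc1 = project(xs, ys, axes[0])
pc2 = project(xs, ys, axes[1])
share = variances[0] / (variances[0] + variances[1])
print(f"first component carries {share:.1%} of the total variance")
```

In this sketch the two new variables are uncorrelated, and the first one alone accounts for nearly all of the variance, so the second could be dropped at the cost of a small loss of information.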
However, the greater the dimensionality reduction obtained through aggregation and other transformations using these tools, the more information is lost. The ability to characterize a data set is thus limited by the amount of reduction required to visualize the data using existing tools.
Numerous database management systems and statistical analysis packages are available to perform such transformation and analysis operations. Database management systems typically use a query language such as SQL (structured query language) to allow users to create subsets (views) of the data for analysis. Statistical software provides an interactive user interface for data manipulation and analysis. A user typically runs the statistical software to interface to the data management system to retrieve data subsets, which are then stored and analyzed in a proprietary format of the statistical software.
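As a hedged illustration of creating a subset (view) with SQL, the following sketch uses the SQLite engine bundled with Python. The table name `measurements`, the view name `recent`, and the column names are hypothetical and were chosen only for this example.

```python
import sqlite3

# In-memory database with a hypothetical table of measurements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (site TEXT, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("a", 2000, 1.0), ("a", 2001, 2.5), ("b", 2001, 4.0), ("c", 1999, 3.0)],
)

# A view is a named query that presents a subset of the data as a table.
conn.execute(
    "CREATE VIEW recent AS "
    "SELECT site, value FROM measurements WHERE year >= 2001"
)

# Analysis code then retrieves the subset as if it were an ordinary table.
rows = conn.execute("SELECT site, value FROM recent ORDER BY site").fetchall()
print(rows)
conn.close()
```

The retrieved rows would then be handed to separate analysis software, which is the hand-off between systems that the preceding paragraph describes.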
Once the data is organized and manipulated for viewing, it must still be analyzed to extract information. The major activities of the data analyst may be characterized by model fitting or data exploration. In model fitting, a predefined model exists and the data is used to calculate or pick the parameters of the model, for example, to predict outcomes. In data exploration, visual methods are often used to summarize the data with a goal of identifying an appropriate model. In practice, especially for highly dimensional data, both activities are performed iteratively with models being selected and fit and then discarded for a new model as understanding of the data set improves. The pace at which this process can proceed is limited by the availability of visualization tools that aid the data analyst in viewing the data.
The exponential growth of the computer processing power available for data modeling and graphical data exploration is used by current tools merely to display larger data sets faster. Thus, current data visualization software continues to have several limitations. One limitation is ease of using the software. The ease of use of visualization software is in part limited by the need for the user to pre-process data to align all attributes on the same index. For example, if a multivariate time series is to be visualized, the user must first ensure that all the times match so that data is available at each time index. If data is missing, the user is responsible for specifying how missing data is managed (e.g., deleting the associated record, replacement with mean value, etc.) and performing the operation before visualization can begin.
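The pre-processing burden described above can be sketched as follows. The attribute names (`temperature`, `pressure`), the sample values, and the helper functions are assumptions made for illustration; they show the two missing-data policies mentioned in the text, record deletion and mean replacement, which the user must apply before visualization can begin.

```python
def align(series_by_name):
    """Build one row per time index; attributes lacking a value get None."""
    times = sorted(set().union(*(s.keys() for s in series_by_name.values())))
    names = sorted(series_by_name)
    return [
        {"t": t, **{name: series_by_name[name].get(t) for name in names}}
        for t in times
    ]


def drop_incomplete(rows):
    """Policy 1: delete every record that has any missing value."""
    return [r for r in rows if None not in r.values()]


def fill_with_mean(rows, name):
    """Policy 2: replace missing values of one attribute with its mean."""
    present = [r[name] for r in rows if r[name] is not None]
    mean = sum(present) / len(present)
    return [{**r, name: mean if r[name] is None else r[name]} for r in rows]


# Hypothetical time series sampled at different times.
series = {
    "temperature": {0: 10.0, 1: 12.0, 2: 14.0},
    "pressure": {0: 1.0, 2: 3.0},  # no reading at time 1
}

aligned = align(series)
complete_only = drop_incomplete(aligned)
mean_filled = fill_with_mean(aligned, "pressure")
print(len(aligned), len(complete_only), mean_filled[1]["pressure"])
```

The two policies yield different data sets from the same input, which is why the choice is left to the user and must be made before any plot is drawn.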
In addition, currently available software is typically limited to interaction with static data sets. A selected data set can be displayed and the data points queried interactively (often called brushing in the literature). Indexed data (e.g., a time series) requires the set-up of the visualization followed by the generation of an animation. When viewed, the animation permits no interaction.
For example, some commercial analysis packages such as Statistica™ (StatSoft, Inc.), SAS™ (SAS Inc.), Splus™ (Insightful, Inc.), and others display changes to data through limited animation facilities that consist of creating a series of plots based on an index variable (which may be an attribute in the data set) and linking these static plots together into an animation. The animation is then viewed. The viewing and interaction with the display is done in a sequence of batch steps: creating a graphic view of the data, indexing this view on some derived variable or attribute, creating a sequence of views, and then viewing the sequence as an animation. If any changes are desired, the sequence is repeated, beginning with regenerating the views of the data.
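The batch character of this workflow can be sketched in a few lines. This is an illustrative model only and does not reproduce the facilities of any of the packages named above; the function `build_frames` and the record fields are hypothetical.

```python
from collections import defaultdict


def build_frames(records, index_key, x_key, y_key):
    """Group records by index value; each frame is a static list of (x, y) points."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r[index_key]].append((r[x_key], r[y_key]))
    # Frames are ordered by index value, ready to be shown one after another.
    return [(idx, grouped[idx]) for idx in sorted(grouped)]


# Hypothetical indexed records.
records = [
    {"year": 2001, "sales": 5.0, "expenses": 4.0, "staff": 10},
    {"year": 2000, "sales": 3.0, "expenses": 2.5, "staff": 8},
    {"year": 2001, "sales": 6.0, "expenses": 5.5, "staff": 12},
]

# Batch step: every frame is generated before anything is viewed.
frames = build_frames(records, "year", "sales", "expenses")

# Any change, such as plotting staff instead of expenses, means
# regenerating the whole sequence from the data.
frames_changed = build_frames(records, "year", "sales", "staff")
print(frames)
```

Because each frame is fixed once generated, nothing in the sequence responds to the viewer during playback.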
A major interest in sequential or indexed data is how the relationships among attributes for a set of categories change with the index. In current practice, an animated visualization shows only the current values of changing relationships. Thus, to attempt to see changes over the index values, the user typically views the animation as a continuous loop in the hope that repetition will illuminate the details.