In the last several years, large databases have become common in a variety of applications. Corporations are creating large data warehouses of historical data on key aspects of their operations. Corporations are also creating many small databases using desktop applications that are created to examine some specific aspect of their business. International research projects such as the Human Genome Project and the Sloan Digital Sky Survey are generating massive scientific databases. One challenge with these databases is the extraction of meaning from the data they contain: to discover structure, find patterns, and derive causal relationships. The sheer size of these data sets complicates this task. Interactive calculations that require visiting each record are not plausible. It is also not feasible for an analyst to reason about or view the entire data set at its forest level of detail. Moreover, even when the data sets are small, their complexity often makes it difficult to glean meaning without applying aggregations or creating simplifying summaries.
Imposing meaningful hierarchical structure on databases provides levels of abstraction that can be leveraged by both the computer and the analyst. These hierarchies can come from several different sources. Some hierarchies are provided by the inherent nature of the database. Data mining algorithms, such as decision trees and clustering techniques that classify the data and thereby automatically derive hierarchies can be used to determine database hierarchy. Part of the analysis task when dealing with automatically generated hierarchies is in understanding and trusting the results. See, for example, 2001, Thearling et al., “Visualizing Data Mining Models” in Information Visualization in Data Mining and Knowledge Discovery, Fayyad, Grinstein and Wierse eds., Morgan Kaufman, which is hereby incorporated by reference in its entirety.
FIG. 9 illustrates the hierarchy for a time dimension. Within a time dimension, there are four levels. They are “All”, year, quarter, and month. Simple hierarchies, like the one shown in FIG. 9, are commonly modeled using a star schema. The entire dimension hierarchy is represented by a single dimension table joined to the base fact table. In this type of hierarchy, there is only one path of aggregation. However, there are more complex dimension hierarchies where the aggregation path can branch. For example, a time dimension might aggregate from “day” to both week and “month”.
To provide another illustration of the concept of a star schema, consider the case in which one wishes to analyze monthly total product sales for a department store by breaking down the data by region and store. Raw data can come in the form of product managers' (FIG. 1) and regional managers' (FIG. 2) quarterly sales reports. Once the data has been collected and refined, it may reside in a large base table. In addition, there may be adjunct lookup tables. A star schema for this base data is shown in FIG. 3. The table schema of FIG. 3 is called a star schema because the central fact table is depicted as surrounded by each of the dimension tables that describe each dimension. In this example, the base sales data table is the fact table and each lookup table is a dimension table.
The stores, weeks, and products columns in the fact table in FIG. 3 contain numeric values. Fact tables can grow to huge numbers of rows. The lookup tables contain hierarchy information relating each store, week, and product with its higher-level aggregations. For example store 1 in the base table of FIG. 3 connects with the “Store Lookup” table where it has the name Ridgewood and rolls up to the Northeast region. Product 2 in the base table connects with the “Product Lookup” table where it has the name olive oil soap and rolls up into the product type soap in the skin care products group. Thomsen, 1997, OLAP Solutions: Building Multidimensional Information Systems, Wiley Computer Publishing, New York, which is hereby incorporated by reference in its entirety.
The most common schemata found in databases are the star schema and snowflake schema. Each schema has a fact table containing data items of interest (measures) in the analysis for which the database is built. These data items might be transaction amounts such as the amount invested in a mutual fund or the profit on a sales transaction. The fact table is surrounded by dimension tables containing detailed information used to summarize the fact table in different ways. An illustration of a star schema has been provided (FIG. 3). FIG. 4 illustrates a snowflake schema that includes hierarchy. The snowflake and star schema provide a conceptual multidimensional view of the database. The database is a core set of measures characterized by a number of dimensions rather than a set of interrelated tables. This organization correlates directly with the typical analysis query that summarizes a few quantitative attributes (or measures) such as profit or sales by several characterizing attributes (or dimensions) such as product, location, or date over a large number of triples. The primary differences between the star and snowflake schema arise in how they model hierarchical structures on the dimensions.
When referring to values within a dimension hierarchy, a dotted notation can be used to specify a specific path from the root level “All” (FIG. 9) of the hierarchy down to the specified value. Specifically, to refer to a value on level m of a hierarchy, the dimension name is first optionally listed, then zero or more of the (m−1) intermediate ancestor values, and then finally the value on the mth level, all separated by periods. For example, the Jan node on the Month level in the time hierarchy that corresponds to January, 1998, can be referred to as 1998.Qtrl Jan. When this notation is used, the reference is called a qualified value. When a value is simply described by its node value (without any path to the root node) the reference is called an unqualified value.
3.1 Types of Databases
One form of database is a relational warehouse, such as a structured query language (SQL) database. Relational warehouses organize data into tables. Each row in a table corresponds to a basic entity or fact and each column represents a property of that entity. See, for example, Thomsen, 1997, OLAP Solutions: Building Multidimensional Information Systems, Wiley Computer Publishing, New York. For example, a table may represent transactions in a bank, where each row corresponds to a single transaction. As such, each transaction may have multiple properties, such as the transaction amount, the account balance, the bank branch, and the customer. As used herein, a row in a table is referred to as a tuple or record, and a column in the relation is referred to as a field. Such tables are also referred to as relations. As such, a relation is defined as a database table that contains a set of tuples.
It is possible to create dimension tables and star schemas in relational warehouses. A single relational warehouse will contain many heterogeneous but interrelated tables. The fields (columns) within a table can be partitioned into two types: dimensions and measures. Dimensions and measures are similar to independent and dependent variables in traditional analysis. For example; the bank branch and the customer would be dimensions, while the account balance would be a measure.
Multidimensional databases are structured as n-dimensional data cubes. Each dimension in the data cube corresponds to one dimension in the relational schema (e.g., in the star schema, snowflake schema etc.). Each cell in the data cube contains all the measures in the relational schema corresponding to a unique combination of values for each dimension. The dimensions within a data cube are often augmented with a hierarchical structure. This hierarchical structure can be derived from the semantic levels of detail within the dimension or generated from classification algorithms. Using these hierarchies, the analyst can explore and analyze the data cube at multiple meaningful levels of aggregation calculated from a base fact table (e.g., a relation in the database with the raw data). Each cell in the data cube now corresponds to the measures of the base fact table aggregated to the proper level of detail.
The aggregation levels are determined from the hierarchical dimensions. Each dimension is structured as a tree with multiple levels. Each level corresponds to a different semantic level of detail for that dimension. Within each level of the tree there are many nodes. Each node corresponds to a value within the domain of the level of detail that the node is in. The tree forms a set of parent-child relationships between the domain values at each level of detail.
3.2 Data Exploration of Databases
Visualization is a powerful tool for exploring large data, both by itself and coupled with data mining algorithms. However, the task of effectively visualizing large databases imposes significant demands on the human-computer interface to the visualization system. The exploratory process is one of hypothesis, experiment, and discovery. The path of exploration is unpredictable, and analysts need to be able to easily change both the data being displayed and its visual representation. Furthermore, the analyst must be able to first reason about the data at a high level of abstraction, and then rapidly drill down to explore data of interest at a greater level of detail. Thus, the interface must expose the underlying hierarchical structure of the data and support rapid refinement of the visualization.
One tool known in the art is Polaris. See, for example, Stolte, Tang, and Hanrahan, 2002, IEEE Transactions on Visualization and Computer Graphics 8. Polaris is built upon an algebraic formalism for constructing visualizations of relations. The state of the user interface is a visual specification. This specification is interpreted according to the formalism to determine both the series of queries necessary to retrieve the requested data, as well as the mapping and layout of the resulting tuples (rows of data in the database) into graphical marks. However, the original form of Polaris does not make use of the structure of hierarchically structured dimensions that are found in a hierarchical database. Therefore, the original form of Polaris does not provide a satisfactory means for the exploratory analysis of databases having a hierarchical structure.
Other known art in the field can be broken down into two categories (i) the visual exploration of databases and (ii) the use of data visualization in conjunction with data mining algorithms. These two categories will be considered in turn.
3.2.1 Visual Data Exploration of Databases
Visual query tools such as VQE (Merthick et al., 1997, “An Interactive) Visualization Environment for Data Exploration,” Proc. of Knowledge Discovery in Databases, p. 2-9), Visage (Roth et al. 1996, “Visage: A User Interface Environment for Exploring Information”. in Proceedings of Information Visualization, p. 3-12), DEVise Livny et al, 1997, “DEVise: Integrated Querying and Visual Exploration of Large Datasets” in Proc. of ACM SIGMOD), and Tioga-2 (Woodruff et al, 2001, Journal of Visual Languages and Computing, Special Issue on Visual Languages for End-user and Domain-Specific Programming 12, p. 551-571) have focused on building visualization tools that directly support interactive database exploration through visual queries. Users can construct queries and visualizations directly through their interactions with the interface. These systems have flexible mechanisms for mapping gum results to graphs and support mapping database tuples to retinal properties of the marks in the graphs. Of these systems, only Tioga-2 provides built-in support for interactively navigating through and exploring data at different levels of detail. However, the underlying hierarchical structure must be created by the analyst during the visualization process. These visual query tools do not leverage the hierarchical structure that is already encoded in the database. Because of this drawback, VQE, Visage, DEVise, and Tioga-2 are not satisfactory tools for facilitating exploratory analysis of databases having a hierarchical structure.
Tools such as XrndvTool (Ward, 1994, “XmdvTool: Integrating multiple methods for visualizing multi-variate data,” Proceedings of MER Visualization, pp. 326-336), Spotfire (BioNorth, Ottawa, Ontario, Canada, November, 2002) and Xgobi (Buja et al., 1996, Journal of Computational and Graphical Statistics 5, p. 78-99) provide the analyst with a set of predefined visualizations such as scatterplots and parallel coordinates. These systems are augmented with extensive interaction techniques (e.g., brushing and zooming) that can be used to refine the queries. However, such methods do not provide tools to interactively construct and refine a wide range of displays to suit the analysis process. Furthermore, of these systems, only XmdvTool supports the exploration of hierarchically structured data. XmdvTool has been augmented with structure-based brushes (see Fual et al, Proc. of Information Visualization, October 1999, pp. 58-64) that allow the user to control the display's global level of detail (based on a hierarchical clustering of the data) and to brush records based on their proximity within the hierarchical structure. However, such an approach limits the user, in this case to viewing a single hierarchical structuring of the data and a single ordering of that hierarchy to make proximity meaningful. For this reason, XmdvTool, Spotfire, and Xgobi are not satisfactory tools for facilitating exploratory analysis of databases having a hierarchical structure.
Another known visualization system, VisDB (Keim and Kriegel, 1994, IEEE Computer Graphics and Applications 14, p. 40-49) focuses on displaying as many tuples (rows of data) as possible to provide feedback as users refine their queries. This system also displays tuples that do not satisfy the query, indicating their “distance” from the query criteria using spatial encodings and color. This approach helps the user avoid missing important data points that fall just outside of the selected query parameters. However, VisDB fails to take advantage of the hierarchical structure of databases. For example, VisDB does not provide an extensive ability to drill down and roll up data, thereby allowing the analyst to get a complete overview of the data set before focusing on detailed portions of the database. For this reason VisDB is not a satisfactory tool for facilitating exploratory analysis of databases having a hierarchical structure.
3.2.2 Visualization and Data Mining
Many research and commercial systems use visualization in conjunction with automated data mining algorithms. One common application of visualization together with data mining is in helping analysts understand models generated by the data mining process. For example, several researchers have developed techniques specifically for displaying decision trees, Bayesian classifiers, and decision table classifiers (Becker, 1998, Proc. of Information Visualization, p. 102-105), and these visualization techniques have been incorporated into products such as SGrs MineSet (Brunk et al., “MineSet: an integrated system for data mining,” Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, p. 135-138).
Other approaches to coupling visualization and data mining have traditionally been employed within focused domains. One approach is to use visualization to gain an initial understanding of a database and then apply algorithmic analysis to the identified areas of interest. See, for example, Kohavi, “Data Mining and Visualization,” Frontiers of Engineering. Reports on Leading-Edge Engineering from the 2000 NAB symposium on Frontiers of Engineering, National Academy Press, 2001 as well as Therling et al., 2001, “Visualizing Data Mining Models,” Information Visualization in Data Mining and Knowledge Discovery, Fayyad, Frinstein, and Wierse; eds., Morgan Kaufman. The other major approach is to use data mining to compress the size and dimensionality of the data and then use focused visualization tools to explore the results. See, for example, Healey, 1998, Proc. Graphics Interface, pp. 177-184 as well as Welling and Derthick, 2000, “Visualization of Large Multi-dimensional datasets,” Proceedings of Virtual Observatories of the Future.
The drawback with the approaches described in this section is that they are focused on a particular algorithm or a single phase of the discovery process. For these reasons, known visualization and data mining tools do not provide a satisfactory way to explore and analyze databases that have a hierarchical structure.
3.2.3 Table Based Displays
Another area of related work is visualization systems that use table-based displays. Table displays such as scatterplot matrices (Hartigan, Journal of Statistical Computation and Simulation, 4, pp. 187-213) and Trellis displays (Becker, Displays: A Multi-Dimensional Data Visualization Tool for Data Mining, Third Annual Conference on Knowledge Discovery in Databases, August 1997) have been used extensively in statistical data analysis. However, the drawback of such visualization systems is that they present static graphics that the user cannot interact with in order to refine database queries or otherwise explore database content.
Interactive table displays have also been developed. Pivot tables allow analysts to explore different projections of large multi-dimensional datasets by interactively specifying assignments of fields to the table axes. However, pivot tables are limited to text-based displays.
The Table Lens (Rao and Card, The Table Lens: Merging Graphical and Symbolic Representations in an Interactive Focus+Context Visualization for Tabular Information, In Proc. of SIGCHI 1994, pp. 318-322) and FOCUS (Spenke et al. FOCUS: The Interactive Table for Product Comparison and Selection. In Proc. of the ACM Symposium on User Interface Software and Technology, November 1996) visualization systems provide table displays that present data in a relational table view, using simple graphics in the cells to communicate quantitative values. However, the Table Lens does not support queries. In addition, FOCUS is limited to object-attribute tables that do not have hierarchical structure.
3.3 Formal Graphical Presentations
In addition to various software programs, the known art further provides formal graphical presentations. Bertin's Semiology of Graphics, 1983, University of Wisconsin Press, Madison Wis. is an early attempt at formalizing graphic techniques. Bertin developed a vocabulary for describing data and the techniques for encoding the data into a graphic. Bertin identified the retinal variables (position, color, size, etc.) in which data can be encoded. Cleveland (The Elements of Graphing Data, 1985, Wadsworth Advanced Books and Software, Pacific Grove, Calif.; Visualizing Data, 1993, Hobart Press) used theoretical and experimental results to determine how well people can use these different retinal properties to compare quantitative variations.
Mackinlay's APT system (ACM Trans. Graphics, pp. 110-141, April 1986) is one of the first applications of formal graphical specifications to computer generated displays. APT uses a set of graphical languages and composition rules to automatically generate two-dimensional displays of relational data. The Sage system (Roth et al., 1994, Proc. SIGCHI '94, pp. 112-117) extends the concepts of APT, providing a richer set of data characterizations and generating a wider range of displays.
Livny et al. (Proc. ACM SIGMOD, May 1997) describe a visualization model that provides a foundation for database-style processing of visual queries. Within this model, the relational queries and graphical mapping necessary to generate visualizations are defined by a set of relational operators. The Rivet visualization environment (Bosch et al., 2000, Computer Graphics, pp. 68-73) applies similar concepts to provide a flexible database visualization tool. Wilkinson (The Grammer of Graphics, New York, Springer, 1999; U.S. Pat. No. 6,492,989) has developed a language for describing traditional statistical graphs. Further, Wilkinson proposes an interface for generating a subset of the specifications expressible within his language.
The drawback with these known formal graphical specifications is that they do not provide any tools for generating a database query. Furthermore, Bertin's work is purely theoretical and was never implemented as a computer program. APT assumes a given database structure and automatically generates a graphic with no user involvement or support for user involvement. As such, these known formal graphical specifications do not provide a satisfactory way to analyze databases.
3.4 State of the Known Art
Programs used to visually explore databases have been described. From this survey, it is apparent that known visualization and data mining tools do not provide a satisfactory way to explore and analyze databases that have a hierarchical structure. Thus, given the above background, what is needed in the art is an interactive visual exploration tool that facilitates exploratory analysis of databases having a hierarchical structure.