In the last several years, large databases have become common in a variety of applications. Corporations are creating large data warehouses of historical data on key aspects of their operations. Corporations are also creating many small databases using desktop applications that are created to examine some specific aspect of their business. International research projects such as the Human Genome Project and the Sloan Digital Sky Survey are generating massive scientific databases. One challenge with these databases is the extraction of meaning from the data they contain: to discover structure, find patterns, and derive causal relationships. The sheer size of these data sets complicates this task. Interactive calculations that require visiting each record are not plausible. It is also not feasible for an analyst to reason about or view the entire data set at its forest level of detail. Moreover, even when the data sets are small, their complexity often makes it difficult to glean meaning without applying aggregations or creating simplifying summaries.
Imposing meaningful hierarchical structure on databases provides levels of abstraction that can be leveraged by both the computer and the analyst. These hierarchies can come from several different sources. Some hierarchies are provided by the inherent nature of the database. Data mining algorithms, such as decision trees and clustering techniques that classify the data and thereby automatically derive hierarchies can be used to determine database hierarchy. Part of the analysis task when dealing with automatically generated hierarchies is in understanding and trusting the results. See, for example, 2001, Thearling et al., “Visualizing Data Mining Models” in Information Visualization in Data Mining and Knowledge Discovery, Fayyad, Grinstein and Wierse eds., Morgan Kaufman, which is hereby incorporated by reference in its entirety.
FIG. 9 illustrates the hierarchy for a time dimension. Within a time dimension, there are four levels. They are “All”, year, quarter, and month. Simple hierarchies, like the one shown in FIG. 9, are commonly modeled using a star schema. The entire dimension hierarchy is represented by a single dimension table joined to the base fact table. In this type of hierarchy, there is only one path of aggregation. However, there are more complex dimension hierarchies where the aggregation path can branch. For example, a time dimension might aggregate from “day” to both week and “month”.
To provide another illustration of the concept of a star schema, consider the case in which one wishes to analyze monthly total product sales for a department store by breaking down the data by region and store. Raw data can come in the form of product managers' (FIG. 1) and regional managers' (FIG. 2) quarterly sales reports. Once the data has been collected and refined, it may reside in a large base table. In addition, there may be adjunct lookup tables. A star schema for this base data is shown in FIG. 3. The table schema of FIG. 3 is called a star schema because the central fact table is depicted as surrounded by each of the dimension tables that describe each dimension. In this example, the base sales data table is the fact table and each lookup table is a dimension table.
The stores, weeks, and products columns in the fact table in FIG. 3 contain numeric values. Fact tables can grow to huge numbers of rows. The lookup tables contain hierarchy information relating each store, week, and product with its higher-level aggregations. For example store 1 in the base table of FIG. 3 connects with the “Store Lookup” table where it has the name Ridgewood and rolls up to the Northeast region. Product 2 in the base table connects with the “Product Lookup” table where it has the name olive oil soap and rolls up into the product type soap in the skin care products group. Thomsen, 1997, OLAP Solutions: Building Multidimensional Information Systems, Wiley Computer Publishing, New York, which is hereby incorporated by reference in its entirety.
The most common schemata found in databases are the star schema and snowflake schema. Each schema has a fact table containing data items of interest (measures) in the analysis for which the database is built. These data items might be transaction amounts such as the amount invested in a mutual fund or the profit on a sales transaction. The fact table is surrounded by dimension tables containing detailed information used to summarize the fact table in different ways. An illustration of a star schema has been provided (FIG. 3). FIG. 4 illustrates a snowflake schema that includes hierarchy. The snowflake and star schema provide a conceptual multidimensional view of the database. The database is a core set of measures characterized by a number of dimensions rather than a set of interrelated tables. This organization correlates directly with the typical analysis query that summarizes a few quantitative attributes (or measures) such as profit or sales by several characterizing attributes (or dimensions) such as product, location, or date over a large number of triples. The primary differences between the star and snowflake schema arise in how they model hierarchical structures on the dimensions.
When referring to values within a dimension hierarchy, a dotted notation can be used to specify a specific path from the root level “All” (FIG. 9) of the hierarchy down to the specified value. Specifically, to refer to a value on level m of a hierarchy, the dimension name is first optionally listed, then zero or more of the (m−1) intermediate ancestor values, and then finally the value on the mth level, all separated by periods. For example, the Jan node on the Month level in the time hierarchy that corresponds to January, 1998, can be referred to as 1998. Qtrl Jan. When this notation is used, the reference is called a qualified value. When a value is simply described by its node value (without any path to the root node) the reference is called an unqualified value.