The present invention relates to database systems and, more particularly, to techniques for handling function-defined hierarchical dimensions.
In a database management system (DBMS), data is stored in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
Systems that implement the present invention are not limited to any particular type of data container or database architecture. However, for the purpose of explanation, the examples and the terminology used herein shall be that typically associated with relational databases. Thus, the terms xe2x80x9ctablexe2x80x9d, xe2x80x9crowxe2x80x9d and xe2x80x9ccolumnxe2x80x9d shall be used herein to refer respectively to the data container, record, and field.
Computer database systems that are used for data warehousing frequently store pre-computed summary information in summary tables in order to speed up query processing. The data from which the summary tables are generated are referred to as base data. The tables that contain the base data are referred to as base tables. Summary tables typically store aggregated information, such as xe2x80x9csum of PRODUCT_SALES, by region, by month.xe2x80x9d Other examples of aggregated information include counts of tally totals, minimum values, maximum values, and average calculations.
In the context of database systems, a xe2x80x9cdimensionxe2x80x9d is a list of values that provide categories for data. A dimension acts as an index for identifying values of a variable. For example, if sales data has a separate sales figure for each month, then the data has a MONTH dimension. That is, the data is organized by month.
Dimensions may have many levels of granularity, where a hierarchical relationship exists between members of the various levels. For example, one dimension may be xe2x80x9cgeographyxe2x80x9d, where geography has the following levels: city, state, region, country. Of these levels, xe2x80x9ccityxe2x80x9d has the finest granularity, while xe2x80x9ccountryxe2x80x9d has the coarsest. Each level of a hierarchical dimension is associated with a set of values. For example, the xe2x80x9ccityxe2x80x9d level of the geography dimension may be associated with values xe2x80x9cNYCxe2x80x9d, xe2x80x9cLAxe2x80x9d, and xe2x80x9cSFxe2x80x9d, each of which represents a city. The values associated with a given hierarchical level are referred to as the xe2x80x9cgranulesxe2x80x9d of that level. The values associated with the level of finest granularity are referred to as base granules.
In analytical applications, the data stored in tables of a data warehouse is generally aggregated along dimensions and hierarchies. A typical example is the summation of dollar sales by time, where time is a dimension which includes different levels such as day, week, month, quarter, and year. The time granules occur at different levels of data abstraction and form a hierarchy.
Typically, an analytical application asks for an aggregated result up to some level of data abstraction. For example, the application may ask for sum-of-sales by week, where all sale amounts for each week are summed into a single aggregated value. Other examples are sum-of-sales by month, sum-of-sales by year, etc.
Because of the hierarchical relationship, if sum-of-sales by month is already computed, then sum-of-sales by year can be simply computed by summing the monthly sum-of-sales. The technique of deriving values for a coarser level in the hierarchy based on values associated with a finer level in the hierarchy is known as xe2x80x9crolling upxe2x80x9d the values. For example, the monthly sum-of-sales may be rolled up into yearly sum-of-sales.
In a data warehouse, a dimension such as xe2x80x9ctimexe2x80x9d, is usually broken out into different granules, where a different table column is used to store the granules for each of the hierarchical levels. The granules are hierarchically related to each other. For example, cal_day xe2x86x92 cal_month xe2x86x92 cal_quarter xe2x86x92 cal_year represents a calendar hierarchy in a time dimension. Another example, store xe2x86x92 city xe2x86x92 state xe2x86x92 region xe2x86x92 country represents a geographic hierarchy in a store dimension.
A hierarchy, such as the calendar hierarchy, can be declared as follows:
This declaration includes various LEVEL statements that identify the levels of the hierarchy. From the perspective of the database server, the LEVEL statements identify the various columns that should be created for the time_dim dimension table. The declaration further includes a HIERARCHY section that identifies the hierarchical relationships between the various the levels. Once the hierarchy xe2x80x9ccalendar _rollupxe2x80x9d is declared, a database server can use this information to determine which levels can be rolled up to other levels. This information is used by the server when it rewrites queries in terms of materialized views.
A materialized view is a table where the pre-computed data corresponding to a materialized view definition is stored. For example, a materialized view xe2x80x9cmv1xe2x80x9d may be defined as follows:
CREATE MATERIALIZED VIEW mv1
AS
SELECT t.month, t.quarter, t.year,
sum(f.dollar_sales) as month_sales
FROM fact_tab f, time_tab t
WHERE f.time_key=3D t.time_key
GROUP BY t.month, t.quarter, t.year;
Once a materialized view has been created, a database server may answer some queries that are issued against the base table with data from the materialized view. For example, a user may submit the following query (Q1) that asks for sum-of-sales by month:
Query Q1:
SELECT t.month, SUM(f.dollar_sales) as sum_sales
FROM fact_tab f, time_tab t
WHERE f.time_key=3D t.time_key
GROUP BY t.month;
Query Q1 requests a join to be performed between two base tables: fact_tab and time_tab. Scanning the based tables and performing the join operation may involve a significant amount of overhead. To avoid this overhead, the server can rewrite query Q1 to access data from mv1, instead of performing a join of fact_tab and time_tab. In the present example, query Q1 maybe rewritten as:
Rewritten query Q1xe2x80x2:
SELECT mv1.month, mv1.month_sales as sum_sales
FROM mv1;
The rewritten query Q1xe2x80x2 produces the same results as the original query Q1, but requests the retrieval of data from mv1 rather than from a join between fact_tab and time_tab. Consequently, the amount of overhead required to process the rewritten query Q1xe2x80x2 may be several orders of magnitude less than the overhead required to process the original query Q1.
In the example given above, query Q1xe2x80x2 requires sales to be summed by month. Conveniently, materialized view mv1 contains data that has already been summed by month. Consequently, rewritten query Q1xe2x80x2 does not involve any additional aggregation. However, when the hierarchical relationship between the levels of a dimension are known, queries that require aggregation at a relatively coarser level of a dimension can be rewritten to access a materialized view that stores data that has been aggregated at a relatively finer level of that dimension.
For example, suppose the user submits a query (Q2) that asks for sum-of-sales by year. Knowing the calendar _rollup hierarchy declared in time_dim, the server can still rewrite Q2 to access data from mv1 instead of performing a join of fact_tab and time_tab.
Query Q2:
SELECT t.year, SUM(f.dollar_sales) as sum_sales
FROM fact_tab f, time_tab t
WHERE f.time_key=3D t.time_key
GROUP BY t.year;
To access mv1, query Q2 may be rewritten as:
Rewritten query Q2xe2x80x2:
SELECT mv1.year, SUM(mv1.month_sales) as sum_sales
FROM mv1
GROUP BY mv1.year;
In the rewritten query Q2xe2x80x2, the month_sales values stored in mv1 are summed up (rolled up) into yearly sum-of-sales. The server knows that yearly sales can be derived from monthly sum_of_sales because of the hierarchy information stored in the dimension declaration.
As explained above, one mechanism for defining a hierarchical dimension, the various levels thereof, and the relationships between the levels, is through a xe2x80x9cCREATE DIMENSIONxe2x80x9d statement that has the following form:
The above statement, which creates a dimension called time_dim, declares hierarchical relationships between different columns of a dimension table time_tab. Specifically, the levels cal_day, cal_month, cal_quarter, and cal_year are respectively associated with columns time_tab.day, time_tab.month, time_tab.quarter, and time_tab.year of the time_tab dimension table. FIG. 1A illustrates a time_tab dimension table that may be created in response to such a declaration.
The dimension table 102 embeds the hierarchical relationships between granules in the various levels of the dimension. For example, row 108 indicates a mapping between the day granule xe2x80x9c3xe2x80x9d, the month granule xe2x80x9cm5xe2x80x9d, the quarter granule xe2x80x9cq2xe2x80x9d and the year granule xe2x80x9c1988xe2x80x9d. When a hierarchical dimension is represented by a dimension table that has one column for each hierarchical level, rolling up aggregate values from a finer level to a coarser level may merely involve a re-aggregation of aggregated values using the appropriate column at the coarser level, or a join between the materialized view and the dimension table followed by a re-aggregation of aggregated values using the coarser level column.
For example, the materialized view may be created with a column for each level of the hierarchy, and the roll-up may be performed by aggregating based on groups associated with the appropriate column. For example, mv1 was created with a year column, so rolling values in mv1 up to the year level may be performed based on the year column of mv1.
Alternatively, if columns at coarser levels of the hierarchy, such as quarter and year, are not stored in mv1, the database server may join mv1 to time_tab 102 to determine the year value associated with each of the rows in mv1, and aggregate the dollar_sales value of all rows of mv1 that correspond to the same year value.
Another technique for defining a hierarchical dimension involves supplying functions which, when applied to granules at finer levels of a hierarchical dimension, produce the corresponding granules for coarser levels of the hierarchical dimension. For example, rather than have a dimension table (such as time_tab 102) to indicate the hierarchical relationship between granules in the various levels of the time dimension, various user functions may be provided.
For example, a time dimension can be stored in a single column rather than in a set of columns. A base granule in its encoded form is stored in a single column, and a set of user functions are provided to derive coarser granules from the base granule. FIG. 1B illustrates an example of such dimension table (dim_tab2120), where encoded based granules for the time dimension are stored in a single xe2x80x9csale_datexe2x80x9d column. A user function to_day converts the base granule sale_date into the corresponding calendar day, another user function to_month converts sale_date into corresponding calendar month, etc. The relationship between the levels of the time dimension hierarchy may be represented as: to_day(sale_date) xe2x86x92 to_month(sale_date) xe2x86x92 to_quarter(sale_date) xe2x86x92 to_year(sale_date). A hierarchy defined in this fashion is referred to herein as a xe2x80x9cfunction-defined hierarchyxe2x80x9d.
It is possible to derive coarser granules from base granules when the base granules are encoded to store information about the corresponding coarser granules. Example of situations in which coarser granules may be derived from base granules include:
(1) a base granule based on the standard ANSI SQL datetime datatype which encodes information about time-of-day, day, week, month, quarter, year, century, etc.
(2) a base granule that is an object class encapsulating information about all coarser granules with corresponding methods to extract coarser granules from the base class, and
(3) a base granule that is a user-defined datatype with corresponding user-defined functions that retrieve or compute coarser granules from the user-defined base granule.
When a hierarchy level is associated with a function, the hierarchy level does not correspond to a column of a dimension table. In fact, a dimension table need not even exist for a dimension whose hierarchy levels are defined by function. Thus, while columns xe2x80x9cmonthxe2x80x9d and xe2x80x9cyearxe2x80x9d of time_tab 102 are used to determine which year corresponds to a particular month value for a time dimension without a function-defined hierarchy, the function to_year is used to determine which year corresponds to a particular sale_date value in a time dimension that has a function-defined hierarchy.
Just as with hierarchical dimensions that are not function-defined, it is possible to create summary tables that store pre-computed aggregate data that has been aggregated up to a particular level of a function-defined hierarchy. For example, the following statement defines a materialized view mv2 in terms of the base granule xe2x80x9csale_datexe2x80x9d and user-defined functions:
CREATE MATERIALIZED VIEW mv2
AS
SELECT to_month(t.sale_date) as month,
to_quarter(t.sale_date) as quarter,
to_year(t.sale_date) as year,
sum(f.dollar_sales) as month_sales
FROM fact_tab f, time_tab2 t
WHERE f.time_key=t.time_key
GROUP BY to_month(t.sale_date),
to_quarter(t.sale_date),
to_year(t.sale_date);
FIG. 2 is a block diagram that illustrates a materialized view mv2 that may have been created in response to this statement.
Just as with hierarchical dimensions that are not function-defined, it is desirable to be able to satisfy some queries that are directed to fact_tab and dim_tab2 with data from mv2, rather than by scanning and joining the base tables. For example, a user may submit a query (Q3) that asks for sum-of-sales by month by using the user-defined function to_month, as follows:
Query Q3:
SELECT to_month(t.sale_date) as month,
SUM(f.dollar_sales) as sum_sales
FROM fact_tab f, time_tab2 t
WHERE f.time_key=3D t.time_key
GROUP BY to_month(t.sale_date);
Satisfying query Q3 from the base tables fact_tab and time_tab2 incurs the potentially enormous overhead associated with scanning and joining the two tables. So Q3 can be rewritten in terms of mv2 to cut down the overhead based on the fact that the GROUP BY clause of Q3 is a proper subset of the GROUP BY clause of mv2.
However, if the GROUP BY clause of Q3 was to_year(sale_date) and the GROUP BY clause of mv2 had to_month(sale_date), then the Q3 cannot be rewritten using mv2 unless a function-defined hierarchy declares that it is possible to derive year granules from month granules.
Based on the foregoing, it is desirable to provide a technique that allows queries that are not directed to a materialized view to be answered with data from the materialized view even when the queries require aggregation across a hierarchical dimension that has a function-defined hierarchy.
Techniques are provided to allow function-defined hierarchies to be registered with a database server. The information provided to the server during the registration process is used by the server to determine how to roll up data that has been aggregated at one level of a function-defined hierarchy to another level of the function-defined hierarchy. Techniques are also provided to perform rollup from one level of a function-defined hierarchy to another level of the function-defined hierarchy on data stored in a materialized view. Further, techniques are provided for rewriting queries that require aggregation at one level of a function-defined hierarchy to cause them to access data from a materialized view that stores data at a different level of the function-defined hierarchy.