Computer database systems that are used for data warehousing frequently store pre-computed summary information in summary tables in order to speed up query processing. The data from which the summary tables are generated are referred to as base data. The tables that contain the base data are referred to as base tables.
Summary tables typically store aggregated information, such as "sum of PRODUCT_SALES, by region, by month." Other examples of aggregated information include counts of tally totals, minimum values, maximum values, and average calculations. As new data is periodically added to the base tables, the summary information needs to be updated (i.e., refreshed) to reflect the new base data.
One approach to refreshing summary tables is referred to as the "total refresh" or "full refresh" approach. According to the total refresh approach, the values in summary tables are recalculated based on all of the base data every time new base data is supplied. Systems that employ full refresh approach have the disadvantage that the recreation process is a relatively lengthy operation due to the size and number of tables from which the summary information is derived. For example, when ten new rows are added to a particular base table that contains a million rows, a total refresh operation would have to process all one million and ten rows of the base table to regenerate the summary tables derived using the base table.
The process of updating summary information may be improved by performing incremental refresh, where rather than generating a new set of summary information based on calculations that use all of the base data, the summary information is updated based on previous summary values and the new base data.
One difficulty associated with performing incremental refresh is that a single summary table may contain summarized data derived from multiple base tables. For example, assume that a database includes the three base tables illustrated in FIG. 1. Referring to FIG. 1, a PRODUCT_SALES table 106 contains information about specific sales made by a company. A LOCATION table 104 contains information about where the sales took place. A TIME table 102 contains information about the times at which sales were made.
FIG. 2 illustrates a SALESREPORT table 202 that stores the pre-computed result of an aggregate query based on the PRODUCT_SALES table 106, LOCATION table 104 and TIME table 102. In the specific example given, SALESREPORT table 202 stores the sum of all sales made in each city during each month.
Information that defines a summary table is referred to herein as a "summary definition". A summary definition indicates (1) the location of the base data that is used to derive the summary table, and (2) how the base data from the base tables should be aggregated to derive the summarized data. In many systems, summary definitions take the form of queries. For example, the query for the SALESREPORT table 202 may be expressed in a relational database system using the following SQL language statement which joins the three base tables:
SELECT time.month, location.city, SUM (product_sales.total) sumsales FROM time t1, location t2, PRODUCT_SALES t3 WHERE time.time_id=product_sales.time_id AND location.loc_id=product_sales.loc_id GROUP BY month, city
The "SELECT" line indicates that columns that are to be used to generate the summary table. The "FROM" line indicates the tables that have those columns. The "WHERE" line indicates the criteria for joining values from one table with corresponding values from the other tables. For example, values in row 108 of TIME table 102, row 110 of LOCATION table 104, and row 112 of PRODUCT_SALES table 106 would be joined together because the time_id value in row 108 matches the time_id value in row 112, and the loc_id value in row 110 matches the loc_id value in row 112.
The "GROUP BY" line indicates that the rows retrieved by the select statement should be put into groups for each month/city combination. The "sum" function in the "SELECT" line indicates that for each of those groups, the values from the "total" columns are to be summed. The resulting summary table will thus have one row for every month/city combination, and that row will have a column called "sumsales" that contains the sum of the "total" column values for that month/city combination.
The SALESREPORT table 202 illustrated in FIG. 2 is an example of the summary table that would be generated in response to the summary table definition specified above. The SALESREPORT table 202 has the three rows specified in the SELECT line of the summary table definition: month, city, and sumsales.
If the system that maintains the SALESREPORT table 202 does not support incremental refresh, the query listed above must be run against the base tables each time data is added to any of the base tables. Rather than re-compute the entire contents of SALESREPORT when new data is added to the system, it would be more efficient to re-compute just the changes to the existing SALESREPORT that result from the new data that was added to the base tables.
Prior mechanisms for incremental refresh have relied upon the re-computing of aggregates as data is being loaded into base tables. Thus, as one set of rows is added to a base table, the summary table is updated based on the data in the new rows. As a second set of rows is added to a base table, the summary table is again updated based on the data in the second set of rows. This is not a practical nor efficient solution for two reasons: first, the database user may have business reasons to not re-compute the summary each and every time data is loaded into base tables; and second, combining aggregation with data loading degrades the performance of the data-loading procedure.
Based on the foregoing, it is clearly desirable to provide a practical and efficient mechanism for incremental refresh of summaries that allows refresh to be deferred until after base data has been loaded into the data warehouse. It is further desirable that the mechanism correctly and efficiently refresh the summary regardless how many times new data has been added to the base table since the last time the summary table was refreshed.
In addition, it is desirable to provide an incremental refresh mechanism produces the correct results even when multiple base table updates have been loaded since the last refresh operation. It is further desirable that the mechanism perform correctly if a particular base table has had new data loaded into it multiple times since the last refresh of the summary table. The mechanism for incremental refresh should also perform multiple operations in parallel in order to perform the incremental refresh more quickly.