Relational databases store information in collections of tables, in which each table is organized into rows and columns. A popular application of relational database technology is data warehousing, in which an organization's data repository is designed to support the decision-making process for the organization.
FIG. 1 illustrates an exemplary database containing three tables, a sales table 100, a time table 110, and a store table 120, useful for recording and organizing information about a company's sales operation. In this example, the sales table 100 is a fact table, and the time table 110 and the store table are dimension tables. The columns of the sales table 100 hold attributes for the sales made by the company, including the date in a date column 101, the store identifier in a store column 103, and the daily sales amount in a sales column 105. Information about the daily sales of each of the company's stores is stored in a row of the sales table 100. For example, the first row has an entry on Jan. 2, 2001 (in the date column 101) for store 0001 (in the store column 103), which had sales of $5,500 (in the sales column 105). Similarly, the second row has an entry on Jan. 2, 2001 for store 0002, which had sales of $6,000, and the third row has an entry for sales of $5,000 on Jan. 2, 2001 for store 0003. The sales table 100 also has rows for the sales that occurred on Jul. 5, 2001 in the amounts of $4,500, $7,000, and $8,000 for stores 0001, 0002, and 0003, respectively.
The other tables, time table 110 and store table 120, are dimension tables, which characterize the sales table 100's date column 101 and store column 103, respectively. The time table 110 holds information in columns that indicate how the time is to be measured. In this example, the time table 110 has a date column 111, a month column 113, and a quarter column 115. The date column 111 is a unique/primary key column that organizes each date in terms of its month and quarter; e.g. date Jan. 2, 2001 has a value of January in the month column 113 and a value of Q1 in the quarter column 115, and the date Jul. 5, 2001 is in the month of July and in the Q3 quarter. The store table 120 holds information in columns that indicate how the stores are geographically located. In this example, the store table 120 has a store column 121, a region column 123, and a state column 125. The store column 121 is a unique/primary key column that organizes each unique identifier of each store, indicates the ZIP code of the store in the region column 123 and the state of the store in the state column 125; e.g. store 0001 is in the 94065 ZIP code (in the region column 123) and in California (in the state column 125). Similarly, stores 0002 and 0003 are in the 03064 and 20231 ZIP codes and in the states of New Hampshire and Washington, D.C., respectively.
A database user retrieves information from the tables of a relational database by entering input that is converted to queries by a database application, which submits the queries to a database server. In response to receiving the query, the database server accesses the tables specified in the query to determine which information within the tables satisfies the query. The information that satisfies the query is then retrieved by the database server and transmitted to the database application and ultimately presented to the user. Database servers are also capable of combining or “aggregating” information contained in different tables in response to a query. For example, one query for an exemplary database may be to list the number of stores having sales grouped by month and store.
For any given database application, the queries must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify information to be retrieved. For example, in the Structured Query Language (SQL), the following query requests the retrieval of a list of the number of the stores having sales grouped by month and store:
[STATEMENT 1]
SELECT Time.month, Store. store, COUNT(*)
FROM Sales, Time, Store
WHERE Sales.date=Time.date AND Sales.store=Store.store
GROUP BY Time.month, Store.store;
This query performs a join operation on the sales table 100, the time table 110, and the store table 120. A join operation combines rows from two or more relational database objects, such as tables, views, or snapshots. To process a join operation, the database server combines pairs of rows that satisfy the join conditions and the other predicates. A join is performed whenever multiple tables appear in the FROM clause of a query. The SELECT list of the query can reference any of the columns from any of the base objects listed in the FROM clause. Most join queries contain a WHERE clause that contains a predicate that compares two columns, each from a different joined object. Such predicates are referred to as join conditions.
A “star query” is a specific type of query that is often used in a star schema, because techniques have been developed for the efficient processing of star queries. In particular, a star query is a query that contains one or more joins of a fact table with a corresponding dimension table, but none of the dimension tables are joined together, as specified in the join conditions. The exemplary query of STATEMENT 1 is a star query, because each dimension table, namely the time table 110 and store table 120, are joined to the fact table, i.e., sales table 100, but neither the time table 110 nor the store table 120 are joined to each other.
For various reasons, it is desirable to define views that present results of queries. For example, views are often used to provide security by hiding sensitive information or to simplify commands by hiding the complexity of the database. A view is a logical table, and as logical tables, views can be queried as if they were tables. The data that views actually present, however, is extracted or derived from other database objects, which may be tables, other views, or snapshots. Columns and tables that are mapped to a view are referred to herein as base columns and base tables or detail tables of the view, respectively.
A view is defined by metadata referred to as a view definition, which is typically in the form of a database query. For example, to create a view “RESULTS” for presenting the results of the query of STATEMENT1, the following STATEMENT2 may be issued to define the view:
[STATEMENT2]
CREATE VIEW RESULTS AS
SELECT Time.month, Store. store, COUNT (*)
FROM Sales, Time, Store
WHERE Sales.date=Time.Date AND Sales.store=Store.store
GROUP BY Time.month, Store.store;
A materialized view is a view for which a copy of the view data is stored separately from the detail tables of the materialized view. Since the view data is persistently stored, the view data is not required to be regenerated every time a query accesses the view, eliminating overhead associated with gathering and deriving the view. In order to provide up-to-date view data, however, materialized views must be maintained to reflect the current state of the detail tables. When the detail tables of a materialized view are modified, corresponding changes are made to the materialized view. Using a materialized view can lead to a cost savings compared with use of a conventional view when the materialized view presents a set of data that is infrequently changed but frequently accessed, especially when the defining query is costly to calculate.
Materialized views thus contain redundant information to improve query performance, because queries can be rerouted to reference the materialized views instead of the original set of tables. Materialized views typically contain GROUP BY and selection predicates. The sizes of the materialized views are often much smaller than those of the original tables referenced by the materialized views. Rerouted queries may have much better query response time, and the kernel of some relational database management systems route the queries according to structural features of the queries and the existing materialized views.
As discussed previously, the materialized view maintains pre-calculated data resulting from highly aggregated/joined queries. It has been increasingly and widely used in data warehouse applications to achieve high-performance query processing in accessing very large amounts of historical data. Its values can be seen in two aspects as query rewrite and incremental refresh. A query rewrite mechanism achieves better query processing performance by rewriting the given query in terms of the materialized view. The response time is thus shortened by using pre-calculated results in the materialized view. Incremental refresh of the materialized view provides a faster mechanism to synchronize the data between the detail tables and the materialized view.
Conventionally, the materialized view was “atomic” and treated as an inseparable object so that the status of its data could only be either globally fresh or globally stale, which has been a drawback that has affected the availability of the materialized view. For example, when any of the detail tables of the materialized view were updated, the materialized view immediately became globally “stale.” All the data in the materialized view became distrusted and unusable in the query rewrite (unless lowering the confidence level) since there was no easy way to identify which part of the data was affected by the changes.
In materialized view maintenance, row-based incremental refresh (using a log with changed data rows) has been a data synchronization approach not needing to recalculate the data of the materialized view (i.e., complete refresh). However, the row-change-based incremental refresh has limitations. For example, when the detail table is partitioned (e.g., by range or list of values) and a partition maintenance operation (e.g., exchange partitions) occurs, the conventional row-based incremental refresh is unable to handle such partition changes. As a result, the materialized view has been restored to the fresh state through a complete refresh that is very expensive especially with a large amount of data. The use of data partitioning has become very popular and has been increasingly adopted in Online Analytical Processing (OLAP) systems and is very suitable for managing historical data. Lacking an ability to incrementally refresh the materialized view after partition maintenance operations is a disadvantage.
In addition to the limitation on the partition maintenance operation, conventional row-based incremental refresh can not perform its work without extra information. For example, when multiple mixed types of Data Manipulation Language (DML) statements (i.e., delete, insert and update) and direct loads are performed on the partitioned base tables, row-based incremental refresh is not possible without change sequence information in the log. Such sequence information is an extra logging cost and is not always specified by the user.
Therefore, there is a need for a technique for refreshing materialized views that does not require a complete refresh, and does not rely on the log for change sequence information.