Relational databases store information in collections of tables, in which each table is organized into rows and columns. A popular application of relational database technology is data warehousing, in which an organization's data repository is designed to support the decision-making process for the organization. Many data warehouses are characterized by a “star schema,” which employs a very large table called a “fact table” and many smaller lookup tables called “dimension tables.” A fact table contains information collected for analysis, and dimension tables contain information about the attributes of the data in the large fact table.
FIG. 1 illustrates an exemplary database containing three tables, a sales table 100, a time table 110, and a store table 120, useful for recording and organizing information about a company's sales operation. In this example, the sales table 100 is a fact table, and the time table 110 and the store table 120 are dimension tables. The columns of the sales table 100 hold attributes for the sales made by the company, including the date in a date column 101, the store identifier in a store column 103, and the daily sales amount in a sales column 105. Information about the daily sales of each of the company's store is stored in a row of the sales table 100. For example, the first row has an entry on Jan. 2, 2001 (in the date column 101) for store 0001 (in the store column 103), which had sales of $5,500 (in the sales column 105). Similarly, the second row has an entry on Jan. 2, 2001 for store 0002, which had sales of $6,000, and the third row has an entry for sales of $5,000 on Jan. 2, 2001 for store 0003. The sales table 100 also has rows for the sales that occurred on Jul. 5, 2001 in the amounts of $4,500, $7,000, and $8,000 for stores 0001, 0002, and 0003, respectively.
The other tables, time table 110 and store table 120, are dimension tables, which characterizes the sales table 100's date column 101 and store column 103, respectively. The time table 110 holds information in columns that indicate how the time is to be measured. In this example, the time table 110 has a date column 111, a month column 113, and a quarter column 115. The date column 111 is a unique/primary key column that organizes each date in terms of its month and quarter; e.g. date Jan. 2, 2001 has a value of January in the month column 113 and a value of Q1 in the quarter column 115, and the date Jul. 5, 2001 is in the month of July and in the Q3 quarter. The store table 120 holds information in columns that indicate how the stores are geographically located. In this example, the store table 120 has a store column 121, a month column 123, and a state column 125. The store column 121 is a unique/primary key column that organizes each unique identifies each store, indicate the ZIP code of the store in the region column 123 and the state of the store in the state column 125; e.g. store 0001 is in the 94065 ZIP code (in the region column 123) and in California (in the state column 125). Similarly, stores 0002 and 0003 are in the 03064 and 20231 ZIP codes and in the states of New Hampshire and Washington, D.C., respectively.
A database user retrieves information from the tables of a relational database by entering input that is converted to queries by a database application, which submits the queries to a database server. In response to receiving a query, the database server accesses the tables specified in the query to determine which information within the tables satisfies the query. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately presented to the user. Database servers are also capable of combining or “aggregating” information contained in different tables in response to a query. For example, one query for the exemplary database is to list the number of stores having sales grouped by month and store.
For any given database application, the queries must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify information to be retrieved. For example, in the Structured Query Language (SQL), the following query requests the retrieval of a list of the number of the stores having sales grouped by month and store:
[STATEMENT 1]SELECT COUNT(*)FROM SALES, TIME, STOREWHERE SALES.DATE = TIME.DATE AND SALES.STORE = STORE.STOREGROUP BY TIME.MONTH, STORE.STORE;
This query performs a join operation on the product table 500 and the sales table 510. A join operation combines rows from two or more relational database objects, such as tables, views, or snapshots. To process a join operation, the database server combines pairs of rows that satisfy the join conditions and the other predicates. A join is performed whenever multiple tables appear in the FROM clause of query. The SELECT list of the query can reference any of the columns from any of the base objects listed in the FROM clause. Most join queries contain a WHERE clause that contains a predicate that compares two columns, each from a different joined object. Such predicates are referred to join conditions.
A “star query” is a specific kind of query that is often used in a star schema, because techniques have been developed for the efficient processing of star queries. In particular, a star query is a query that contains one of more joins of a fact table with a corresponding dimension table, but none of the dimension tables are joined together, as specified in the join conditions. The exemplary query of STATEMENT 1 is a star query, because each dimension table, namely the time table 110 and store table 120, are joined to the fact table, i.e. sales table 100, but neither the time table 110 nor the store table 120 are joined to each other.
For various reasons, it is desirable to define views that present results of queries. For example, views are often used to provide security by hiding sensitive information or simplify commands for a user by hiding the complexity of the database. A view is a logical table, and as logical tables, views can be queried just as if they were tables. The data that views actually present, however, is extracted or derived from other database objects, which may in fact be tables, other views, or snapshots. Columns and tables that are mapped to a view are referred to herein as base columns and base tables of the view, respectively.
A view is defined by metadata referred to as a view definition, which is typically in the form of a database query. For example, to create a view “RESULTS for presenting the results of the query of STATEMENT 1, the following STATEMENT 2 may be issued to define the view:
[STATEMENT 2]CREATE VIEW RESULTS ASSELECT COUNT(*)FROM SALES, TIME, STOREWHERE SALES.DATE = TIME.DATE AND SALES.STORE = STORE.STOREGROUP BY TIME.MONTH, STORE.STORE;
A materialized view is a view for which a copy of the view data is stored separately from the base tables of the materialized view. Since the view data is persistently stored, the view data is not required to be regenerated every time a query accesses the view, eliminating overhead associated gathering and deriving the view. In order to provide up-to-date view data, however, materialized views must be maintained to reflect the current state of the base tables. When the base tables of a materialized view are modified, corresponding changes are made to the materialized view. Using a materialized view can lead to a cost savings compared with use of a conventional view when the materialized view presents a set of data that is infrequently changed but frequently accessed, especially when the defining query is costly to calculate.
Materialized views thus contain redundant information to improve query performance, because queries can be rerouted to reference the materialized views instead of the original set of tables. Materialized views typically contain GROUP BY and selection predicates. The sizes of the materialized views are often much smaller than those of the original tables referenced by the materialized views. Rerouted queries may have much better query response time, and the kernel of some relational database management systems route the queries according to structural features of the queries and the existing materialized views.
Given a small set of tables, there are many ways to create materialized views to handle the various queries submitted by users, but it is not practical to create materialized views to handle every possible query because materialized views occupy disk storage and take time to build. After the original tables are updated, all of the materialized views based on the original tables must be refreshed, consuming valuable computation resources. For a given workload (e.g. a pattern exhibited by queries submitted by users in a particular deployment environment such as a customer site), different sets of materialized views have different cost/benefit tradeoffs. Accordingly, there is a need for applying workload query information to select a set of materialized views to improve the overall query performance of the workload within given storage constraints.
Previously, identifying which materialized views to create that nearly optimize the workload's query performance within the storage constraints was performed by a stand-alone tool that performs the steps shown in FIG. 2. A step 201, the tool generates a search space based on a fact table and a set of dimension tables submitted by a user. The search space is composed of queries referencing the fact table and zero or more dimension tables. All the dimension tables in these queries are joined either to the fact table or to other dimension tables through foreign keys. Furthermore, queries in the search space differ by groupings (e.g. in the GROUP BY clause). The total number of queries in a search space may be characterized by the formula: π(Li+1), where i iterates from 1 to n, the number of dimensions referenced in the fact table, Li is the number of levels for the ith dimension.
In the example illustrated in FIG. 1 and STATEMENT 1, the fact table is sales table 100, and the dimension tables are time table 10 and store table 120. The time dimension has three levels: date 111, month 113, and quarter 115, and the store dimension also has three levels: store 121, region 123, and state 125. The search space takes the form of a lattice, which, as shown in FIG. 3 for this example, has the sixteen points 300< >, 301<STATE>, 303<REGION>, 305 <STORE>, 310<QUARTER>, 311<QUARTER, STATE>, 313<QUARTER, REGION>, 315 <QUARTER, STORE>, 320<MONTH>, 321<MONTH, STATE>, 323<MONTH, REGION>, 325<MONTH, STORE>, 330<DATE>, 331<DATE, STATE>, 333<DATE, REGION>, 335 <DATE, STORE>. Each point represents a workload query's grouping condition. For example, the grouping condition in the query of STATEMENT 1 is <MONTH, STORE>, which corresponds to point 325 of the search space. The arrow between a pair of points representing queries indicates that the query on the “from” side, if materialized, can be used to answer a query on the “to” side. If the size of the “from” query is much smaller than the size of the fact table, the query response time of the “to” query can be dramatically improved by accessing a pre-computed materialized view based on the “from” query.
For example, a materialized view that was created based on the query of STATEMENT 1, grouping by month and store, can be used to answer queries that group by month and region (point 323) or that group by quarter and store (point 315). This relationship is transitive, such that the materialized view based on the query at point 325<MONTH, STORE> can be used by queries corresponding to any point ultimately reachable from point 325<MONTH, STORE> by one or more arrows. For example, a query at point 320<MONTH>, grouping only by month, can use the materialized view based on the query of STATEMENT 1, which groups by month and store. The null query at point 300< >, which has no grouping, can use any materialized view in FIG. 3, while a query with the most specific grouping, that by date and store, can only use a materialized view corresponding to point 335<DATE, STORE>, which for all intents and purposes tantamount to the original fact table 100.
At step 203, the tool maps the workload queries onto the search space. For example, if a workload query groups by month and store, then that query is mapped to the <MONTH, STORE> point 325. The points in the search space lattice are not all of equal importance because not all workload queries are of equal importance. Some workload queries are more frequent than others. Accordingly, a counter is maintained for each point in the search space lattice to record the total frequencies of the workload queries. Given the strict format of the queries in the search space, the tool can only map star queries onto the search space.
At step 205, existing materialized views are also mapped to the search space. This is to account for the fact that it is generally less expensive to keep an existing materialized view than generate a new materialized view. These materialized views are evaluated for their usefulness in satisfying the previously mapped incoming queries. Specifically, the tool attempts to minimize the average time taken to evaluate the workload queries within the constraint that there is a fixed number of materialized views or a fixed, total amount of space for any number of materialized views. This optimization problem is known to be NP-complete, which makes it cost-prohibitive to exhaustively search all of the possible combinations of materialized views for the best performance within the constraints.
Accordingly, at step 207, heuristics are employed to produce approximate, near-optimal solutions that terminate much sooner than the exhaustive search. One such heuristic is the greedy algorithm described by V. Harinarayam, Anand Rajaraman, Jeffrey D. Ullman, “Implementing Data Cubes Efficiently” in Proc. ACM SIGMOD 1996 (Montreal: ACM, June 1996) pp. 205–216, the contents of which are incorporated by reference in their entirety herein. The Harinarayam et al. greedy algorithm has a complexity of O(n2), where n is the number of points in the search space, and selects a subset of points as candidate for materialized views that delivers a near-optimal performance/cost ratio from a given search space.