The present invention relates to database systems, and in particular, to rewriting queries to access data sources that are not specifically referenced in the queries.
In a database management system (DBMS), data is stored in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
The present invention is not limited to any particular type of data container or database architecture. However, for the purpose of explanation, the examples and the terminology used herein shall be that typically associated with relational databases. Thus, the terms xe2x80x9ctablexe2x80x9d, xe2x80x9crowxe2x80x9d and xe2x80x9ccolumnxe2x80x9d shall be used herein to refer respectively to the data container, record, and field.
For various reasons, it may not be desirable for certain users to have access to all of the columns of a table. For example, one column of an employee table may hold the salaries for the employees. Under these circumstances, it may be desirable to limit access to the salary column to management, and allow all employees to have access to the other columns. To address this situation, the employees may be restricted from directly accessing the table. Instead, they may be allowed to indirectly access the appropriate columns in the table through a xe2x80x9cviewxe2x80x9d.
A view is a logical table. As logical tables, views may be queried by users as if they were a table. However, views actually present data that is extracted or derived from existing tables. Thus, the problem described above may be solved by (1) creating a view that extracts data from all columns of the employee table except the salary column, and (2) allowing all employees to access the view.
A view is defined by metadata referred to as a view definition. The view definition contains mappings to one or more columns in the one or more tables containing the data. Typically, the view definition is in the form of a database query. Columns and tables that are mapped to a view are referred to herein as base columns and base tables of the view, respectively. The data maintained in the base columns is referred to herein as base data.
The data presented by conventional views is gathered and derived on-the-fly from the base tables in response to queries that access the views. That data gathered for the view is not persistently stored after the query accessing the view has been processed. Because the data provided by conventional views is gathered from the base tables at the time the views are accessed, the data from the views will reflect the current state of the base tables. However, the overhead associated with gathering the data from the base tables for a view every time the view is accessed may be prohibitive.
A materialized view, on the other hand, is a view for which a copy of the view data is stored separate from the base tables from which the data was originally gathered and derived. The data contained in a materialized view is referred to herein as (xe2x80x9cmaterialized dataxe2x80x9d). Materialized views eliminate the overhead associated with gathering and deriving the view data every time a query accesses the view. As new data is periodically added to the base tables, the materialized view needs to be updated (i.e., refreshed) to reflect the new base data.
Materialized views that are derived from more than one base table are created by performing a join between the base tables. A join is a query that combines rows from two or more tables, views, or materialized views. A join is performed whenever multiple tables appear in a query""s FROM clause. The query""s select list can select any columns from any of the base tables listed in the FROM clause.
Most join queries contain WHERE clause conditions that compare two columns, each from a different table. Such a condition is called a join condition. To execute a join, the DBMS combines pairs of rows for which the join condition evaluates to TRUE, where each pair contains one row from each table.
To execute a join of three or more tables, the DBMS first joins two of the tables based on the join conditions comparing their columns and then joins the result to another table based on join conditions containing columns of the joined tables and the new table. The DBMS continues this process until all tables are joined into the result.
In addition to join conditions, the WHERE clause of a join query can also contain other conditions that refer to columns of only one table. These conditions can further restrict the rows returned by the join query.
An equijoin is a join with a join condition containing an equality operator. An equijoin combines rows that have equivalent values for the specified columns. Query1 is an equijoin that combines the rows of tables R and S where the value in column r.a is the same as the value in column s.a:
SELECT *
FROM R, S
WHERE r.a=s.a;
In this example, table R is the xe2x80x9cleftxe2x80x9d or xe2x80x9cchildxe2x80x9d side table of the join, and table S is the xe2x80x9crightxe2x80x9d or xe2x80x9cparentxe2x80x9d table of the join. The join illustrated by Query1 is a xe2x80x9csimplexe2x80x9d or xe2x80x9cinnerxe2x80x9d join. With an inner join, rows from the child table that do not satisfy the join condition are not reflected in the join result. In contrast, an outer join returns all child rows that satisfy the join condition and those rows from the child table for which no rows from the parent satisfy the join condition.
Computer database systems that are used for data warehousing frequently maintain materialized views that contain pre-computed summary information in order to speed up query processing. Such summary information is created by applying an aggregate function, such as SUM, COUNT, or AVERAGE, to values contained in the base tables. Materialized views that contain pre-computed summary information are referred to herein as xe2x80x9csummary tablesxe2x80x9d or more simply, xe2x80x9csummariesxe2x80x9d.
Summary tables typically store aggregated information, such as xe2x80x9csum of PRODUCT_SALES, by region, by month.xe2x80x9d Other examples of aggregated information include counts of tally totals, minimum values, maximum values, and average calculations. Summary tables are used to reduce the overhead associated with processing queries that request summary information. Because the summary information is already pre-computed, it need not be re-computed during the execution of each query that requests summary information. Instead, the pre-computed summary values can be directly retrieved from the summary columns of the summary table.
When a database user knows that a particular materialized view contains the data desired by the user, the user can formulate a query that extracts the desired data directly from that materialized view. However, there are numerous circumstances in which a user will design a query that does not reference a materialized view to retrieve a set of data that can be most efficiently retrieved by accessing the materialized view. For example, the user may not be aware of all of the materialized views available in the database, or may not be sure how a particular materialized view could be used to retrieve the desired information.
Even when the user is aware that materialized views are available that have data desired by the user, the user may prefer to have the server transparently rewrite the query in terms of the materialized views, rather than writing the query to directly reference the materialized views. By allowing the server to rewrite the query, the database administrator may easily add and drop materialized views without affecting the application code that imbeds the user query. For example, if a user query is rewritten by the server to use a particular materialized view MV1, then the application will not break if the database administrator decides to drop MV1. In contrast, if the user query directly references MV1 in the application code, then dropping MV1 will break the application.
In light of the foregoing, some database servers include mechanisms for rewriting queries that do not reference materialized views in a way that causes them to reference materialized views. The execution of the rewritten query is often considerably improved relative to the original query because the relational objects accessed by the rewritten query are (e.g. the materialized views) much smaller than the objects referenced in the original query (e.g. the base tables), and/or the number of objects used by the rewritten query is less than the number of objects referenced in the original query.
For example, a summary can correspond to a query which joins two tables and performs data aggregation to compute sum-of-sales by city, and stores the materialized result in a table. If a user issues a query Q which requests sum-of-sales by city, then Q can be transparently rewritten to access pre-computed data stored in the summary table. Because of the rewrite, the result for Q can be quickly produced by simply scanning the summary table instead of joining two tables and then aggregating the data.
Currently, database systems that include query rewrite mechanisms perform a series of tests on an incoming query to determine whether the query can be rewritten to access a particular materialized view. The tests include tests for (1) join compatibility and (2) data sufficiency. For summary tables, the tests additionally include tests for (3) grouping compatibility and (4) aggregate computability.
With respect to join compatibility, the test is passed if all joins in a materialized view match exactly with joins in a query, with possibly zero or more additional joins in the query. For example, if a materialized view joins tables A and B, and the query requires a join of A, B and C, then the join compatibility test is passed. However, if the materialized view joins tables A, B and D, and the query requires a join of A, B and C, then the join compatibility test fails.
With respect to data sufficiency, the test is passed if all necessary columns of a query can be obtained using the data stored in the materialized view. If values from a column necessary for the processing of a query are not contained in the materialized view, then the data sufficiency test fails.
With respect to grouping compatibility, the test is passed if the aggregates in a summary are at the same or lower level than the level of aggregates in a query. For example, a sum of sales by month is at a lower level than a sum of sales by year. Therefore, the grouping compatibility is passed if the summary has sum of sales by month and the query asks for sum of sales by year, but is failed if the summary has sum of sales by year and the query asks for sum of sales by month.
With respect to aggregate computability, the test is passed if each aggregate required by the query is computable from one or more aggregates in the summary. For example, the aggregate computability test is passed if the summary contains sum and count values, and the query requires an average. The aggregate computability test fails if the summary only contains sum, and the query requires an average.
If all of these tests are passed, then the rewrite mechanism determines that the received query can be rewritten to reference the materialized view in question.
Unfortunately, as these tests are currently applied, database servers often conclude that materialized views cannot be used to process a query when in fact they could be. In fact, database servers may conclude that a query cannot be rewritten to access a materialized view when use of the materialized view would actually be the most efficient way to process the query.
Based on the foregoing, it is clearly desirable to provide a query rewrite mechanism that is capable of rewriting queries to access materialized views that would otherwise have been considered ineligible by conventional rewrite mechanisms.
A method and system is provided for processing queries. According to one aspect of the invention, techniques are provided for handling a query that does not reference a particular materialized view, where the query requires access to values from a particular column not contained in the materialized view. Under these conditions, it is determined whether the particular materialized view satisfies each condition in a set of conditions, the set of conditions at least including a condition that the particular column is functionally dependent on a column that does belong to the materialized view; and if the materialized view satisfies each condition in the set of conditions, then the query is rewritten to produce a rewritten query that references the materialized view and joins the materialized view back to a join-back table that contains the particular column.
According to another aspect of the invention, a technique is provided for processing a query that does not reference a particular materialized aggregate view, where the materialized aggregate view specifies an outer join between a child table and a parent table and the query specifies a particular type of join between the child table and the parent table, where the particular type of join is one of an inner join, an anti-join and a semi-join. The query is rewritten to produce a rewritten query that accesses the materialized aggregate view to produce data required by the query.
According to yet another aspect of the invention, a technique is provided for processing a query that does not reference a particular materialized view and that specifies that results are to be grouped by a first set of one or more columns, where the materialized view reflects data that is grouped by a second set of one or more columns. Under these conditions, it is determined whether each column in the first set of columns either
a) exactly matches a column in the second set of columns,
b) is functionally dependent on another column in the first set of columns, or
c) is functionally dependent on a column in the second set of columns.
If each column in the first set of columns either exactly matches a column in the second set of columns, is functionally dependent on another column in the first set of columns, or is functionally dependent on a column in the second set of columns, then the query is rewritten to produce a rewritten query that references the materialized view.
By using the techniques described herein, more queries can take advantage of the set of materialized views that exist in the database, thus resulting in improved query execution speeds. Conversely, the number of materialized views that a database must contain in order to efficiently process a particular set of queries is reduced, thus reducing the size of the database and eliminating the overhead associated with maintaining the materialized views that are no longer necessary.