This invention relates generally to the field of computers, and in particular to a database query optimizer that selects materialized views on a cost basis.
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawing hereto: Copyright (copyright) 2000, Microsoft Corporation, All Rights Reserved.
Relational databases are collections of rows or tuples of data. Each row may have one or more columns containing information, such as numbers, names, addresses, etc. For example, a column might contain the names of employees, employee ID, address, phone, sales for each day, and other information. This information is stored in a table in the database. All the information in a row in this particular table is related to the same person. A query can be written, requesting information from the table. One such query might be related to sales. The query could be related to finding the sales for each employee on a certain day.
Materialized views have been a subject of database research for over a decade. The basic idea is to materialize, or store, the result of some query, then use such computed result when similar queries are submitted to the database. For example, we may want to store the result of, say, sales per day, and use the stored results to answer several related queries, including sales in a given month, or total sales in the year.
For maximum flexibility, applications should not need to be aware that certain views exist, or are materialized. A query processor should identify matches between user queries and existing pre-computed results, and use such results when applicable. This is known as the view utilization problem: Given a user query written over base tables, as well as a collection of materialized views, which materialized views can be used to answer such query? One must then determine which of the views should be used.
To guarantee transactional correctness, the contents of a view must be kept in synch with respect to changes in base tables. For example, when orders are entered, or modified, the materialization of sales per week has to be updated to reflect the changes. This is known as the view maintenance problem.
Materialized views are similar to indices, in that they should be part of the physical design of the database and their primary purpose is to improve performance. The logical design of the database, and correctness of applications should be independent of the presence or absence of materialized views. As with indices, materialized views can introduce dramatic improvements in query performance.
Prior attempts to determine which views should be used treat the problem in isolation, handle limited scenarios, and often assume a xe2x80x9cglobalxe2x80x9d structure that covers the whole query. This is useful in providing xe2x80x9ccoverage results,xe2x80x9d e.g. given a view with this set of predicates and a query of this form, use this algorithm to decide if the view can be used to answer the query. There is a need to deal with arbitrary queries, and to integrate view utilization within the actual architecture of query optimizers. There is a further need to address the question of, once you know the view can be used to answer the query, should it be used.
Constructing a xe2x80x9cglobalxe2x80x9d structure for the user query, for the purpose of view matching, is incompatible with common optimizer architecture, and it is sometimes impossible, when a xe2x80x9cdisallowedxe2x80x9d construct is used. On some complex queries, view utilization will be possible only in sub-expressions of the complete query. Furthermore, those sub-queries may appear only after some reordering has taken place, which is naturally done in the process of exploration of alternatives.
A cost based query optimizer determines the applicability of materialized views to a query. View utilization alternatives are generated in the exploration stage of optimization, so that interaction with other transformations in complex queries is taken into account. A final decision on whether to use a materialized view is based on estimated cost.
The optimizer generates a table of alternatives, which compactly encodes the various possibilities for each sub-expression of the query. Operator trees are implicitly represented in the table. Materialized views are detected and substituted during exploration of the various possibilities and added to the table of alternatives. Materialized views and the alternatives are selected for use in a query execution plan based on cost.
The optimizer is used to estimate cost, prune out expensive solutions, assemble operator trees and construct an optimal solution. For a given query, there are a general number of candidate views as well as a number of table entries on which utilization of the views can be considered. Information, such as the tables referenced in the query, or whether or not the query contains aggregations are used to determine views that could be relevant.
To reduce the number of operator trees considered, collapsed operator trees are generated to form query graphs that basically list all underlying tables along with the predicates that are applied on them, called a query graph. A view that matches a query graph is added to the table of alternatives. A primitive graph tree is also extracted from the table of alternatives. Such a primitive graph tree allows only a specific set of operators, and only allows primitive tables. This allows view tables to be ignored in favor of primitive database tables.
Two operator trees need not be identical. A residual operator can be used if one operator tree subsumes the other operator tree. The residual expression can contain filters, group by and join operators.