This invention relates generally to the field of databases, and in particular to index and materialized view selection for a given workload of queries to a database.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawing hereto: Copyright(copyright) 2000, Microsoft Corporation, All Rights Reserved.
Databases have grown very large in size. When a user desires information, they issue a query, which asks for information, such as give me the names of people from Bozeman, Mont. with income greater than a certain amount. There may be a table in a database containing a row for each person in the United States. Each row may have several columns, including income, social security number, street address, city, state, telephone number, names of dependents, occupation, etc. By searching this large table, a computer can find all the people in Bozeman, Mont., and then determine from that list of people, who has income greater than the specified amount. To do this search, the computer has to deal with the entire large table, retrieving it from storage into memory, and then searching row by row for the desired information.
One way to make this type of search more efficient, is to build indexes. An index is a subset of a table, which typically contains fewer columns than the table itself. Indexes are sometimes created prior to a user query being made. Some indexes are arranged in a tree like structure, such as a B+ tree, which makes finding information even faster. If an index exists that contains the data that the user desires, it is much easier to simply search the index to provide the information.
In the case of the above query, an index could be generated based on State, and then people in Montana could be quickly identified. This index would be much smaller than the table, and the answer could be provided much more quickly because the entire table does not need to be searched. An index with income can also be generated to quickly identify people with income greater than the specified amount. An index on city could also be created to quickly identify people in the city of Bozeman. In each of these cases, the table would have to be consulted absent further relevant columns in the index.
A covering index for a query may be used. A covering index for a query is an index that is a subset of the large table, yet has at least all the columns from a table needed to answer the query. For the example query, an index having city, state, people and income columns would be a covering index because it has all the columns needed to answer the query without resort to the table.
A further way to improve the performance of a query is to use a materialized view. A materialized view is sometimes referred to as an indexed view in some database products. It is some view of the data, such as the results of a query, which have been materialized. A materialized view may not be directly tied to a query. A materialized view has some characteristics that a traditional index does not have. Materialized views may be defined over multiple tables, and can have selections and group-by over multiple columns. It can have selection and grouping of columns. As with traditional indexes, materialized views also incur an overhead each time the tables referenced in the materialized view are updated. It can also provide a dramatic performance improvement. On the other hand, a traditional index may be applicable more broadly to different queries in the workload.
Using the example above, a very good materialized view would comprise a list of people in Bozeman, Mont. with at least the specified income. However, if a new person moves to Bozeman, Mont. having at least the specified income, the materialized view needs to be updated.
A workload is a set of queries and updates which are run against a given database. A configuration is a set of traditional indexes, materialized views and indexes on materialized views which may be used in the execution of the workload. Given a workload of multiple queries, the decision as to which indexes and materialized views to include in a configuration of indexes and materialized views is very complex and time consuming. Since there is overhead associated with generating, maintaining and storing the indexes and materialized views, this must be offset against the benefit obtained by using them.
A query optimizer is used to obtain information about indexes and materialized views given a set of queries. Optimizers are database components that generate an optimized plan for each query and an associated cost for that query. They can take a representative workload, and return information of a plan of execution for each query in the workload, as well as information about the cost of execution. The optimizer provides detailed information in the form of a tree which has nodes corresponding to a plan of execution of the query. The nodes provide information on the data structures used, and the cost of using them. From this cost information, a user can try to determine which indexes and materialized views should be built or used to enable the workload to be processed in an efficient manner.
One way of determining which set of indexes and materialized views provides the most benefit given a workload is to actually propose a set of indexes and materialized views, build them, and then run the workload. The total cost of the workload is then calculated. Several different sets of indexes and materialized views are measured in this manner to find the best set. This is a very expensive proposition since the space of materialized views and indexes is extremely large. In addition, building them can be very expensive due to updates and storage constraints. Selecting materialized views is also intertwined with the selection of indexes, adding to complexity. As the workload changes, the entire process must be repeated to arrive at a new set of indexes and materialized views and indexes on materialized views.
Prior attempts at materialized view selection typically assume that a set of materialized views which are likely to be of benefit is pre-specified, and they focus only on the xe2x80x9csearchxe2x80x9d problem of picking an attractive set of materialized views among these views. The attractive set of materialized views can be referred to as a set of candidate materialized views. Candidate materialized views may be thought of as a set of materialized views such that a search on them preserves most of the gains of searching the entire space of possible materialized views at a fraction of the cost. The prior attempts assume that the space of candidate materialized views is synonymous with the space of syntactically relevant materialized views, which are essentially all materialized views which could be used for the workload. This approach is not scalable for reasonably large workloads since the space of syntactically relevant materialized views is so large that known materialized view enumeration schemes are computationally infeasible. Moreover, most previous work does not consider materialized views with selections. Adding to the complexity of selecting materialized views, is that they may be defined over multiple tables, and can have selections and group-by over multiple columns.
Finally, the prior attempts focus exclusively on materialized views but do not address a key issue of balancing the choice of indexes and materialized views. For any single query, a materialized view that matches the query can speed up execution of the query dramatically, and an index on the materialized view can further speed up execution of the query. On the other hand, a traditional index may be applicable more broadly, and of benefit to multiple different queries. Thus, it is difficult to strike that balance and select the right combination of indexes and materialized indexes and indexes on materialized views.
A set of interesting table-subsets to consider for a configuration for a database workload are generated. Materialized views are proposed on a per query basis by first determining subsets of tables that are referenced by queries in the workload. Uninteresting subsets are pruned or eliminated based on cost. The remaining subsets are then pruned based on a table subset weight metric which is proportional to the size of the table subset divided by the size of all tables referenced in the query, and multiplied by cost of the corresponding query. This pruning results in a set of interesting table subsets. Next, interesting table subsets are considered on a per query basis to determine syntactically relevant materialized views to propose for a query.
Given the set of syntactically relevant materialized views, an optimizer is used for each to obtain a set of materialized views that are likely to be used to answer the query. This set of materialized views is provided to an enumerator together with a set of indexes for selection of a configuration to recommend for the workload.