The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of query optimization for database systems.
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system. To enhance performance in processing queries, database servers use indexes to help access data in a database more efficiently. Typical database servers comprise a query optimizer to generate efficient execution plans for queries with respect to a set of indexes. Query optimizers typically rely on statistics to derive cost estimates and choose among possible execution plans for a query. The availability of statistics can greatly improve cost estimation and the quality of execution plans chosen by the query optimizer.
Statistics may be created and maintained on a table, an index, a single column of a table, or combinations of columns of a table, although the structure of statistics may vary from system to system. Single column statistics typically comprise a histogram of values in the domain of that column and may include one or more of the following parameters: the number of distinct values in the column, the density of values in the column, and the second highest and the second lowest values in the column. Multi-column statistics typically represent information on the distribution of values over the Cartesian product of the domains in it. As one example, multi-column statistics on (R2.c, R2.d) may contain information on the joint distribution of values over R2.c and R2.d. In Microsoft(copyright) SQL Server, for example, such multi-column statistics would contain joint density information and a histogram on the leading dimension R2.c. The single and multi-column statistics available for a database make cost estimation significantly more accurate and help the query optimizer arrive at better query execution plans. In the absence of statistics, cost estimates can be dramatically different often resulting in a poor choice of the execution plan.
Creating as well as maintaining statistics, however, can incur significant costs in storage, time, and memory, particularly for large databases. The space of single and multi-column statistics can be very large because many combinations of columns are possible. Statistics can therefore consume significant amounts of secondary storage. Also, to remain effective, statistics need to be updated as the data in the database changes. The cost of updating statistics on columns of large tables can be substantial. Updating statistics on a column requires scanning the table for values in that column and sorting the values to produce, for example, a histogram and other statistics. Furthermore, the query optimizer loads all potentially relevant statistics for a query into memory during optimization. Multiple users concurrently running against a database server, for example, can incur significant costs in CPU time and memory for loading statistics.
An essential statistics identification utility tool attempts to reduce or minimize the overhead associated with statistics by identifying from an initial set of statistics a set of essential statistics that provide a query optimizer with the ability to choose among query execution plans with minimized loss in accuracy as compared to using the initial set of statistics.
A method identifies statistics for use in executing one or more queries against a database. The method may be implemented by computer-executable instructions of a computer readable medium. A database system may perform the method with suitable means.
For the method, an initial set of statistics is identified. A subset of the initial set of statistics equivalent to the initial set of statistics with respect to each query is then identified. The subset of statistics is equivalent to the initial set of statistics if an execution plan for each query using the subset of statistics is the same as an execution plan for that query using the initial set of statistics and/or if a cost estimate to execute each query against the database using the subset of statistics is within a predetermined amount of a cost estimate to execute that query against the database using the initial set of statistics. The predetermined amount may be a predetermined percentage of the cost estimate to execute that query against the database using the initial set of statistics. Also, the predetermined amount may be zero.
The subset of statistics may be identified such that any proper subset of the subset of statistics is not equivalent to the initial set of statistics with respect to each query and/or such that an update cost or size for the subset of statistics is minimized.
The subset of statistics may be identified by removing statistics from the initial set of statistics if the remaining subset of statistics is equivalent to the initial set of statistics with respect to each query.
The one or more queries may be identified from a workload of queries as the one or more queries that have estimated execution costs greater than any other query of the workload and that account for at least a predetermined percentage of a total estimated execution cost for the workload. Each query of the workload may be identified such that the subset of statistics is not equivalent to the initial set of statistics with respect to that query, and another subset of statistics may be identified by removing statistics from the initial set of statistics if the remaining subset of statistics is equivalent to the initial set of statistics with respect to each such identified query.
For each statistic of the initial set of statistics, a respective set of queries may be identified from a workload of queries such that that statistic is potentially relevant to each query in the respective query set and such that each query in the respective query set has estimated execution costs greater than any other potentially relevant query of the workload. For each statistic of the initial set of statistics, whether the initial set of statistics without that statistic is equivalent to the initial set of statistics with respect to each query in the respective query set may then be determined, and, if not, that statistic is included in a first subset of statistics. The one or more queries may then be identified from the workload as each query of the workload such that the first subset of statistics is not equivalent to the initial set of statistics with respect to that query.
The subset of statistics may be identified by identifying a subset of the initial set of statistics, determining whether such an identified subset of statistics is equivalent to the initial set of statistics with respect to each query, and repeating these steps for other subsets of the initial set of statistics. These steps may be repeated until an identified subset of statistics is equivalent to the initial set of statistics with respect to each query. Subsets of the initial set of statistics may be identified in increasing order of update cost or size.
The subset of statistics may be identified by identifying a seed subset of the initial set of statistics, adding a subset of the remaining subset of the initial set of statistics to the seed subset to produce a current subset of statistics, determining whether the current subset of statistics is equivalent to the initial set of statistics with respect to each query, and repeating these steps for other subsets of the remaining subset of the initial set of statistics. These steps may be repeated until the current subset of statistics is equivalent to the initial set of statistics with respect to each query. The seed subset may be identified by determining whether the initial set of statistics without any one statistic of the initial set of statistics is equivalent to the initial set of statistics with respect to each query, and, if not, including the one statistic in the seed subset. Subsets of the remaining subset of the initial set of statistics may be identified in increasing order of update cost or size.
The subset of statistics may be identified by partitioning statistics of the initial set of statistics into partitions such that each partition comprises statistics on a respective table of one of the one or more queries, identifying a subset of each partition such that the combined partition subsets are equivalent to the initial set of statistics with respect to the one query, and repeating these steps for each query. Each partition subset may be identified such that an update cost or size for each partition subset is minimized.
The subset of statistics may be identified by identifying a seed subset of the initial set of statistics, partitioning statistics of the remaining subset of the initial set of statistics into partitions such that each partition comprises statistics on a respective table of one of the one or more queries, identifying a subset of each partition such that the combined partition subsets and seed subset are equivalent to the initial set of statistics with respect to the one query, and repeating these steps for each query. The seed subset may be identified by determining whether the initial set of statistics without any one statistic of the initial set of statistics is equivalent to the initial set of statistics with respect to each query, and, if not, including the one statistic in the seed subset. Each partition subset may be identified such that an update cost or size for each partition subset is minimized.