The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of query optimization for database systems.
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system. To enhance performance in processing queries, database servers use information about the data distribution to help access data in a database more efficiently. Typical servers comprise a query optimizer which estimates the selectivity of queries and generates efficient execution plans for them. Query optimizers generate execution plans based on the data distribution and other statistical information on the column(s) of the table(s) referenced in the queries. Information about data distribution is also used, for example, for approximate query processing, for load balancing in parallel database systems, and for guiding the process of sampling from a relation.
The increasing importance of decision support systems has amplified the need to ensure that optimizers produce query plans that are as close to optimal as possible. The quality of the optimizer is the most important factor in determining the quality of the plans. The query optimizer component of a database system relies on statistics on the data in the database for generating query execution plans. The availability of the necessary statistics can greatly improve the quality of plans generated by the optimizer. In the absence of statistics, the cost estimates can be dramatically different, often resulting in a poor choice of execution plans. On the other hand, the presence of statistics that are not useful may incur a substantial overhead due to the cost of creating them and the cost of keeping them updated. As an example of the impact of statistics on the quality of plans, consider a tuned TPC-D 1GB database on Microsoft SQL Server 7.0 with 13 indexes and a workload consisting of the 17 queries defined in the benchmark. In all but two queries, the availability of statistics resulted in improved execution cost.
Despite its importance, the problem of automatically determining the necessary statistics to build and maintain for a database has received little or no attention. The task of deciding which statistics to create and maintain is a complex function of the workload the database system experiences, the optimizer's usage of statistics, and the data distribution itself.
Techniques for creating and maintaining only those statistics which are essential to query optimization of a given workload may be leveraged to automate statistics management in databases.
A method performed in accordance with one embodiment of the invention identifies statistics for use in executing one or more queries against a database. The method may be implemented by computer-executable instructions of a computer readable medium. A database system may perform the method with suitable means.
In accordance with the invention, a set of potentially relevant statistics is examined to determine whether they may belong to a set of essential statistics for managing the database. A plurality of projected query costs are computed by assigning a range of selectivity values to the potentially relevant statistics, and a set of essential statistics is formed based on the plurality of projected query costs. Additional statistics may be constructed if the plurality of projected query costs differ from each other by more than a predetermined threshold amount.
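As an illustration only, the comparison of projected costs across a range of assumed selectivities could be sketched as follows. The cost model `toy_cost`, the predicate label, the selectivity values, and the 20% threshold are all hypothetical stand-ins; a real system would obtain each projected cost from its own optimizer.

```python
from typing import Callable, Dict, List, Sequence

def projected_costs(cost_fn: Callable[[Dict[str, float]], float],
                    predicates: Sequence[str],
                    selectivities: Sequence[float]) -> List[float]:
    """Cost the query once per assumed selectivity, applied to every predicate."""
    return [cost_fn({p: s for p in predicates}) for s in selectivities]

def relative_spread(costs: Sequence[float]) -> float:
    """Largest relative gap between projected costs (0.0 when all are equal)."""
    lo, hi = min(costs), max(costs)
    return 0.0 if hi == 0 else (hi - lo) / hi

def toy_cost(sel: Dict[str, float]) -> float:
    # Hypothetical model: fixed overhead plus work proportional to rows selected.
    return 100.0 + 1_000_000 * sel["orders.price > ?"] * 0.01

THRESHOLD = 0.20  # assumed value, not taken from the text
costs = projected_costs(toy_cost, ["orders.price > ?"], [0.001, 0.5, 0.999])
spread = relative_spread(costs)
exceeds_threshold = spread > THRESHOLD
print(f"spread={spread:.3f}, exceeds threshold: {exceeds_threshold}")
```

In this toy case the projected cost is highly sensitive to the assumed selectivity, so the spread comfortably exceeds the assumed threshold.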
Statistics which are deemed non-essential may be added to a list of statistics to be dropped when an elimination criterion is met, for example when the cost of maintaining the statistic reaches a predetermined maximum.
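One possible reading of this bookkeeping is sketched below: non-essential statistics are placed on a drop list, and an entry is actually dropped once its accumulated maintenance cost reaches the maximum. The `Stat` record, its field names, and the cost units are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stat:
    name: str
    essential: bool
    maintenance_cost: float  # accumulated update cost, arbitrary units (assumed)

def apply_elimination(drop_list: List[Stat],
                      max_cost: float) -> Tuple[List[Stat], List[Stat]]:
    """Split the drop list into statistics to drop now and those still listed."""
    to_drop = [s for s in drop_list if s.maintenance_cost >= max_cost]
    still_listed = [s for s in drop_list if s.maintenance_cost < max_cost]
    return to_drop, still_listed

stats = [
    Stat("orders.price", essential=True, maintenance_cost=500.0),
    Stat("orders.comment", essential=False, maintenance_cost=120.0),
    Stat("customer.phone", essential=False, maintenance_cost=30.0),
]
# Only non-essential statistics are candidates for dropping.
drop_list = [s for s in stats if not s.essential]
dropped, still_listed = apply_elimination(drop_list, max_cost=100.0)
```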
If the plurality of projected query costs differ by more than the threshold amount, a next statistic to be constructed may be selected based on a predetermined criterion, for example the relevancy of the statistic to relatively expensive operators.
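A minimal sketch of one such selection criterion follows, assuming the plan exposes a per-operator cost and that each operator can be mapped to the candidate statistics relevant to it. The operator names, costs, and candidate lists are hypothetical.

```python
from typing import Dict, List, Optional, Set

def next_statistic(operator_costs: Dict[str, float],
                   candidates_by_operator: Dict[str, List[str]],
                   built: Set[str]) -> Optional[str]:
    """Pick an unbuilt statistic relevant to the most expensive operator."""
    # Visit operators from most to least expensive.
    for op in sorted(operator_costs, key=operator_costs.get, reverse=True):
        for stat in candidates_by_operator.get(op, []):
            if stat not in built:
                return stat
    return None  # every candidate statistic already exists

operator_costs = {"hash_join": 800.0, "scan_orders": 150.0, "sort": 50.0}
candidates = {
    "hash_join": ["orders.cust_id", "customer.id"],
    "scan_orders": ["orders.price"],
}
choice = next_statistic(operator_costs, candidates, built={"orders.cust_id"})
```

With `orders.cust_id` already built, the sketch selects the remaining statistic attached to the costliest operator, `customer.id`.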
For an embodiment of the method particularly suited to automated database management, an initial set of essential statistics is compiled by examining a set of potentially relevant statistics to determine whether they may belong to a set of essential statistics for managing the database. A plurality of projected query costs are computed by assigning a range of selectivity values to the potentially relevant statistics, and a set of essential statistics is formed based on the plurality of projected query costs. Additional statistics may be constructed if the plurality of projected query costs differ from each other by more than a predetermined threshold amount. Non-essential statistics are eliminated by identifying a subset of the initial set of statistics that is equivalent to the initial set with respect to each query. The subset and the initial set may be determined to be equivalent if an execution plan for each query using the subset of statistics is the same as an execution plan for that query using the initial set of statistics and/or if a cost estimate to execute each query against the database using the subset of statistics is within a predetermined amount of a cost estimate to execute that query against the database using the initial set of statistics.
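The equivalence test and the elimination of non-essential statistics could be sketched as below. The text does not say how the subset is searched for, so the greedy one-at-a-time removal here is an assumption, as are the `toy_optimize` stand-in for the optimizer, the relative cost tolerance, and the choice to accept either criterion (same plan, or cost within tolerance).

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

# An optimizer call returns (plan identifier, estimated cost); hypothetical shape.
Optimizer = Callable[[str, FrozenSet[str]], Tuple[str, float]]

def equivalent(optimize: Optimizer, queries: Iterable[str],
               subset: FrozenSet[str], full: FrozenSet[str],
               cost_tolerance: float) -> bool:
    """True if, for every query, the subset yields the same plan as the full
    set or a cost estimate within cost_tolerance (relative) of it."""
    for q in queries:
        plan_s, cost_s = optimize(q, subset)
        plan_f, cost_f = optimize(q, full)
        close_cost = abs(cost_s - cost_f) <= cost_tolerance * cost_f
        if not (plan_s == plan_f or close_cost):
            return False
    return True

def shrink(optimize: Optimizer, queries: Iterable[str],
           initial: Set[str], cost_tolerance: float = 0.10) -> Set[str]:
    """Greedily remove statistics whose absence keeps the set equivalent."""
    queries = list(queries)
    full = frozenset(initial)
    current = set(initial)
    for stat in sorted(initial):
        trial = frozenset(current - {stat})
        if equivalent(optimize, queries, trial, full, cost_tolerance):
            current = set(trial)
    return current

def toy_optimize(query: str, stats: FrozenSet[str]) -> Tuple[str, float]:
    # Hypothetical optimizer: only the statistic on orders.price changes the plan.
    if "orders.price" in stats:
        return ("index_seek", 40.0)
    return ("table_scan", 400.0)

initial = {"orders.price", "orders.date", "customer.region"}
essential = shrink(toy_optimize, ["Q1", "Q2"], initial)
```

In this toy setup only `orders.price` influences the chosen plan, so it is the only statistic that survives the shrinking pass.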
The addition of statistics to the initial set of essential statistics and the elimination of non-essential statistics may be performed in real time after a predetermined number of queries has been executed or a predetermined amount of time has elapsed. Alternatively, the addition of statistics to the initial set of essential statistics and the elimination of non-essential statistics may be performed offline on a stored workload log of previously executed queries. The addition of statistics to the initial set of essential statistics may be performed by assigning a probability of creation to a statistic based on the proportional amount of the query workload to which it is potentially relevant.
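The probability-of-creation idea could be sketched as follows, assuming that "proportional amount of the workload" is measured as the fraction of logged queries that reference the statistic's column(s). Weighting by query cost would be an equally plausible reading. The workload representation (one set of referenced columns per query) is hypothetical.

```python
import random
from typing import FrozenSet, List, Set

def creation_probability(stat_columns: FrozenSet[str],
                         workload: List[Set[str]]) -> float:
    """Fraction of workload queries that reference all of the statistic's columns."""
    if not workload:
        return 0.0
    relevant = sum(1 for cols in workload if stat_columns <= cols)
    return relevant / len(workload)

def maybe_create(stat_columns: FrozenSet[str], workload: List[Set[str]],
                 rng: random.Random) -> bool:
    """Decide at random whether to build the statistic, using that probability."""
    return rng.random() < creation_probability(stat_columns, workload)

# Each entry lists the columns one logged query references (illustrative data).
workload = [{"a", "b"}, {"a"}, {"c"}, {"a", "c"}]
p_a = creation_probability(frozenset({"a"}), workload)
p_ab = creation_probability(frozenset({"a", "b"}), workload)
build_a = maybe_create(frozenset({"a"}), workload, random.Random(0))
```

A statistic relevant to most of the workload is therefore built with high probability, while one that serves a single rare query is usually skipped.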