The present invention relates generally to database systems, and more particularly to a system for automatically maintaining statistics for data stored in a database.
A relational database is a collection of related data that is organized in related two-dimensional tables of columns and rows. The data stored in a relational database is typically accessed by way of a user-defined query that is constructed in a query language such as Structured Query Language (SQL). A SQL query is non-procedural in that it specifies the objective or desired result of the query in a language meaningful to a user but does not define the steps to be performed, or the order of the steps in order to accomplish the query.
When a SQL query is applied to a database, the relational database management system (RDBMS) processes the non-procedural query and develops a plurality of procedural execution plans, any one of which could be employed to implement the objectives of the SQL query. While each of the execution plans may fulfill the objectives of the SQL query, each has a different cost (in terms of processing cycles and logical I/Os) associated with it. The relational database management system attempts to select the execution plan which is most efficient, i.e. has the least overall cost. Generally, the selection is made on the basis of statistics that are derived from the data in the database and maintained by the RDBMS. For example, a database may record the number of distinct values in a pre-designated index column along with a histogram of the values distributed across the column. Using these and other statistics, the database system makes an educated guess of the cost associated with each execution plan, selects the most efficient plan, and stores the selected plan in an area of memory sometimes referred to as an execution plan cache. On subsequent invocations of the same SQL query, rather than re-traverse the process of identifying the most efficient execution plan, the database system refers to the execution plan cache and selects the previously identified optimal execution plan.
Data in a database may frequently change. Therefore, in order for the RDBMS to effectively select the most efficient execution plan, it is necessary that the database statistics be periodically updated to reflect changes to the data. In existing systems, statistics are updated by the database system in response to commands input manually by a database administrator. In enterprises with substantial resources, relying upon a database administrator to update statistics may be acceptable. For many smaller organizations, however, resources are often not available to devote to this task.
Furthermore, in existing systems, when a command to update statistics is issued, all statistics for all indices in the database are updated regardless of whether there has been an update to the underlying data. This brute force method of updating statistics is inefficient and typically requires a significant amount of processing power. Indeed, the processing power required to update all statistics in a database may burden the system such that other database tasks are negatively impacted. For this reason, in many organizations, rather than update statistics as needed to reflect changes to the underlying data, updates are scheduled for periods of low database usage, typically during off-peak hours. Thus, a database may be allowed to operate with out-of-date statistics and at a less than efficient state until such time arrives that is opportune to run the statistics update.
An additional complication resulting from changing data arises from outdated or xe2x80x9cstalexe2x80x9d execution plans. As previously noted, after an optimal execution plan has been identified, the execution plan is stored in a plan cache so that it can be retrieved and executed in the future. Execution plans that were generated using statistics that have since been updated may no longer represent the optimal execution plan for a particular SQL query. In existing systems, there is no provision for updating these outdated execution plans. Rather, in existing systems execution plans may remain in cache and be executed long after they have become outdated.
Thus, there is a need for a database system which maintains a measure of changes in the data and automatically updates statistics as needed. Preferably, updates should be made to only those statistics that are derived from tables which have been updated. Additionally, execution plans which were generated on the basis of statistics that have since been updated should be discarded and new execution plans generated.
Briefly, the present invention is directed toward remedying these shortcomings by automatically maintaining statistics. According to the invention, for each table for which statistics are collected, a counter is maintained of the number of updates that are made to columns for which statistics are being maintained. When the counter reaches a modification threshold, the system automatically updates the statistics associated with the table.
According to another aspect of the invention, prior to running an execution plan that is stored in the execution plan cache, the system checks that the statistics from which the execution plan was derived have not been updated since the execution plan was generated. If the statistics have been updated, the execution plan is discarded and a new execution plan generated.