1. Field of the Invention
This invention relates to a database management system, and more particularly, to management of database statistics in a database management system.
2. Description of the Related Art
Relational database management systems allow large volumes of data to be stored and accessed efficiently and conveniently in a computer system. In a database management system, data is stored in database tables which organize the data into rows and columns. FIG. 1 depicts an exemplary database table 40 which has rows 42 and columns 44. To more quickly access the data in a database table, an index may be generated based on one or more specified columns of the database table.
The relational database management system responds to user commands to store and access the data. The user commands are typically Structured Query Language statements such as SELECT, INSERT, UPDATE and DELETE, to select, insert, update and delete, respectively, the data in the rows and columns. The SQL statements typically conform to a SQL standard as published by the American National Standards Institute (ANSI) or the International Standards Organization (ISO). The SQL statements are also referred to as queries.
In some relational database management systems, a query processor receives a query. A query optimizer evaluates various alternative access paths for the query. Some query optimizers utilize a cost based optimization technique to select the most efficient access path. The query processor executes the query using the selected access path. The query optimizer uses a set of database statistics to evaluate an access path. The database statistics comprise information about the state of the data in the relational database management system. The relational database management system typically provides one or more tools to collect the database statistics.
In cost based optimization, the optimizer estimates the cost of the available access paths and selects the access path with the lowest estimated cost, which may be in time, resources, or a combination thereof. One of the challenges for cost based optimization is obtaining an accurate estimate of the cost of the access paths. If the cost of the access paths cannot be estimated accurately, then sub-optimal and unstable access paths may be selected because the access path with the lowest actual cost may not have the lowest estimated cost. Sometimes efficient and inefficient access paths have very similar cost estimates, and minor changes to the input to the optimizer can result in dramatic changes in execution time.
In a relational database management system, timely statistics information can improve performance. INSERT, UPDATE and DELETE statements, in addition to various utilities, may change the data in the database. As the data changes in the database, the performance of the database management system may degrade because the statistics do not reflect the current state of the database. Hence, the cost of the access paths may not be accurately estimated, inefficient access paths may be chosen, and the amount of time to process queries may increase. Therefore, database administrators periodically re-collect the database statistics.
Database administrators typically collect nominal database statistics, such as single column cardinality, on all tables, columns of the tables and indexes. Collecting database statistics on all tables, columns and indexes can consume a large amount of time. However, database administrators typically do not have the time and/or expertise to analyze the queries to determine what tables, columns, column groups and indexes are most important and what database statistics would be most beneficial to improve performance.
Sometimes database statistics are collected for columns which have not been referenced in any SQL statement. When columns are not used as predicates or aggregates within a SQL statement, statistics on those columns are useless. Collecting database statistics that are not used wastes both processor and storage resources and consumes a large amount of processor time. Processor time is expensive and it is very expensive to spend valuable processor time collecting statistics rather than running business applications. Indeed, the expense of collecting database statistics for all columns and indexes often results in re-collecting statistics less frequently than would be advantageous to maintain desirable performance levels.
Various statistics collection tools examine real-time statistics to trigger re-collection. For example, one statistics collection tool examines the growth in table size to trigger re-collection. If the table grows by, for example, fifteen percent, then database statistics are re-collected. Another statistics collection tool monitors the number of INSERT, UPDATE and DELETE statements that were performed on the table. If the number of INSERT, UPDATE and DELETE statements exceeds a threshold, then re-collecting database statistics is recommended. Some statistics collection tools examine the system throughput and schedule database statistics re-collection at off-peak times with a very low scheduling priority. However, these statistics collection tools do not address what statistics to collect, and the statistics re-collection can remain inefficient because the re-collection consumes a large amount of time and ineffective because the re-collection may not be performed sufficiently often to reflect the state of the database.
Database management systems typically provide a command to collect statistics about the database for use by the query optimizer. As the database is used, the database statistics should be updated periodically to reflect the current state of the database. Typically, database administrators initially collect an initial set of all database statistics to provide a sound foundation of database statistics for the optimizer to accurately estimate the cost of the access paths for queries. Many database administrators will collect this complete set of statistics once, and then scale back the amount of statistics which are re-collected due to the central processing unit (CPU) cost of executing a utility to collect the database statistics.
The data stored in a database typically changes over time. If the database statistics are not re-collected, then the information they provide can be inaccurate and lead to selecting inefficient access paths. After a sound foundation of database statistics is collected, the database statistics should be re-collected periodically. Therefore, it would be desirable to determine when to re-collect the database statistics to ensure continued efficient performance.
Therefore, there is a need for an improved technique for managing statistics in a database management system. This technique should address when database statistics are re-collected.