1. Field of the Invention
This invention relates generally to database management systems, and more particularly to collecting statistics in database management systems.
2. Description of the Related Art
Relational database management systems allow large volumes of data to be stored and accessed efficiently and conveniently in a computer system. In a database management system, data is stored in database tables which organize the data into rows and columns. The relational database management system responds to user commands to store and access the data. To more quickly access the data in a database table, an index is generated based on one or more specified columns of the database table.
The user commands are typically Structured Query Language statements such as SELECT, INSERT, UPDATE and DELETE, to select, insert, update and delete, respectively, the data in the rows and columns. The SQL statements typically conform to a SQL standard as published by the American National Standards Institute (ANSI) or the International Standards Organization (ISO).
In some relational database management systems, a query processor receives a SQL statement. A query optimizer evaluates various alternative execution plans for the SQL statement and typically selects an execution plan having a lowest estimated cost, which may be in time, resources, or a combination thereof. The query processor executes the selected execution plan. The query optimizer uses a set of database statistics to evaluate an execution plan. The database statistics comprise information about the state of the data in the relational database management system. The relational database management system typically provides one or more tools to collect the database statistics.
In a relational database management system, timely statistics information can improve performance. INSERT, UPDATE and DELETE statements, in addition to various utilities, may change the data in the database. As the data changes in the database, the performance of the database management system may degrade because the statistics do not reflect the current state of the database. Hence, the amount of time to process queries may increase. Therefore, database administrators periodically re-collect the database statistics.
Database administrators typically collect nominal database statistics, such as single column cardinality, on all tables, columns of the tables and indexes. Collecting database statistics on all tables, columns and indexes can consume a large amount of time. However, database administrators typically do not have the time and/or expertise to analyze the queries to determine what columns and column groups are most important and what database statistics would be most beneficial to improve performance.
Sometimes database statistics are collected for columns which have not been referenced in any SQL statement. When columns are not used as predicates or aggregates within a SQL statement, statistics on those columns are useless. Collecting database statistics that are not used wastes both processor and storage resources and consumes a large amount of processor time. Processor time is expensive and it is very expensive to spend valuable processor time collecting statistics rather than running business applications. Indeed, the expense of collecting database statistics for all columns and indexes often results in re-collecting statistics less frequently than is otherwise desirable to maintain desirable performance levels.
Various statistics collection tools examine the real-time statistics to trigger re-collection. For example, one statistics collection tool examines the growth in table size to trigger re-collection. If the table grows by, for example, fifteen percent, then database statistics are recollected. Another statistics collection tool monitors the number of INSERT, UPDATE and DELETE statements that were performed on the table. If the number of INSERT, UPDATE and DELETE statements exceeds a threshold, then recollecting database statistics is recommended. Some statistics collection tools examine the system throughput and schedule database statistics re-collection at off-peak times with a very low scheduling priority. However, these statistics collection tools do not address what statistics to collect, and the statistics re-collection can remain both inefficient because the re-collection consumes a large amount of time and ineffective because the re-collection may not be performed sufficiently often to reflect the state of the database.
Therefore, there is a need for an improved technique for statistics collection in a database management system.