Database management systems (DBMS), including both hierarchical and relational DBMS, receive a database query from a user and return results to the user. Relational database management systems (RDBMS) such as Oracle®, IBM DB2®, and Microsoft SQL server®, among others, are no exception. A DBMS receives a database query from a user and uses the search criteria provided in the query to find and return results to the user. These results may be actual data or statistical information about the data. For example, a query may return all records/rows concerning transactions initiated by a customer—actual data in the database—or may return the number of transactions initiated by a customer—statistical information about the data.
Queries are generally formulated using a standardized query description language. For example, Structured Query Language (SQL) is a widely used standardized query description language that many DBMS use. Queries may be submitted to a DBMS from diverse types of users. For example, a person submitting queries to the DBMS using a native DBMS tool may be one such user. In this example, a user may be provided with a SQL statement editor allowing immediate execution of SQL statements on a database by the DBMS. In another example, a script file may connect to the DBMS and fire SQL statements against a database. In this example, the external script file is the user. In a third example, an external application may translate a user interface (UI) action into SQL statements that are sent to the DBMS with the external application receiving and translating the results from the DBMS into an appropriate representation on the external application UI. These examples are indicative of the broad range of users—e.g., individuals, scripts, and software applications—that may submit queries to a DBMS.
DBMS generally use an optimizer to facilitate execution of a query. The optimizer calculates the most efficient way or more efficient ways to retrieve and access the data stored in a database. There are many constraints that effect the way in which an optimizer makes these calculations. For example, the optimizer may consider whether one or more indices exist that can be used to reduce the time and resources needed to retrieve the queried data. The value of these indices in expediting the query is relative to their selectivity. For example, the greater the ratio of the number of records/rows filtered or sampled by the index to the total number of records/rows in the table (1:5 being greater than 1:10), the lower the selectivity of the index and the less utility provided by using the index. On the other hand, the lower the ratio of filtered records/rows to total records/rows, the higher the selectivity and the greater the utility to the optimizer in using the index in conducting the query. In another example, value distribution information for data in a table of a database may be used to expedite a query. Value distribution information loses its importance as it becomes less current as a result of changes made to the table data.
In general, two main types of database optimizer exist—Rule Based Optimizer (RBO) and Cost Based Optimizer (CBO). RBO use heuristic rules in determining the best method to access the queried data. CBO uses statistical information about the table data and the corresponding table indices in determining the best method to access the queried data. RBO do not typically rely on statistical information and table indices and, therefore, improvements in the collection of such information generally do not improve the performance of RBO. On the other hand, CBO performance is directly related to the quality of this statistical information and the table indices and the frequency with which they are updated.
Statistical information about table data becomes increasingly obsolete as changes are made to a database table. The degree of obsolescence is related to the frequency of change to the database table data and the amount of elapsed time since the last updating of the statistical information. For this reason, it is important for the statistical information to be updated regularly in order to maintain adequate CBO performance. The process for updating statistical information is usually scheduled by the database administrator who weighs the performance cost to the database management system during the update of the statistical information with the need to maintain adequate CBO performance. A typical result of obsolete statistical information is the CBO initiating a more resource or time intensive retrieval of the queried information thereby decreasing query performance and potentially impacting the response times of other users also connected to and executing queries on the database. Restating this in terms of cost, a CBO using obsolete statistical information may execute more expensive queries.
A database management system will typically include functionality to update the statistical information and indices for database tables. However, creating or updating this statistical information may considerably tie up database resources in a resource expensive manner. Typically, the resource expenses associated with updating statistical information for larger tables are greater and more apparent than for smaller tables. As previously stated, the process for updating statistical information is usually scheduled by the database administrator who weighs the performance cost to the database management system during the update of the statistical information with the need to maintain adequate CBO performance. The updating of the statistical information may have a large cost resulting in significant degradation in database management system services during the period of the update. For this reason, the updating of statistical information is generally scheduled for periods of limited user activity on the database.
Compounding the difficulty in this scheduling are situations where a database and database management system serve users requiring more consistent twenty-four hour access. Under these circumstances, it is often difficult or impossible to schedule the updating of statistical information without impacting other users. In order to minimize the impact on users, the database management system may not use all the records/rows in a table when determining table statistics and may instead use a sample from the records/rows in the table. The use of samples expedites the updating of the statistical information reducing the impact on users. However, the use of samples may result in less accurate statistical information resulting in more expensive CBO queries. In other words, the quality of the statistical information generated using samples can not be guaranteed and may not be as accurate.
Significant performance improvements can be achieved if statistical information is updated using the full table data or otherwise in a manner avoiding the expense to the database management system and the impact on database users that current updating of statistical information poses.