(1) Field of the Invention
The present invention relates to the field of computer systems used for storage and retrieval of data. More specifically, the present invention relates to the field of statistics measurement systems for a relational database management system.
(2) Prior Art
Computer implemented relational database management systems (RDBMS) are well known in the art. Such database systems commonly employ data tables that contain columns and rows containing data (e.g., data values). A typical RDBMS, in addition to maintaining the data of a particular database, also maintains a set of statistics regarding the data. These statistics are useful in efficiently accessing, manipulating, and presenting the stored data.
When an RDBMS system receives a query, its optimizer analyzes the structure of the query, analyzes the various clauses (e.g. selection and join predicates) specified in the query, and examines existing data access paths (e.g. indexes) to formulate a strategy (e.g., method) of performing various relational operations (e.g., aggregation, sort, search, join, etc.) to produce the result of the query. The optimizer generally explores various strategies to find the best strategy. A best strategy is the one with lowest estimated RDBMS cost, and such strategy generally takes least amount of RDBMS resources or least amount of time or both to produce the result of the query.
RDBMS cost is generally evaluated based on input/output (I/O) operations that are required to perform necessary relational operations to produce the result of the query. Therefore, the optimizer selection criterion for the best strategy for a query is a minimum execution cost. The minimum execution cost is calculated using available statistics such as table and index cardinalities, workload statistics (e.g. statistics on columns and column groups), storage statistics and assumptions of the manner in which any relational operation typically changes the incoming data size and data distribution.
Prior RDBMS systems collect workload statistics based on single columns of data. As such, these systems make zero correlation assumption that the values of two or more columns of data are not related in any way. However, in many cases, this zero correlation assumption is incorrect. For example, in an example data table having a first column of employee age and second column of employee job position, it is possible that in general the older employees can hold higher job positions. As such, the row values in the first column correlate to the values in the second column.
In order to estimate the cost of a strategy for a query on an example employees data table asking for all employees with a certain age (e.g., 35) and certain job position (e.g., 4), the prior art systems compute a result cardinality (result.sub.-- cardinality) based on the inverse of the distinct cardinality of the separate age column (DC1) multiplied by the inverse of the distinct cardinality of the separate income column (DC2) and this result is multiplied by the number of employees (#EE). The distinct cardinality of a column represents the number of distinct values within that column. See the computation of the result cardinality below:
result.sub.-- cardinality=(1/DC1*1/DC2)*#EE
The estimated result cardinality represents the average number of rows likely to be produced for a query asking for all employees with a certain age and a certain job position. However, the relationship used above by the prior art system assumes that there is no correlation between the first and second columns. This assumption leads to a computation of a result cardinality value that is much smaller than what is returned in reality due to data correlation. When the above result cardinality value is used by an optimizer, its determined estimated costs for performing certain relational operations on the subject data become inaccurate. Specifically, this inaccuracy can cause the optimizer to select a query resulting in worse performance in producing the result. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving more than one data column.
The presence of null data values within rows of a data table can lead to inaccuracies in the workload statistics, which will lead to inaccurate cost estimation analysis by the RDBMS optimizer. Most relational operations ignore incoming data rows that contain null data values in key columns. Therefore, it is important to also collect statistics about the number of rows with null data in certain columns or column groups. Prior art RDBMS systems that compute certain database statistics (e.g., distinct cardinality of a column) without regard to rows with null data produce inaccurate statistics. The inaccurate statistics can cause the RDBMS optimizer to select a sub-par strategy for a given query. Therefore, what is needed is a method for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for queries involving columns of rows with null data. The present invention provides such a system.
In prior RDBMS systems, a manual process is performed by users (e.g., system administrators) to determine which data columns of a database on which to collect statistics. In effect, in prior RDBMS systems, the user informs integrated statistics gathering procedures of the separate data columns on which workload statistics are to be gathered. However, it is often the case that the system administrators do not realize the specific mechanisms by which the RDBMS optimizer operates. Relying on users to indicate the data columns on which to collect statistics can be unreliable and inefficient. Namely, not knowing what is needed by the optimizer, the system administrator can cause the statistics gathering procedures to inefficiently over-collect statistics or fail to collect statistics on required columns. What is needed is an efficient method for identifying columns or column groups on which statistics collection is required. What is needed is such a method that does not rely on user origination of the above information.
Accordingly, the present invention provides a method and system for collecting statistics within an RDBMS system that provides for accurate cost estimation analysis for workload queries involving more than one data column. The present invention also provides a method and system for collecting statistics within an RDBMS system that provides for accurate estimated cost analysis for queries involving columns of rows with null data values. The present invention yet provides a method and system within RDBMS optimizer for identifying the columns or column groups on which statistics collection is required that does not rely on manual user identification of the column or column groups.