To ensure generation of an efficient query plan, a database management system typically includes a query optimization software module. The query optimization software module generates search plans for query requests based on optimization rules that consider, among many variables, the size of the response set (amount of data expected to be returned) and the number of unique values within the data being queried.
Current, accurate database statistics are highly desired by performance sensitive query software modules (for example, a query optimizer) of database management systems. When database statistics are inaccurate or no longer current, the query optimizer is most likely to generate inefficient query plans. Inefficient query plans perform poorly at run time degrading overall performance of the database management system.
In the following simple examples the actual cardinality value and the estimated cardinality value are equivalent, which may not be the typical case.
Column cardinality is an example of an important database statistic. Column cardinality is a value representing the number of distinct values found in a database column. For example, the column cardinality of the NAME column in Table 1A of FIG. 2A is three, because there are three unique values (names) in the column. Column cardinality statistics (number of unique values within a set of values) are among some of the most commonly required statistics used by a query optimization software module. Column cardinality statistics are used in conjunction with other statistics to compute query plan resource consumption estimates which are then used in determining the most efficient plan for a given query.
Typical database systems estimate column cardinality because exact computation is too expensive in terms of time and resource and due to the dynamic nature of the value. In many commercial database systems, statistics collection is a batch operation, allowing the statistics processing workload to be scheduled based on database performance needs. Typically, database statistics represent a snapshot of the data only for the instant the statistics are generated. As a database table changes, previously computed statistics, associated with that table, may no longer be representative of the current state of the table and its indexes. Statistics for a table should be recalculated to reflect the current state of the table.
It is typically a simple task to incrementally maintain basic statistics such as the total number of rows in a table. However, column cardinality estimates are usually more difficult to maintain. Current techniques for maintaining estimates of column cardinality require rescanning the entire column as the data in that column changes. For example, assume a column “C” initially contains twenty distinct values and subsequently thirty additional distinct values are inserted. Computing a new estimate of column cardinality for column “C” by simply adding the cardinality of the two data sets may lead to an incorrect result due to duplicate entries. An allowance has to be made for duplicate values within the combined set of old and new values.
For example, consider Table 1A (FIG. 2A), EMPLOYEE, containing employee information which has two columns, one being NAME and the other SALARY.
The column NAME has the following four values: “smith”, “jones”, “doole”, and “smith”. Observe that the value “smith” is repeated twice, therefore, the estimated column cardinality value of column NAME is equal to “3”. This is because there are only three distinct values namely, “smith”, “jones”, and “doole”. On the other hand the estimated column cardinality value of the column SALARY is equal to “4” because all values in this column are distinct.
A typical database system has at least one method of computing column cardinality estimates. Typical examples of such methods are Linear counting and Logarithmic counting, both of which are well known to persons of ordinary skill in the art. When a database system computes an estimate of column cardinality, the database system stores the estimated value in what is known as a system catalog table for future use by the database system. For example, in a typical database system, the system catalog table that stores the column cardinality estimates for data in Table 1A of FIG. 2A may be as represented by Table 1B of FIG. 2B.
Table 1B (FIG. 2B), shows the column cardinality estimate of the column NAME belonging to table EMPLOYEE is “3” and the column cardinality estimate of the column SALARY belonging to table EMPLOYEE is “4”.
In a typical database system, a query optimizer (a component of the database system) then obtains the values of the column cardinality estimates from a database system catalog table. The query optimizer then uses the column cardinality values along with other statistics when computing the resource consumption estimates for a given set of database queries.
Consider once again the table Employee shown in Table 1A (FIG. 2A). A new row is added to Table 1A (FIG. 2A) to represent an employee called “harris” with a salary “200,000”. The updated table Employee is now depicted in Table 2A of FIG. 2C.
Computing a new column cardinality estimate value for the column NAME for example, according to current methods applied by database systems, requires reading all the NAME values; smith, jones, doole, smith, and harris.
After computing the cardinality estimates for both columns of the recently updated table in Table 2A (FIG. 2C), the database manager updates the system catalog table, as now shown in Table 2B of FIG. 2D.
The content of the database system catalog table in Table 2B (FIG. 2D) indicates that the new column cardinality estimate of the column NAME in table EMPLOYEE is “4”and the new column cardinality estimate of the column SALARY in the same table is “4” as well. Inserting the new row (“harris”), caused the column cardinality estimate of column NAME to change from “3” to “4”. The column cardinality estimate of column SALARY did not change, remaining at “4” because the added salary value “200,000” is a duplicate of an existing salary value. In practice, where columns may comprise millions of rows, computing the column cardinality exactly requires too much time and computer memory, and so the column cardinality must be estimated.
In a “linear counting” technique used to estimate column cardinality, a hashing function is used to transform each data value to a bit string that represents a position in a bit map. Initially all positions in the bit map contain zeros. Hashing a data value to a position in the bit map causes a value at the position to be set to one. A value of zero in a position of the bit map indicates there were no data values that hashed to that position. A value of one in a position in the bit map indicates that at least one data value hashed to that position. All data values in the column are hashed, and then the number of distinct data values is estimated by counting the number of bits in the bit map that are set to one (non-zero). Not all data values that hash to the same address are necessarily duplicates. Distinct values that hash to the same location are collisions. Collisions require the count to be adjusted using an adjustment formula, derived using probabilistic techniques (known to those of ordinary skill in the art).
Another well known estimation technique is “logarithmic counting.” Although more complicated than linear counting, the bit map used by logarithmic counting is much smaller than that used by linear counting. In logarithmic counting, a hashing function is used to transform each data value to a bit string, which is then further transformed by retaining the leftmost “1” position and converting all remaining positions to “0”. All data values in the column are doubly transformed to bit strings in this manner, and a final bit map is computed as the bit-wise “OR” of all the bit strings for the column. The leftmost “0” position in the final bit map is then computed, and this relative position number n is used to calculate the estimated cardinality value using the formula 2n/q, where “q” is set to a value “0.7735”. This latter expression is a well known probabilistic estimation formula that ensures statistically reasonable results.
As can be seen from the foregoing discussion, current cardinality estimation techniques typically require all data values in a column to be obtained each time a new estimate of that column's cardinality is requested. The detailed example shown previously illustrates a case where a new value was added, but the same requirement also holds for cases where an existing value is deleted or updated. The presence of duplicate entries in columns of data make it difficult to maintain the column cardinality estimates each time data values change. It is apparent, from the examples described, that there is a need for improved efficiency, or expediency or both regarding generation of database management systems statistics.