Frequent-value statistics are among the most commonly required statistics in query optimization. For a fixed number N where N>=1, the N most frequent values in a column consist of the data value having the highest frequency (that is, number of duplicates), the data value having the second highest frequency, and so forth, down to the data value having the Nth highest frequency. The corresponding frequent-value statistics consist of these “N” data values, together with the frequencies of these data values in the column. The query optimizer component of a database system uses the frequent-value statistics, in conjunction with other statistics such as column cardinality, to estimate the number of rows in a column that satisfy an equality condition or range predicate in order to select the least expensive access plan for the query. For example, consider a database table T with a single column C for which the frequency of the data values is as follows:
Data ValueFrequency12233404451
Suppose that the following SQL query is issued: SELECT*FROM T WHERE C=3. According to the Data Value and Frequency table above, exactly 40 rows will satisfy the predicate ‘C=3’. If the frequent-value statistics are not available (or not accurate), a database query optimizer will only use the total number of rows in the table and the column cardinality (also known as the number of distinct values in the column) statistics to estimate the number of rows that would satisfy the predicate ‘C=3’. In this case, there are 50 rows in the database table T (i.e., the sum of the frequencies 2, 3, 40, 4 and 1 from the Data Value/Frequency table above) and there are 5 distinct values in column C, namely 1, 2, 3, 4 and 5 (according to the Data Value and Frequency table above). Therefore, the number of rows that satisfy the predicate ‘C=3’ is estimated as 50 divided by 5 equals 10, an error of negative 75%. If the frequent-value statistics based on only the most frequent value (that is, N=1) are available, the number of rows that satisfy the predicate ‘C=3’ would be estimated as 40, without any error. This is because the value ‘3’ is among the frequent value-statistics available to the database query optimizer and therefore its number of occurrences can be used to determine the number of rows that satisfy the predicate ‘C=3’.
To accurately compute the N most frequent-value statistics, a database system typically would need to perform at least two sort operations. The first operation is to sort the column values themselves, and the second one is to sort the column values according to their frequencies. If there is an index defined on the column then the first soft operation is not necessary because the database system can use the order that the index inherently provides. Given that sort operations are very expensive, a number of algorithms and techniques have been proposed for obtaining approximations to the frequent-value statistics in the context of traditional or non-partitioned databases.
On the other hand in a partitioned relational database, such as a share-nothing parallel database management system where the table data is partitioned across several shared-nothing processing nodes, computing the frequent-value statistics efficiently is more challenging, whether or not there is an index defined on the column. A scheme that simply computes the N most frequent values on each processing node and then sorts the per node results to obtain the overall N most frequent values (i.e., across all nodes) can lead to wrong results.
For example suppose that a particular column value is the (N+1)th most frequent value on a node A and is, say the (N+2)th most frequent value on a node B. Further suppose that the overall (from node A and from node B) number of occurrences of this column value is such that it becomes one of the overall N most frequent values (i.e., across all nodes). Obviously, this column value would have been missed by the above scheme because it only computes the N most frequent values on each processing node. The above scheme can surely lead to correct results only when the column for which the N most frequent values are being computed is the single partitioning key of the database table. This is because the same column value can not be present on more than a single processing node if the column is the single partitioning key of the database table.
Another technique used to compute the most frequent values in a column in a partitioned relational database environment is sampling. For example a node may be selected as being representative of all nodes and frequent value statistics for that selected node calculated and used as representative of all nodes. The difficulty with a sampling technique is the accuracy of the result diminishes with the sample size. In order to improve the quality of the computation the sample size has to be increased to larger size. The increase in sample size leads to a requirement for more processing time and associated resource consumption. Sampling may also lead to erroneous results when the sample data selected is atypical of the data value distribution across nodes. As a result the estimated frequent value statistics may vary considerably from the actual value.
It is therefore desirable to have a technique for computing the most frequent values in a column in a partitioned relational database which is efficient and provides a reliable estimate of the frequent values across all nodes in which the table data is stored.