The present invention is directed to an improvement in computing systems and in particular to an improved system for the estimation of column cardinality in a partitioned relational database.
In relational database management systems (RDBMS) it is common to utilize query optimizers to improve the efficiency of processing of queries to be carried out on a relational database. One of the most commonly required statistics for such query optimisation is the column cardinality for a table in a relational database. Column cardinality is the number of distinct values contained in a column in the table of the database. In a serial database, column cardinality may be calculated relatively easily if an index is defined on the column being considered. Where there is no index on a column for which the cardinality is to be calculated, a sort of the values in the column is required to provide an exact measurement. Such sorting is an expensive operation and is not desirable in a database system. Because query optimisers do not require exact statistics to function effectively, a good approximation of column cardinality is sufficient in most cases to obtain a good query plan.
There are a number of techniques known in the prior art to obtain approximations for column cardinality without requiring the sorting of column values in a database table. Examples of such prior art techniques include sample counting, linear counting, and logarithmic counting. These techniques are described in Morton M. Astrahan, Mario Schkolnick, and Kyu-Young Whang, xe2x80x9cCounting Unique Values of an Attribute Without Sorting,xe2x80x9d Information Systems 12, 1(1987).
In a partitioned RDBMS, such as a share-nothing parallel database management system, tables may be partitioned across several nodes which do not share data. In such an environment it is potentially difficult to calculate column cardinality. The same value may occur in multiple nodes and therefore it is not possible to simply sum the column cardinality values for each node to obtain a table""s overall column cardinality value for the different nodes in the parallel database. One approach is used in the DB2 universal database (UDB) (trade-mark) in the parallel database environment. This approach relies on statistics for column cardinality being calculated on a single node. The node used will be treated as being representative of the data in the column across the different nodes in the partitioned database. In fact, the node may or may not be representative of the data as a whole. As a query is optimised, the overall column cardinality (across all nodes) is estimated using a known probabilistic formula. The column cardinality for the representative node, the number of table rows in that node, and the number of nodes across which the table is partitioned are used to estimate the overall column cardinality. There is overhead involved in such an approach, and the approach is also limited where the node used to represent the data as a whole is in some way atypical of the data value distribution. As a result the estimated overall column cardinality using this approach may vary considerably from the actual value.
It is therefore desirable to have a technique for estimating the cardinality of a column in a partitioned relational database table which is efficient and which provides a reliable estimate of the column cardinality across all nodes in which the table data is stored.
A system, method and computer readable medium for estimating a column cardinality value for a column in a partitioned table stored in a plurality of nodes in a relational database is disclosed. According to one embodiment of the present invention, a plurality of column values for the partitioned table stored in each node are hashed, and a hash data set for each node is generated. Each of the hash data sets from each node is transferred to a coordinator node designated from the plurality of nodes. The hash data sets are merged into a merged data set, and an estimated column cardinality value for the table is calculated from the merged data set.
Advantages of the invention include an efficient technique for providing a reliable estimate of column cardinality in a partitioned relational database.