Database partitioning improves the search efficiency of the database system in certain situations by avoiding the need to search an entire table. With database partitioning, an object, such as a database table, is divided up into sub-tables, referred to as “partitions”. The particular partition in which a row or object is stored depends on values in one or more columns of the table. These columns are referred to herein as partition keys and the values contained in them are referred to as partition key values.
The most common form of partitioning is referred to as range partitioning. With range partitioning, each individual partition corresponds to a certain range of partition key values. For example, a partition key of a table may store date values that fall within a certain year, and the table may be divided into twelve partitions, each of which corresponds to a month of that year. Rows with a date value that fall within a certain month are stored in the partition that corresponds to that month.
Another form of partitioning is referred to as hash partitioning. In hash partitioning, one or more values in partition keys are applied to a hash function to produce a hash value. A separate partition is established for a hash function value produced by the hash function, and rows having a partition key value that evaluates to that hash function value are stored within the partition.
Statistics are often kept pertaining to data stored in a database. In partitioned tables, statistics may be kept at both the partition level (“partition statistics”) and the table level (“global statistics”). An example use for these statistics is a cost-based query optimizer; however, other uses exist. A cost-based optimizer chooses an optimum execution plan for an SQL statement based on statistics about the data accessed by the statement. It is therefore crucial for the quality of the execution plan that statistics accurately reflect the state of the database objects.
Both partition and global statistics are important to a query optimizer. Partition statistics, which apply to a particular database partition, are useful because of a technique called “partition pruning.” Given a query, the query optimizer may eliminate (i.e., prune) unnecessary partitions from access by the SQL query. For example, consider the example table “line_item” that is partitioned on a “ship_date” column so that line items that are shipped in the same month fall into the same partition. Further, consider the query:
select * from line_item where ship_date between ‘2006-06-01’ and ‘2006-06-30’ and tax>10000
The above query will be issued only on the partition that contains all the line items shipped in June 2006. Partition pruning reduces the volume of data accessed and can lead to substantial performance improvement. If a query is found out to be pruned down to a single partition, statistics on this particular partition can further help the optimizer to find a good execution plan. In the previous example, suppose there is an index on a tax column in all partitions. In this case, the query optimizer can rely on partition statistics on the selectivity of tax>10000 in the June 2006 partition to choose between a full-scan and an index-scan on the partition.
A particular statistic of interest is the number of distinct values (NDV) in a portion of a table. To determine the NDV, most commercial database systems read rows from the table and count the NDV by performing an expensive aggregate operation. The aggregate operation involves sorting or hashing the input rows. The database system minimizes the resources by performing the aggregation on a subset of rows produced by sampling of the data. The count computed from the sample is then scaled up to represent the entire population. The sample size required for accurately estimating the NDV depends on the data distribution of the columns.
Two strategies for sampling the data include row sampling and block sampling. Row sampling reads rows without regard to their physical placement on disk. This provides the most random data for estimates, but can result in reading more data than necessary. For example, a row sample might select one row from each block, requiring a full scan of the table or index.
Block sampling reads a random sample of blocks and uses all of the rows in those blocks for estimates. This reduces the amount of I/O activity for a given sample size, but it can reduce the randomness of the sample if rows are not randomly distributed on disk. Reducing the randomness can affect the quality of the estimate of number of distinct values.
Both row and block sampling techniques produce inaccurate statistics for some tables. Some database systems recommend collecting statistics for those objects manually using a 100% sample and leave those objects untouched by automatic statistic collection. However, the statistic collection on large sample is expensive. For large tables, some database systems recommend using a 1% sample. Statistics collected using such a small sample can be very inaccurate.
One problem with determining an NDV is determining an adequate sample size. One possible solution for automatically determining an adequate sample size is an iterative approach. Initially, a small sample is taken for each column. Based on statistical analyses, a determination is made as to whether the sample size for each column is large enough. For each column needing a larger sample, the sample size is increased. However, in the next iteration of sampling, the information collected is restricted to only those columns which need the larger sample.
Based on all of the foregoing, it would be desirable for improved techniques for deriving database statistics.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.