1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to selectivity prediction with compressed histograms in a parallel processing database system.
2. Description of Related Art
Relational DataBase Management Systems (RDBMS) are well known in the art. In an RDBMS, all data is externally structured into tables. A table in an RDBMS is two dimensional, comprising rows and columns. Each column has a name, typically describing the type of data held in that column. As new data is added, more rows are inserted into the table. A user query selects some rows of the table by specifying clauses that qualify the rows to be retrieved based on the values in one or more of the columns.
Structured Query Language (SQL) statements allow users to formulate relational operations on the tables. One of the most common SQL statements executed by an RDBMS is the SELECT statement. The SELECT statement generally comprises the format: xe2x80x9cSELECT  less than clause greater than  FROM  less than clause greater than  WHERE  less than clause greater than  GROUP BY  less than clause greater than  HAVING  less than clause  greater than  ORDER BY  less than clause  greater than .xe2x80x9d The clauses generally must follow this sequence, but only the SELECT and FROM clauses are required.
Generally, the result of a SELECT statement is a subset of data retrieved by the RDBMS from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table.
The WHERE clause determines which rows should be returned in the result table. Generally, the WHERE clause contains a search condition that must be satisfied by each row returned in the result table. The rows that meet the search condition form an intermediate set, which is then processed further according to specifications in the SELECT clause.
The search condition typically comprises one or more predicates, each of which specify a comparison between two values comprising columns, constants or correlated values. Multiple predicates in the WHERE clause are themselves typically connected by Boolean operators.
Database statistics can be used by the RDBMS to optimize the execution of the SQL statements. For example, statistics may be used to estimate cardinality for a selection of records based on the search condition specified for one or more of the columns of the records. Intuitively, the more intervals in the statistics for the column at issue, the higher the accuracy of the cardinality estimation.
This can be argued from two extremes. At one extreme, there is only one interval. These xe2x80x9cstatisticsxe2x80x9d return an averaged number of rows per value as the estimate for all xe2x80x9csimple selectionsxe2x80x9d and this estimate remains the same for all values. This is the worst kind of statistics (though better than no statistics).
At the other extreme, there exists one interval for each distinct value in the column. This provides the perfect cardinality xe2x80x9cestimation.xe2x80x9d However, this approach is impractical as the number of intervals can become arbitrarily large. So, a trade-off must occur. And obviously, there should be as many intervals as possible, such that the frequencies within a single interval are more uniform and at the same time, the storage overhead must be acceptable.
Therefore, it makes sense to group column values that have roughly equal frequency in the same equal-height interval. The smaller the variation among frequencies in the same interval, the better the cardinality estimation. In the extreme, of course, if there is only one value (thus one frequency in the interval, which provides a 100% accurate estimation for that particular value (which is called a loner). And, this is exactly the idea behind high-biased intervals.
High-biased intervals store explicit column values and frequencies, so that a 100% estimation accuracy is obtained for these loners. Moreover, the rest of the column values can be made more uniform, if the column values with highest frequencies are removed from the equal-height intervals and put into high-biased ones. This way, not only do loners receive perfect estimation, but non-loners also benefit from increased uniformity.
Using the above concepts, the present invention has developed various techniques for optimizing the performance of an RDBMS using improved database statistics. The specifics of these improvements are described in more detail hereinafter.
The present invention discloses a method, apparatus, and article of manufacture for generating statistics for use by a relational database management system. A global aggregate spool is generated for each of a plurality of partitions of a subject table that are spread across a plurality of processing units of a computer system. Each of the global aggregate spools is scanned to generate summary records. The summary records are then merged to generate interval records for a compressed histogram of the subject table, wherein the compressed histogram includes both equal-height intervals and high-biased intervals. The compressed histogram can then be analyzed to estimate the cardinality associated with one or more search conditions of a user query or other SQL statement. Compared to a strictly equal-height histogram, the compressed histogram allows the relational database management system to more accurately estimate the cardinality associated with various search conditions. As a result, the relational database management system can better optimize the execution of the user query.
An object of the present invention is to collect improved database statistics. Another object of the present invention is to improve the performance of relational database management systems using the improved statistics.