1. Field of the Invention
This invention relates to database query optimization and more particularly relates to accurately estimating the sizes of intermediate query results for use in query optimization.
2. Description of the Related Art
Business Intelligence or On-line Analytical Processing (OLAP) applications are often carried out on data stored in data warehouses. These involve complex queries and very large databases, and the optimization of query access plans improves the efficiency of these database systems.
Star schema is perhaps the most commonly used model for databases in data warehouses. At the center of a star schema is typically a very large fact table containing the individual facts which constitute the primary information being stored in a database. Queries of the database often join the fact table to a number of much smaller dimension tables through primary-key/foreign-key relationships. The fact table typically has many millions of rows, while the dimension tables may have several hundred rows. Each of the dimension tables contains detail information about a particular attribute in the fact table.
The task of transforming a declarative query into a physical query execution plan (QEP) in a database system is usually left to a query optimizer. The query optimizer chooses an optimal plan based on cost estimates of various candidate plans. Accurate estimation of the sizes of intermediate query results, or cardinality estimation, is needed for accurate plan costing in query optimization. The common practice in current commercial database systems is to derive the cardinality estimates from base-table statistics. However, this approach often suffers from simplifying unrealistic assumptions that have to be made about the underlying data, such as when different attribute values are independently distributed.
There have been proposals for exploiting statistics on query expressions to improve the accuracy of cardinality estimation. One way to compute such statistics is for the views that represent intermediate results of join instructions to be fully materialized or generated and statistics collected on the materialized view. A major drawback of this method is its high computation cost, particularly when tables are large and joins are involved. Because generating the materialized view requires a full execution of the query expression, the computation can be very expensive. Although this cost can potentially be amortized over many queries whose optimization can utilize the statistical view, in cases where the underlying data is updated frequently, the cost of re-computing the view and then updating the statistics can be prohibitive.
Furthermore, the statistical views are often defined without any specific query in mind. Therefore, the set of statistical views must be carefully selected to ensure their value with regard to the optimization of prospective queries.
As an example, in a data warehouse scenario, traditional estimation techniques used to compute the intermediate results between a fact table and a dimension table could be significantly inaccurate when there is skew in the values in the join column of the fact table or significant difference in the values in the dimension table join column compared to the fact table join column. Thus, a need exists for improving query optimization within a database management system by accurately estimating the sizes of intermediate query results for joins between the fact table and the dimension table. One way to solve this is to make use of a predefined statistical view that has the relevant statistics collected periodically. Another way is to have an appropriate statistical view dynamically generated on-the-fly during query optimization and have the statistics collected. In both these situations, there is a need to collect the statistics very quickly. For the on-the-fly generated statistical view, the need for fast computation of the statistics facilitates minimizing the cost of executing the query.