Relational database systems are important tools for storing, retrieving, and processing information. In order to retrieve information from a database, a user provides a query written in a query language such as Structured Query Language (SQL). The query specifies the information to be retrieved and, in some cases, the manner in which it is to be manipulated or evaluated in order to provide the desired result. Queries may contain requests to derive information by performing set operations on the tables, such as join, sort, merge, and so on. To process the query, the database system may convert the query into a relational expression that describes algebraically the result specified by the query. The relational expression is then used to produce an execution plan, which describes particular steps to be taken by a computer in order to produce the sought result.
Query optimizers are used to improve database performance. The role of a query optimizer is to take a query, as specified by the user in a high-level language, and generate an efficient execution plan for that query. To select an execution plan, the query optimizer picks the plan with the least anticipated execution cost from the possible candidate plans. Unlike other common programming languages, database query languages are declarative rather than procedural: a query states what result is wanted, not how to compute it. Thus, query optimizers can consider a very large number of execution alternatives, based on the size of the tables used, the data distributions, and existing indices or other access paths.
Selection of one specific execution plan to be used over other possible execution plans is based on the estimated execution cost of execution plans. Such estimation is based on the estimated number of rows that will be flowed in each step of the execution plan. Estimating the number of rows flowed in each step of an execution plan is commonly known as the “cardinality estimation problem.”
In conventional query optimization methods, cardinality estimation is performed by starting with statistics collected on base tables. Such statistics are typically gathered by executing special purpose queries or processes, which read all or part of a database table, perform some analysis, and store the results for later use by the query optimizer. Statistics gathering may be triggered automatically based on the columns used to execute a query.
For conventional query optimization, the size of the table and one or more histograms containing statistics about the values in the table may be used. For example, a table may have 600,000 rows. Each row may have a value for a specific variable. A histogram may show the number of rows which have a certain value for the variable. For a continuous variable, the histogram may divide the range for the continuous variable into subranges, and show the number of rows with a value for the variable in each subrange. Thus, when a query requests the rows with a value for the continuous variable above a specified number (in order to perform, for example, a join of those rows with another set of rows), it can be estimated how many data rows will be flowed (in the example, to the join).
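The histogram-based estimate described above can be illustrated with a minimal sketch. The function name, bucket boundaries, and row counts below are hypothetical and chosen only for illustration, and the sketch assumes that values are spread uniformly within each subrange, which is a common simplification rather than a requirement of any particular system.

```python
from bisect import bisect_right

def estimate_rows_above(boundaries, counts, threshold):
    """Estimate how many rows have a value greater than threshold.

    boundaries holds len(counts) + 1 ascending bucket edges; bucket i
    covers boundaries[i] up to boundaries[i + 1]. Values are ASSUMED to
    be uniformly distributed inside each bucket.
    """
    if threshold < boundaries[0]:
        return float(sum(counts))
    if threshold >= boundaries[-1]:
        return 0.0
    i = bisect_right(boundaries, threshold) - 1  # bucket containing threshold
    lo, hi = boundaries[i], boundaries[i + 1]
    # Fraction of the containing bucket that lies above the threshold.
    partial = counts[i] * (hi - threshold) / (hi - lo)
    return partial + sum(counts[i + 1:])

# Hypothetical histogram for a 600,000-row table.
boundaries = [0, 100, 200, 300, 400]
counts = [300_000, 200_000, 80_000, 20_000]

estimate = estimate_rows_above(boundaries, counts, 250)
print(estimate)  # estimated rows flowed to the next operator, e.g. a join
```

With these assumed figures, half of the third subrange and all of the fourth subrange lie above the threshold, so the estimate is 60,000 of the 600,000 rows.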
Thus, the statistics collected on base tables are used to estimate the number of data rows that will qualify for different data manipulation operators such as filter conditions, joins, and aggregations.
However, with this approach, arbitrarily large estimation errors can be introduced while deriving the number of qualifying rows through different operators. Errors grow as estimation is done on top of estimation, so that after several filters and joins, the estimated cardinality may be very far from the actual value. In addition, there are constructs that simply cannot be estimated based on statistics of base table columns. The standard approach when such constructs are encountered is to use a “guess” or “magic number,” such as a ⅓ data reduction factor for inequality comparisons and a 1/10 data reduction factor for equality comparisons.
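The effect of guessed reduction factors and of estimation on top of estimation can be sketched as follows. This is an illustrative sketch only: the function names and the row count are hypothetical, and the model in which each step is off by the same multiplicative factor is an assumption made to keep the arithmetic simple.

```python
# Guessed data reduction factors of the kind described above.
MAGIC_SELECTIVITY = {"inequality": 1 / 3, "equality": 1 / 10}

def estimate_with_guesses(table_rows, predicate_kinds):
    """Apply one guessed reduction factor per predicate, in sequence."""
    estimate = float(table_rows)
    for kind in predicate_kinds:
        estimate *= MAGIC_SELECTIVITY[kind]
    return estimate

def compounded_error_factor(per_step_factor, steps):
    """Worst-case overall error when every step is off by per_step_factor.

    ASSUMPTION: errors are multiplicative and all point the same way.
    """
    return per_step_factor ** steps

# One inequality and one equality predicate on a hypothetical 600,000-row table.
guessed = estimate_with_guesses(600_000, ["inequality", "equality"])

# If each of five stacked estimates is off by a factor of 2,
# the final cardinality may be off by a factor of 32.
overall = compounded_error_factor(2, 5)
print(guessed, overall)
```

The guessed estimate is the same for every table and every constant in the predicate, which is why the actual cardinality can differ from it by an arbitrary amount.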
This in turn introduces errors in the estimated cost of plans, which can lead to selecting execution plans with very poor performance. The quality of plans generated by the optimizer is tied to the accuracy of its cost estimation. Incorrect estimation may lead the optimizer to regard some plans as efficient, when in reality they are very expensive to execute. Just as effective optimization and good physical design can introduce dramatic performance improvements, selecting the wrong execution plan can lead to dramatic slowdowns.
Estimation errors may occur in several different situations. For example, where there is a predicate involving operations on multiple columns of the same table or scalar operations on one column (or multiple columns) of a table, an estimation error may occur. An example of the use of two columns from the same table can be seen in the following query, where “LINEITEM” is the table and L_EXTENDEDPRICE and L_DISCOUNT are two columns from that table:

    SELECT * FROM LINEITEM
    WHERE L_EXTENDEDPRICE > L_DISCOUNT

In this example, the cases (rows) from LINEITEM where the L_EXTENDEDPRICE column value is greater than the L_DISCOUNT column value are requested. For each row, this performs a comparison of the values in two different columns for the same row. Because a histogram describes the population of a column only collectively, there is no way to take two histograms describing the two columns (L_EXTENDEDPRICE and L_DISCOUNT) and determine what the number of rows might be for which L_EXTENDEDPRICE>L_DISCOUNT.
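The limitation of per-column histograms for a column-to-column comparison can be made concrete with a small sketch. The data values and the helper name below are hypothetical. The two data sets have identical value distributions in each column, so any per-column histogram would be the same for both, yet the number of rows satisfying the comparison differs.

```python
from collections import Counter

# Hypothetical column values; each position is one row.
prices = [10, 20, 30, 40]
discounts_a = [5, 15, 25, 35]   # every row has price > discount
discounts_b = [15, 25, 35, 5]   # same values, paired with different rows

def count_greater(left, right):
    """Count rows where the left column value exceeds the right one."""
    return sum(1 for l, r in zip(left, right) if l > r)

# Per-column distributions (and therefore histograms) are identical.
same_histogram = Counter(discounts_a) == Counter(discounts_b)

rows_a = count_greater(prices, discounts_a)
rows_b = count_greater(prices, discounts_b)
print(same_histogram, rows_a, rows_b)
```

Because the histograms cannot distinguish the two pairings, no estimate derived from them alone can be correct for both.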
As another example, a scalar operation on multiple columns from the same table may be seen in the following query, where, again, “LINEITEM” is the table and L_EXTENDEDPRICE and L_DISCOUNT are two columns from that table:

    SELECT * FROM LINEITEM
    WHERE L_EXTENDEDPRICE * (1 - L_DISCOUNT) > 900000

In this example, the cases (rows) from LINEITEM where the L_EXTENDEDPRICE column value times one minus the L_DISCOUNT column value is greater than 900,000 are requested. Although a histogram may exist for each of L_EXTENDEDPRICE and L_DISCOUNT, these histograms may not include enough information to determine the number of rows which may result from the query. Using very detailed information to try to estimate the distribution of the product of the two column values in the query is costly, and using less detailed information can easily introduce large errors.
Additionally, scalar operations may be used in a query. Because of the limited information on which estimates are being based, such scalar operations may lead to problems in forming estimations. For example, arithmetic modulo (%), conditional evaluation (CASE-WHEN-ELSE-END), and string operations such as concatenation and substring evaluation may not lend themselves to estimation with the data contained in histograms.
Additionally, errors may be introduced when the independence assumption is violated. As an example, the following query selects cases based on two column values, where CUSTOMER and NATION are tables and a “C_” prefix denotes a column in the CUSTOMER table and an “N_” prefix denotes a column in the NATION table:

    SELECT * FROM CUSTOMER, NATION
    WHERE C_NATIONKEY = N_NATIONKEY
    AND N_NAME = 'BRAZIL'

In this example, rows in the CUSTOMER table are selected for which the C_NATIONKEY corresponds to an N_NAME of “BRAZIL”. If an assumption is made that the C_NATIONKEY values for the rows in the CUSTOMER table are approximately evenly distributed among the possible values for C_NATIONKEY, it can be assumed that approximately 1/(number of possible values for C_NATIONKEY) of the rows will be selected. However, this assumption, that the distribution is independent of any other factors, may be incorrect.
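The gap between an even-distribution estimate and a skewed actual distribution can be sketched as follows. The key values, the row counts, and the choice of key 2 as the key matching the selected nation are all hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter

def uniform_estimate(total_rows, distinct_values):
    """Rows expected per key if keys were evenly distributed (an ASSUMPTION)."""
    return total_rows / distinct_values

# Hypothetical C_NATIONKEY values for 100 customer rows, heavily skewed.
nation_keys = [2] * 70 + [7] * 20 + [9] * 10
selected_key = 2  # hypothetical key of the nation named in the query

estimate = uniform_estimate(len(nation_keys), len(set(nation_keys)))
actual = Counter(nation_keys)[selected_key]
print(estimate, actual)
```

With three distinct keys, the even-distribution estimate is about 33 rows, while 70 rows actually match, so the estimate is off by roughly a factor of two in this assumed data.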
This can also be seen to be a problem in the earlier example query involving L_EXTENDEDPRICE and L_DISCOUNT. One way to derive an estimate is to assume that the distribution of L_EXTENDEDPRICE values, as shown in the histogram corresponding to that column, is independent of the distribution of L_DISCOUNT values. However, this may not be a correct assumption, and may lead to an estimation error.
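A minimal sketch of how an independence assumption between two columns can mislead an estimate is given below. The row values, thresholds, and helper names are hypothetical. The sketch multiplies the per-column selectivities, as an estimator assuming independence would, and compares the result with the actual count for data in which high prices and high discounts occur together.

```python
# Hypothetical (price, discount) rows in which the two columns are correlated.
rows = [(100, 0.10), (90, 0.10), (80, 0.08), (20, 0.01), (10, 0.00), (5, 0.00)]

def selectivity(rows, predicate):
    """Fraction of rows satisfying a single predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

def high_price(row):
    return row[0] > 50

def high_discount(row):
    return row[1] > 0.05

# Estimate under the independence ASSUMPTION: multiply the selectivities.
independent_estimate = (
    selectivity(rows, high_price) * selectivity(rows, high_discount) * len(rows)
)

# Actual number of rows satisfying both predicates.
actual = sum(1 for row in rows if high_price(row) and high_discount(row))
print(independent_estimate, actual)
```

In this assumed data each predicate alone selects half of the rows, so the independence estimate is 1.5 rows, while 3 rows actually satisfy both predicates.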
Additionally, where aggregate results are estimated for groups of rows (for example, grouped by a group key), estimation errors may be introduced.
Again, while these estimation errors may be small, the cumulative effect of such estimation errors when a number of estimates are used to determine the cost for an execution plan may be large.
Where advanced operations, such as transitive closure, Pivot/Unpivot, and statistical extensions (such as the proposed SQL Statistical Extensions), are performed, the cardinality of the resulting complex queries cannot be estimated from the standard table column statistics using prior art techniques. In some cases, user-defined functions and aggregates also cannot be estimated using prior art techniques for estimation.
Thus, there is a need in the art for systems and methods for cardinality estimation with improved performance over those in which statistics regarding tables are used. It is desired that such systems and methods improve the accuracy of cardinality estimation.