Databases are used to store information for an innumerable number of applications, including various commercial, industrial, technical, scientific and educational applications. As the reliance on information increases, both the volume of information stored in most databases, as well as the number of users wishing to access that information, likewise increases. Moreover, as the volume of information in a database, and the number of users wishing to access the database, increases, the amount of computing resources required to manage such a database increases as well.
Database management systems (DBMS's), which are the computer programs that are used to access the information stored in databases, therefore often require tremendous resources to handle the heavy workloads placed on such systems. As such, significant resources have been devoted to increasing the performance of database management systems with respect to processing searches, or queries, to databases.
Improvements to both computer hardware and software have improved the capacities of conventional database management systems. For example, in the hardware realm, increases in microprocessor performance, coupled with improved memory management systems, have improved the number of queries that a particular microprocessor can perform in a given unit of time. Furthermore, the use of multiple microprocessors and/or multiple networked computers has further increased the capacities of many database management systems. From a software standpoint, the use of relational databases, which organize information into formally-defined tables consisting of rows and columns, and which are typically accessed using a standardized language such as Structured Query Language (SQL), has substantially improved processing efficiency, as well as substantially simplified the creation, organization, and extension of information within a database.
Furthermore, significant development efforts have been directed toward query “optimization,” whereby the execution of particular searches, or queries, is optimized in an automated manner to minimize the amount of resources required to execute each query. A query optimizer typically generates, for each submitted query, an access plan. In generating an access plan, a query optimizer typically creates multiple potential access plans, and selects the best among those potential access plans based upon the “cost” of each plan. The cost of each plan represents the amount of resources expected to be utilized by the plan (typically expressed in the expected amount of time to execute), and as such, selecting the plan with the lowest cost typically results in the most efficient (and quickest) execution of a query.
Database statistics are often used by a query optimizer to estimate the cost of potential access plans. For example, some optimizers rely on Frequent Value Lists (FVL's), which identify the most populous values of a column of a table. A FVL may also simply identify a certain number (N) of the most populous values, instead of all the populous values in the column of a table. Among other benefits, a FVL may be used to identify skew values, i.e., values that are represented more times in a column than an average value. Typically, skew values are among the most populous values of a column since skew values are represented more times in the column than an average value. Therefore, when a FVL is built for the column, the skew values are typically included in the FVL.
In many queries, there is often a join operation that is performed between columns of tables. A join is typically an operation used to combine data from two or more different tables according to some criteria. Many different types of joins exist. An inner join typically returns only those rows from the tables that match according to the criteria. An outer join, however, typically returns both the rows that match and the rows that do not have a match.
In many systems, separate FVL's are generated for each column of a join. During optimization, the FVL's of the columns in the join may be combined to estimate a join fanout, which is an estimate of the number of rows that will be returned or joined after the join operation completes. Therefore, FVL's are also beneficial in estimating the number of rows that will join, and thus ultimately what access plan will operate most efficiently given the estimated number of rows. Skew values are important for the join fanout calculation because a join that has columns with skew values will likely join with more rows than a join that has columns without skew values. Therefore, in order to accurately estimate join fanout, it is important to ensure that any skew values present in a column are considered when estimating the join fanout. Otherwise, when skew values are missing in any of the FVL's, this may lead to an underestimated join fanout when the FVL's are merged during optimization. As a result, the number of rows that will likely join may be underestimated. Moreover, the cost associated with the amount of time and resources required to carry out the join may also be underestimated and may mislead the optimizer to believe that the cost of an access plan is lower than it actually will be.
It has been found, however, that the FVL's of a join that contains a unique column may introduce a problem when attempting to estimate a join fanout during optimization. A unique column is a column that contains each unique value exactly once, whereas non-unique columns may contain values that are repeated in multiple rows. In particular, when one of the columns in the join is unique, the FVL of the unique column in the join may not properly reflect any skew values for the join as a whole. Typically, a skew value in a non-unique column stands out and will appear in the FVL of a non-unique column as one of the N most populous values. However, that same value may not stand out in a unique column. As a result, a value in a unique column that is a skew value in the column of a foreign table of the join, but that is not a skew value in the local table, is often omitted from the FVL of the unique column.
Conventional optimization techniques do not provide a way to ascertain the skew values in a foreign table. Moreover, since particular values in a unique column may not stand out, many designs arbitrarily or randomly sample values from a unique column to build the FVL for the unique column. Thus, using conventional techniques, the probability that the skew value of a foreign table is randomly chosen from the unique column to build the FVL is often low. The probability depends upon the value of N and the quantity of records in the unique column. Given that the quantity of records can range from tens of records to millions of records, the unpredictable nature of the conventional techniques typically results in FVL's for unique columns that do not reflect skew values across the join.
In some systems, a cardinality formula may be used to increase the join fanout estimate and account for some of the rows that will join but are not reflected in the FVL's. However, the estimates typically resulting from the cardinality formula may still underestimate the number of rows that will be joining. Therefore, it is desirable for FVL's to reflect as accurately as possible any skew values across the join. Without a better reflection of the skew values in the FVL's, the FVL's may lead to an underestimated join fanout, underestimated costs, the selection of a suboptimal access plan under the current runtime conditions by the query optimizer, and performance problems.
A need therefore exists in the art for an improved manner of building database statistics that more accurately reflects skew values in foreign tables, thus, resulting in statistics that more accurately reflect skew across joins and leading to the selection of better access plans and improved performance.