The present invention is directed to an improvement in computing systems and in particular to selectivity estimation for processing SQL queries containing HAVING clauses.
In a relational database a user views the data as a collection of tables. A user expresses a query against the database in a language such as SQL. The data in a relational database is maintained by, and operations on the data are carried out by way of, a relational database management system (RDBMS), such as the DB2 Universal Database(trademark). Where a user defines a query against the database, certain RDBMSs, such as DB2 Universal Database(trademark), invoke a component called an optimizer to choose a plan to access the data and produce the result. As part of the plan selection process, estimates of the size of the result at various stages in the plan are often utilized by the optimizer to choose a low cost plan. These are called selectivity estimates.
One particular type of SQL predicate which a user may specify in formulating a query is the HAVING clause. This refers not to individual rows of tables, but to groups of rows. A HAVING clause is a predicate that is applied to a grouping of rows and determines whether the group as a whole shall be included in or excluded from the query result or temporary table.
The xe2x80x9cselectivityxe2x80x9d of a HAVING clause is the fraction of the groups that satisfy the clause. Good a priori estimates of the selectivity of a HAVING clause are beneficial in choosing a lowest cost query plan, or one with cost close to the lowest. Such estimates are hard to obtain. Usually, the only information on which such estimates can be based is a small set of summary statistics about the data that is maintained in the system catalog by the RDBMS. Also, the time to calculate such estimates must not be exceed the time allowable for a given context in which the HAVING clause is being evaluated. In the prior art, there are known techniques for determining selectivity of certain queries (see, for example, U.S. Pat. No. 6,012,054, Seputis, describing the use of histograms to estimate selectivities). However, such techniques do not lend themselves to effectively estimating selectivity for a query including a HAVING clause and typically RDBMS optimizers assign an arbitrary constant value to the estimated selectivity for the HAVING clause, for example, 0.33333. Such a crude estimate is often vastly different from the actual selectivity and can lead to a bad choice of plan where the estimate is inaccurate.
It is therefore desirable to estimate the selectivity of a having clause in an SQL query in an efficient manner to permit an appropriate low-cost query plan to be chosen by the optimizer in an RDBMS.
According to one aspect of the present invention, there is provided an improved estimation of the selectivity of a HAVING clause in an SQL query.
According to another aspect of the present invention, there is provided a method for estimating the selectivity of a HAVING clause in an SQL query on a data table having data values, within a specified time constraint, the possible number of groups for the HAVING clause being assumed to have a known distribution between a specified minimum group size and a specified maximum group size, the distribution of the data values having a specified value distribution between a specified minimum data value and a specified maximum data value, the method including the steps of:
calculating a selectivity estimate for the HAVING clause for each member in a selected sample set of group sizes using a probabilistic model based on the specified value distribution, and
using the selectivity estimate for the members of the sample set of group sizes to define the selectivity of the HAVING clause to be a weighted estimate calculated to represent the selectivity for the possible number of groups for the HAVING clause.
According to another aspect of the present invention, there is provided the above method in which the calculation of a selectivity estimate for the HAVING clause for each member in a selected sample set of group sizes includes the steps of selecting between the estimation techniques of a fill convolution, a scaled convolution, a normal approximation, or a combination of those estimation techniques and then carrying out the calculation of the selectivity estimate using the selected estimation technique.
According to another aspect of the present invention, there is provided the above method in which the selection between the estimation techniques is determined by comparing a set of predicted processing times for each of a full convolution and a scaled convolution, for differing scaling factors, with the specified time constraint and the selection being dependent on the existence of a scaling factor which provides for processing of the estimation within the specified time constraint while maintaining the accuracy of the convoluted estimated selectivity within a specified accuracy.
According to another aspect of the present invention, there is provided the above method in which the weighted estimate is calculated using interpolation based on the known distribution of group sizes to determine selectivity for group sizes not contained in the sample set of group sizes.
According to another aspect of the present invention, there is provided a program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform the above method steps for estimating the selectivity of an SQL HAVING clause in a query.
According to another aspect of the present invention, there is a relational database management system including a selectivity estimation component for estimating, within a specified time constraint, the selectivity of a HAVING clause in an SQL query on a data table having data values, the possible number of groups for the HAVING clause being assumed to have a known distribution between a specified minimum group size and a specified maximum group size, the distribution of the data values having a specified value distribution between a specified minimum data value and a specified maximum data value, the selectivity estimation component including:
a member estimation component for determining an estimated selectivity for the HAVING clause for each member in a selected sample set of group sizes using a probabilistic model based on the specified value distribution, and
a combination component for defining the selectivity of the HAVING clause using the selectivity estimate for the members of the sample set of group sizes to calculate a weighted estimate representing the selectivity for the possible number of groups for the HAVING clause.
According to another aspect of the present invention, there is provided the above relational database management system in which the member estimation component includes a selection component for selecting between the estimation techniques of a full convolution, a scaled convolution, a normal approximation, or a combination of those estimation techniques and further includes an execution component for carrying out the calculation of the selectivity estimate using the selected estimation technique.
According to another aspect of the present invention, there is provided the above relational database management system in which the selection component includes a comparator component for comparing a set of predicted processing times for each of a full convolution and a scaled convolution, for differing scaling factors, with the specified time constraint and the selection between the techniques being dependent on the existence of a scaling factor which provides for processing of the estimation within the specified time constraint while maintaining the accuracy of the convoluted estimated selectivity within a specified accuracy.
According to another aspect of the present invention, there is provided the above relational database management system in which the combination component includes an interpolation component for calculating the weighted estimate using interpolation based on the known distribution of group sizes to determine selectivity for group sizes not contained in the sample set of group sizes.
Advantages of the present invention include permitting a low-cost access plan to be chosen by the optimizer in an RDBMS carrying out an SQL query on a data table by estimating the selectivity of a HAVING clause in the query.