1. Field of the Invention
The present invention relates to a method of estimating query selectivity, which the query optimizer of a database management system uses to find the most efficient execution plan among all possible plans. More particularly, it relates to a multi-dimensional selectivity estimation method using compressed histogram information, in which compressed histogram information on the multi-dimensional data distribution is stored by a schema manager and used for selectivity estimation. Because the histogram buckets are small, a low error rate can be achieved, and because the information of the large number of histogram buckets is compressed, a low storage overhead can also be achieved.
2. Discussion of Related Art
Generally, in a database management system as shown in FIG. 1, the database query optimizer requires an estimate of the query selectivity to find the most efficient execution plan. Selectivity estimation problems fall into two classes according to dimensionality: 1-dimensional selectivity estimation and multi-dimensional selectivity estimation. The 1-dimensional selectivity estimation technique is applied to a query with a single attribute, or with multiple attributes that are independent of each other, and a histogram method is used for it in practice.
To approximate and store the information of the data distribution, the distribution is divided into small non-overlapping buckets; the statistics recording the interval of each bucket and the number of data items in it are called a histogram. Selectivity estimation using a histogram proceeds as follows. First, all buckets that overlap with the query are selected. The statistics in each bucket are then used to compute the number of data items in that bucket that satisfy the query. Finally, the numbers of satisfying data items from each bucket are summed to obtain the estimation result. Histogram methods are further classified according to how the data distribution is partitioned into buckets: the Equi-width method, the Equi-depth method, the V-optimal method, etc.
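The three estimation steps described above can be illustrated with a minimal sketch. It assumes that data are spread uniformly inside each bucket, which the text does not state; the names `Bucket`, `estimate_count`, and `estimate_selectivity` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# A bucket is (low, high, count): `count` values fall in the interval [low, high).
Bucket = Tuple[float, float, int]

def estimate_count(buckets: List[Bucket], q_low: float, q_high: float) -> float:
    """Estimate how many values fall in the query range [q_low, q_high)."""
    total = 0.0
    for low, high, count in buckets:
        # Step 1: keep only buckets that overlap the query range.
        overlap = min(high, q_high) - max(low, q_low)
        if overlap <= 0:
            continue
        # Step 2: ASSUMPTION (not from the source): values are uniform inside
        # a bucket, so the bucket contributes in proportion to the overlap.
        total += count * overlap / (high - low)
    # Step 3: the per-bucket contributions have been summed into `total`.
    return total

def estimate_selectivity(buckets: List[Bucket], q_low: float, q_high: float) -> float:
    """Estimated fraction of all values that satisfy the range query."""
    n = sum(count for _, _, count in buckets)
    return estimate_count(buckets, q_low, q_high) / n if n else 0.0

buckets = [(0.0, 10.0, 40), (10.0, 20.0, 10), (20.0, 30.0, 50)]
print(estimate_selectivity(buckets, 5.0, 25.0))
```

In this example the query covers half of the first bucket, all of the second, and half of the third, so the estimate is (20 + 10 + 25) / 100.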
In the Equi-width method, the widths of the buckets are equal, and the number of data items in each bucket approximates the data distribution. In the Equi-depth method, each bucket holds the same number of data items, so the widths of the buckets differ. Compared with the Equi-width method, the Equi-depth method is better suited to highly skewed data. In the V-optimal method, the sum of the weighted variances of the buckets is minimized. The V-optimal method has been shown to be the most accurate histogram method among the above methods [Reference: Y. Ioannidis, V. Poosala. Balancing Optimality and Practicality for Query Result Size Estimation, ACM SIGMOD Conference 1995].
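The difference between the Equi-width and Equi-depth partitionings can be seen in a small sketch. The function names and the sample data are hypothetical, and the sketch assumes at least `k` values that are not all equal; the V-optimal method is not shown.

```python
from typing import List, Tuple

Bucket = Tuple[float, float, int]  # (low, high, count)

def equi_width(values: List[float], k: int) -> List[Bucket]:
    """k buckets of equal width; the counts vary with the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k          # assumes hi > lo
    counts = [0] * k
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(k)]

def equi_depth(values: List[float], k: int) -> List[Bucket]:
    """k buckets holding (nearly) equal numbers of values; the widths vary."""
    s = sorted(values)
    n = len(s)                     # assumes n >= k
    cuts = [i * n // k for i in range(k + 1)]  # cut positions in sorted order
    out = []
    for i in range(k):
        chunk = s[cuts[i]:cuts[i + 1]]
        # Here a bucket spans from its smallest to its largest value.
        out.append((chunk[0], chunk[-1], len(chunk)))
    return out

skewed = [1, 1, 2, 2, 3, 3, 4, 100]
print(equi_width(skewed, 2))
print(equi_depth(skewed, 2))
```

With this skewed sample, the Equi-width buckets hold 7 and 1 values, while the Equi-depth buckets hold 4 values each, which is why the latter tracks skewed data more closely.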
For queries referencing multiple attributes of the same relation, a multi-dimensional selectivity estimation technique is needed when the attributes depend on each other, because the selectivity is then determined by the joint data distribution of the attributes.
Various methods have been proposed for multi-dimensional selectivity estimation, as follows:
First, there is a selectivity estimation method using a correlation fractal dimension, which is used for queries in a geographic information system [Reference: A. Belussi, C. Faloutsos. Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension. VLDB Conference 1995]. However, selectivity estimation using the correlation fractal dimension can compute only the average of the estimation results over queries of the same shape and size; it cannot compute the estimation result for a query at a specified position. Additionally, this method can be used in practice only in two and three dimensions.
Second, there is an estimation method that uses a multi-dimensional file organization called the multilevel grid file (MLGF) [Reference: K. Y. Whang, S. W. Kim, G. Wiederhold. Dynamic Maintenance of Data Distribution for Selectivity Estimation, VLDB Journal Vol. 3, No. 1, pp. 29-51, 1994]. The MLGF partitions the multi-dimensional data space into several disjoint nodes, called grids, that act as histogram buckets. A new field, count, is added to each grid node to store the number of data items in the grid. The selectivity is estimated by accessing the grid nodes that overlap with a query. This method supports dynamic data updates because the MLGF itself is a dynamic access method, so the histogram information used for selectivity estimation is updated immediately when data are updated. Therefore, in an environment where data are updated frequently, the overhead of periodically reconstructing the histogram information can be eliminated. However, the MLGF suffers from the curse of dimensionality, that is, severe performance degradation in high dimensions [Reference: S. Berchtold, C. Bohm, H. Kriegel. The Pyramid Technique: Towards Breaking the Curse of Dimensionality. ACM SIGMOD Conference 1998]. Consequently, the method cannot be applied in dimensions higher than three.
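The idea of keeping a count per grid node, so that the statistics follow every update, can be sketched as below. This is not the MLGF itself, which is a multilevel, dynamically splitting structure; it is a flat uniform grid over the unit cube used as a simplified stand-in. The class `GridCounts`, its methods, and the uniform-spread assumption inside a cell are all hypothetical.

```python
from collections import defaultdict

class GridCounts:
    """Flat uniform grid over [0, 1)^dims with one counter per cell.

    Simplified stand-in for illustration only; NOT the multilevel grid file.
    """

    def __init__(self, dims: int, cells_per_dim: int):
        self.dims = dims
        self.k = cells_per_dim
        self.counts = defaultdict(int)   # cell coordinates -> count
        self.total = 0

    def _cell(self, point):
        # Map each coordinate to its cell index, clamped to the last cell.
        return tuple(min(int(x * self.k), self.k - 1) for x in point)

    def insert(self, point):
        self.counts[self._cell(point)] += 1   # statistics updated immediately
        self.total += 1

    def delete(self, point):
        cell = self._cell(point)
        if self.counts.get(cell, 0) > 0:
            self.counts[cell] -= 1
            self.total -= 1

    def selectivity(self, low, high):
        """Estimate the fraction of points inside the box [low, high)."""
        if self.total == 0:
            return 0.0
        est = 0.0
        for cell, count in self.counts.items():
            frac = 1.0
            for i, c in enumerate(cell):
                c_lo, c_hi = c / self.k, (c + 1) / self.k
                overlap = min(c_hi, high[i]) - max(c_lo, low[i])
                if overlap <= 0:
                    frac = 0.0       # cell does not overlap the query box
                    break
                # ASSUMPTION: points are uniform within a cell.
                frac *= overlap * self.k
            est += count * frac
        return est / self.total

g = GridCounts(dims=2, cells_per_dim=4)
for p in [(0.1, 0.1), (0.2, 0.2), (0.6, 0.6), (0.9, 0.9)]:
    g.insert(p)
print(g.selectivity((0.0, 0.0), (0.5, 0.5)))
g.delete((0.1, 0.1))
print(g.selectivity((0.0, 0.0), (0.5, 0.5)))
```

Because each insert or delete touches one counter, no periodic rebuild of the statistics is needed. The number of cells, however, grows as `cells_per_dim ** dims`.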
Third, a Singular Value Decomposition (SVD) method has been proposed. The SVD method decomposes the joint data distribution matrix J into three matrices U, D, and V that satisfy J = UDV^T. The large-magnitude diagonal entries of the diagonal matrix D are selected together with their paired left singular vectors from U and right singular vectors from V. These singular vectors are then partitioned using any one-dimensional histogram method. Many efficient SVD algorithms exist, but the SVD method can be used only in two dimensions.
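To make the role of the singular vectors concrete, the sketch below extracts only the largest diagonal entry of D and its paired left and right singular vectors, using plain power iteration rather than a full SVD routine. This is a swapped-in simplification chosen to keep the example dependency-free; the function name and the sample matrix are hypothetical, and the subsequent one-dimensional histogram partitioning of the vectors is not shown.

```python
import math

def leading_singular_triple(J, iters=50):
    """Return (sigma, u, v) such that sigma * u * v^T approximates J.

    Power iteration for the largest singular value only. It starts from an
    all-ones vector, which is adequate for non-negative frequency matrices.
    """
    rows, cols = len(J), len(J[0])
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    sigma = 0.0
    for _ in range(iters):
        # u <- J v, normalized to unit length.
        u = [sum(J[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        # v <- J^T u; its length converges to the largest singular value.
        v = [sum(J[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        if sigma == 0.0:
            break
        v = [x / sigma for x in v]
    return sigma, u, v

# Joint frequency matrix of two perfectly correlated attributes (rank 1).
J = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
sigma, u, v = leading_singular_triple(J)
approx = [[sigma * ui * vj for vj in v] for ui in u]
print(approx)
```

For this rank-1 matrix the single triple reproduces J up to rounding. In general, several of the largest triples would be kept and the stored vectors `u` and `v` would each be compressed with a one-dimensional histogram.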
Fourth, a Hilbert numbering method has been proposed. The Hilbert numbering method converts the multi-dimensional joint data distribution into a 1-dimensional one and partitions it into several disjoint histogram buckets using any one-dimensional histogram method. The buckets produced by this method may not be rectangular, so it is difficult to find the buckets that overlap with a query. The estimates may also be inaccurate because the method does not preserve multi-dimensional proximity in one dimension.
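The conversion from two dimensions to one can be sketched with the widely used iterative Hilbert-curve index computation for a square grid whose side is a power of two. The function name `hilbert_index` and the 4x4 grid are hypothetical choices for this illustration; the later histogram partitioning of the linearized cells is not shown.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two; 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Linearize a 4 x 4 grid: cells listed in Hilbert order.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_index(4, c[0], c[1]))
print(cells)
print(hilbert_index(4, 1, 0), hilbert_index(4, 2, 0))
```

Consecutive cells along the curve are always neighbors in the grid, but the converse does not hold: cells (1, 0) and (2, 0) are adjacent in two dimensions yet receive indices 1 and 14. A one-dimensional bucket covering a run of consecutive indices therefore generally has a non-rectangular shape.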
Fifth, the PHASED method and the MHIST method have been proposed. The PHASED method partitions an n-dimensional space along one arbitrarily chosen dimension using the Equi-depth histogram method, and repeats this until all dimensions are partitioned. The MHIST method is an improvement on the PHASED method: at each step it selects the most important dimension and partitions it. When the V-optimal method is applied as the partitioning method in MHIST, the dimension with the largest variance is the most important. Experiments showed that the MHIST technique is the best among a variety of multi-dimensional histogram techniques [Reference: V. Poosala, Y. E. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. VLDB Conference 1997]. However, even though it produces low error rates in 2-dimensional cases, it has relatively high error rates in spaces of three or more dimensions.
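A rough sketch of PHASED-style partitioning follows. It assumes a fixed dimension order (0, 1, ...) in place of an arbitrary choice and a fixed number of Equi-depth splits per dimension; all function names are hypothetical. The MHIST refinement of choosing the most important dimension at each step is not shown.

```python
def equi_depth_split(points, dim, k):
    """Split points into k groups of (nearly) equal size along one dimension."""
    s = sorted(points, key=lambda p: p[dim])
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]

def phased(points, splits_per_dim):
    """Partition along dimension 0, then each part along dimension 1, etc."""
    groups = [list(points)]
    for dim, k in enumerate(splits_per_dim):
        groups = [part
                  for g in groups
                  for part in equi_depth_split(g, dim, k)
                  if part]                      # drop empty groups
    return groups

def bounding_bucket(group):
    """Summarize a group as (lower corner, upper corner, count)."""
    dims = len(group[0])
    lows = tuple(min(p[d] for p in group) for d in range(dims))
    highs = tuple(max(p[d] for p in group) for d in range(dims))
    return lows, highs, len(group)

points = [(x, y) for x in range(4) for y in range(4)]
buckets = [bounding_bucket(g) for g in phased(points, (2, 2))]
print(buckets)
```

On this uniform 4x4 sample, two splits per dimension yield four buckets of four points each. With k splits in each of n dimensions, the bucket count is k to the power n.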
Meanwhile, to achieve low error rates with the histogram method, the histogram buckets must be small. As the dimension increases, however, the number of histogram buckets needed to achieve low error rates grows explosively. This is because the number of histogram buckets is inversely proportional to the normalized one-dimensional length of a partitioned multi-dimensional bucket raised to the power of the dimension, as expressed by the equation below.
$$N \propto \frac{1}{d^{n}}$$
where N is the number of histogram buckets, n is the dimension, and d is the normalized 1-dimensional length of a bucket, satisfying 0 < d < 1.
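The explosive growth can be checked with simple arithmetic. The sketch assumes every axis is cut into the same whole number of pieces, so that the bucket side length is d = 1 / cells_per_dim; the function name and the chosen values are hypothetical.

```python
def bucket_count(cells_per_dim: int, dims: int) -> int:
    """Buckets needed when each axis is cut into `cells_per_dim` pieces.

    Corresponds to a normalized bucket side length d = 1 / cells_per_dim.
    """
    return cells_per_dim ** dims

# Side length d = 0.1 (10 pieces per axis) in increasing dimensions.
for n in (1, 2, 3, 5, 10):
    print(n, bucket_count(10, n))
```

With d = 0.1, three dimensions already need 1000 buckets and ten dimensions need 10^10, even though the bucket side length has not changed.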
This causes a severe storage overhead, so the buckets cannot be made small enough to achieve low error rates. It is therefore impossible to keep storage reasonably small while maintaining low error rates in high dimensions. It is also difficult to partition a multi-dimensional space into disjoint histogram buckets efficiently so that the error rates remain small. From a practical point of view, these methods cannot be used in dimensions higher than three.
Another problem is that all methods except the MLGF method cannot immediately reflect dynamic data updates in the statistics used for selectivity estimation. This leads to additional overhead, such as the periodic reconstruction of the statistics used for estimation.