1. Field of Invention
The present invention relates generally to the field of relational database query optimization. More specifically, the present invention is related to correlation detection and dependency discovery between columns in a relational database.
2. Discussion of Prior Art
Dependencies between columns in relational databases can be exploited to optimize queries, but they can also cause query optimizers to produce inaccurate estimates. Because query optimizers usually assume that columns are statistically independent, unaccounted-for dependencies can lead to selectivity underestimation of conjunctive predicates by several orders of magnitude. Oftentimes a query optimizer in a relational database chooses a sub-optimal plan because of the inaccurate assumption of independence between two or more columns. Such an assumption is often made by query optimizers known in the art because it simplifies estimation; for example, the selectivity of a conjunctive predicate on two columns can be estimated by simply multiplying the individual selectivities of the predicates on each column.
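The effect of the independence assumption can be illustrated with a minimal sketch. The ten-row table, the `selectivity` helper, and all variable names below are hypothetical and serve only to show how multiplying per-column selectivities can underestimate the selectivity of a conjunctive predicate when the columns are dependent.

```python
from fractions import Fraction

# Hypothetical (make, model) rows; the data is illustrative only.
rows = (
    [("Mazda", "323")] * 2
    + [("Mazda", "626")] * 1
    + [("BMW", "525")] * 3
    + [("Honda", "Civic")] * 4
)

def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return Fraction(sum(1 for r in rows if predicate(r)), len(rows))

# Per-column selectivities.
sel_make = selectivity(rows, lambda r: r[0] == "Mazda")
sel_model = selectivity(rows, lambda r: r[1] == "323")

# Estimate under the independence assumption: multiply the two.
estimated = sel_make * sel_model

# Actual selectivity of the conjunctive predicate.
actual = selectivity(rows, lambda r: r[0] == "Mazda" and r[1] == "323")

# Factor by which the independence assumption underestimates.
underestimation = actual / estimated

print(estimated, actual, underestimation)
```

In this toy table every "323" is a Mazda, so the true selectivity equals the selectivity of the model predicate alone, and the product of the two selectivities is too small.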
Query-driven and data-driven approaches address the problem of relaxing statistical independence assumptions in selectivity estimation. Query-driven approaches focus on information contained in a query workload, whereas data-driven methods analyze data values to discover correlations, or general statistical dependencies, between relational database columns. A soft functional dependency (FD) between columns C1 and C2 generalizes the classical notion of a hard FD, in which a value in C1 completely determines a corresponding value in C2. A soft FD, denoted by C1⇒C2, indicates that a value of C1 determines a corresponding value in C2 not with certainty, but with high probability. An example of a hard FD is given by “Country” and “Continent”; the former completely determines the latter. By contrast, the model and make of a car exhibit a soft FD: given that “Model=323”, “Make=Mazda” holds with high probability and “Make=BMW” with small probability. Two types of trivial cases are also identified: a soft key, having a small number of distinct values in a given column, and a trivial column, having either only null values or only a single distinct value in a given column. The values in any row of a trivial column are trivially determined by values in any other column, which leads to spurious correlations.
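The distinction between hard and soft FDs can be sketched in code. The measure below (the fraction of rows whose C2 value is the most frequent C2 value for their C1 value) is an assumption chosen for illustration, not a measure taken from the cited art; the function name `fd_strength` and the sample data are likewise hypothetical.

```python
from collections import Counter, defaultdict

def fd_strength(pairs):
    """Fraction of (c1, c2) rows in which c2 is the most common
    value observed for that c1. Returns 1.0 for a hard FD."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    groups = defaultdict(Counter)
    for c1, c2 in pairs:
        groups[c1][c2] += 1
    agreeing = sum(max(counts.values()) for counts in groups.values())
    return agreeing / len(pairs)

# Hard FD: each country maps to exactly one continent.
country_continent = [
    ("France", "Europe"), ("France", "Europe"),
    ("Japan", "Asia"), ("Kenya", "Africa"),
]

# Soft FD: model "323" is usually, but not always, a Mazda.
model_make = (
    [("323", "Mazda")] * 9
    + [("323", "BMW")] * 1
    + [("Civic", "Honda")] * 10
)

hard = fd_strength(country_continent)
soft = fd_strength(model_make)
print(hard, soft)
```

A value of 1.0 corresponds to a hard FD, while a value close to, but below, 1.0 corresponds to the soft case described above.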
In the non-patent literature “Exploiting Statistics on Query Expressions for Optimization”, Bruno and Chaudhuri disclose the use of a query workload (i.e., a list of relevant queries) together with optimizer estimates of query execution times for determining a beneficial set of Statistics on Intermediate Tables (SITS) to retain. SITS are statistics on query expressions that can be used to avoid large selectivity estimation errors due to independence assumptions.
Alternatively, a query feedback system (QFS) uses feedback from query execution to increase optimizer accuracy. In “LEO-DB2™'s learning optimizer” by Markl et al., the DB2™ learning optimizer (LEO) is presented as a typical example of a QFS. LEO compares the actual selectivities of query results with the query optimizer's estimated selectivities. In this way, LEO can detect errors caused by faulty independence assumptions and create adjustment factors that can be applied in the future to improve the optimizer's selectivity estimates.
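The feedback idea can be sketched as follows. This is not LEO's implementation; it assumes, for illustration only, that an adjustment factor is the ratio of observed to estimated selectivity and that it is later applied multiplicatively. The function names and numbers are hypothetical.

```python
from fractions import Fraction

def adjustment_factor(estimated, actual):
    """Assumed form: ratio of observed to estimated selectivity."""
    if estimated == 0:
        raise ValueError("estimated selectivity must be non-zero")
    return actual / estimated

def adjusted_estimate(estimate, factor):
    """Apply a stored factor to a later estimate, capped at 1."""
    return min(estimate * factor, Fraction(1))

# Feedback from one executed query (hypothetical numbers).
factor = adjustment_factor(Fraction(3, 50), Fraction(1, 5))

# The same predicate seen again: the corrected estimate matches
# the selectivity observed earlier.
corrected = adjusted_estimate(Fraction(3, 50), factor)
print(factor, corrected)
```

Until such a factor has been recorded for a predicate, the optimizer in this sketch would still rely on the unadjusted estimate.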
The self-adaptive histogram set (SASH) algorithm, disclosed in “A self-adaptive histogram set for dynamically changing workloads” by Lim et al., is another query-driven approach, which creates disjoint clusters of columns in a relational database. Clustered columns are treated as being correlated, whereas columns in different clusters are considered independent. In other words, SASH approximates the full joint distribution of the columns by maintaining detailed histograms on certain low-dimensional marginals in accordance with a high-level statistical interaction model. Joint frequencies are then computed as a product of marginals. Maintaining detailed histograms together with a high-level statistical interaction model can be very expensive, which limits the applicability of the SASH algorithm in commercial systems. As with other query feedback systems such as LEO, less optimal query plans can be chosen if the system has not yet received enough feedback, either during the initial startup period or after a sudden change in query workload. During one of these slow learning phases, a query optimizer is likely to avoid query plans with accurate feedback-based cost estimates in favor of other plans that appear to be less expensive because their cost estimates are based on limited quantities of actual data and faulty independence assumptions.
Most data-driven methods use discovered correlations to construct and maintain a synopsis (a lossy compressed representation) of the joint distribution of numerical attributes. Getoor et al., for example, use probabilistic relational models, which extend Bayesian network models to the relational setting, for selectivity estimation in “Selectivity estimation using probabilistic models”. Deshpande et al. provide a technique in “Independence is Good: Dependency-based histogram synopses for high-dimensional data” which, similarly to SASH, combines a Markov network model with a set of low-dimensional histograms. However, synopses are constructed based on a full scan of the base data rather than from query feedback. Both of the foregoing techniques search through the space of possible models and evaluate them according to a scoring function. As with SASH, the high cost of these methods severely limits their practical applicability.
The method of Cheng et al. provided in “Learning belief networks from data: An Information theory based approach” typifies a slightly different approach to constructing a synopsis (specifically, a Bayesian network model) from the base data. Instead of searching through a space of possible models, the method assesses the dependency between pairs of columns by using conditional independence tests based on a “mutual information” measure. The method requires that all attributes be discrete and that there be no missing data values. The method is also not scalable as it requires processing of the entire dataset.
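A “mutual information” measure between two discrete columns can be sketched as below. This is the plain pairwise (unconditional) measure computed from empirical frequencies, not the conditional independence tests of Cheng et al.; the function name and the sample data are hypothetical. Consistent with the restrictions noted above, the sketch assumes discrete values with none missing and reads every row.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information, in bits, between the two
    components of a list of (a, b) pairs of discrete values."""
    n = len(pairs)
    if n == 0:
        raise ValueError("pairs must not be empty")
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with counts.
        mi += (count / n) * math.log2(count * n / (left[a] * right[b]))
    return mi

# Every combination equally often: the columns are independent.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

# The second column always equals the first: fully dependent.
dependent = [(0, 0), (1, 1)] * 2

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

A value of zero indicates no measured dependency between the two columns, and larger values indicate a stronger dependency.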
Mining association rules and semantic integrity constraints are approaches limited in that dependencies involve relationships between a few specific values of a pair of attributes, rather than an overall relationship between attributes themselves. For example, an association might assert that ten percent of married people between ages fifty and sixty have at least two cars. This rule concerns specific values of marital status, age, and number of cars attributes.
To account for statistical interdependence between columns in a relational database, the “bump-hunting” system (B-HUNT) disclosed by Haas et al. in “BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data” searches for column pairs that might have interesting and useful correlations by systematically enumerating candidate pairs and simultaneously pruning candidates that do not appear promising by using a flexible set of heuristics. B-HUNT also analyzes a sample of rows in order to ensure scalability to larger relational databases. B-HUNT uses bump hunting techniques to discover soft algebraic relationships between columns having numerical attributes. General correlations between categorical values are not considered.
The non-patent literature by Brin, Motwani, and Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, discloses the determination of “correlation rules” in a market-basket context. Brin et al. propose the use of chi-square tests to check for independence; the use of chi-square tests for testing independence is well-known in the art. Brin et al. do not address a setting of general numeric data and several types of general statistical dependencies; they address only a specialized type of dependency based on “market-basket” input transaction records. Nor do Brin et al. address sequential testing for different kinds of dependencies, arranging data in buckets such that the chi-square contingency table has an appropriate, data-dependent number of rows and columns, using data sampling to make the algorithm scalable, systematically choosing likely pairs for analysis by combining exhaustive enumeration with heuristic pruning, or ranking the detected correlated column pairs so that the highest-ranked pairs are first recommended to the query optimizer.
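The well-known chi-square test of independence referred to above can be sketched as follows. This is the textbook Pearson statistic on a contingency table built directly from the distinct values of two columns; it does not reproduce the method of Brin et al., and the function name and sample data are hypothetical. The critical value used (3.841 for one degree of freedom at the 0.05 significance level) is a standard table value.

```python
from collections import Counter

def chi_square_statistic(pairs):
    """Pearson chi-square statistic and degrees of freedom for the
    contingency table of a list of (a, b) pairs."""
    n = len(pairs)
    if n == 0:
        raise ValueError("pairs must not be empty")
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    stat = 0.0
    for a in row_totals:
        for b in col_totals:
            # Cell count expected if the columns were independent.
            expected = row_totals[a] * col_totals[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

CRITICAL_0_05_DOF_1 = 3.841  # standard table value

independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
dependent = [(0, 0)] * 10 + [(1, 1)] * 10

stat_ind, dof_ind = chi_square_statistic(independent)
stat_dep, dof_dep = chi_square_statistic(dependent)

# Independence is rejected only when the statistic exceeds the
# critical value for the table's degrees of freedom.
print(stat_ind > CRITICAL_0_05_DOF_1, stat_dep > CRITICAL_0_05_DOF_1)
```

Because the table here has one row and one column per distinct value, columns with many distinct values would produce a large, sparse table, which is why the bucketing of values matters for this test.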
U.S. Pat. No. 5,899,986 requires a workload to determine what set of statistics to create, and relies on information in the system catalog, rather than on a sample, to determine what column groups to generate statistics on. The patent is limited in that it does not provide for a determination of relationships between tables, is not enabled to discover general correlations through a chi-square test, and does not rank discovered correlations. Functional dependency detection in the disclosed invention is based on information from a system catalog, not on information based on a sample.
Prior art is limited in that it provides no basis for determining correlations between general categorical attributes, nor does it provide a robust method for determining numerical correlations. Additionally, prior approaches are limited in their provision of correlation and dependency information to a query optimizer a priori. By contrast, the present invention provides for prioritization of a priori column pair and statistic recommendations to a query optimizer with respect to any of: a degree of correlation, a strength of dependency, or an adjustment factor.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.