1. Field of Invention
The present invention relates generally to the field of databases. More specifically, the present invention is related to an algorithm to automatically identify algebraic constraints between pairs of columns in relational data.
2. Discussion of Prior Art
Commercial DBMS vendors increasingly view autonomic and self-managing technologies as crucial for maintaining the usability and decreasing the ownership costs of their systems. Self-tuning database systems have also been receiving renewed attention from the research community (see, for example, the paper by Weikum et al. entitled, “Self-tuning database technology and information services: from wishful thinking to viable engineering”, and references therein). Query optimizers that actively learn about relationships in the data are an important component of this emerging technology.
Previous work on automatic methods for learning about data relationships can be categorized according to whether the learning technique is query- or data-driven, and according to the type of information discovered. Query-driven techniques have the property that the mined information is, by definition, directly relevant to the user's needs and interests. This narrowed focus often leads to high accuracy. On the other hand, query-driven techniques can result in poor performance during the “warm-up” stage of query processing in which not enough queries have been seen yet. Similar problems arise when the workload starts to change, or when processing a query that is unlike any query previously seen. Indeed, use of query-driven techniques can cause a learning optimizer to “careen towards ignorance” by preferring query plans about which less is known, even if the plans are actually quite inefficient. The reason for this preference is that, in the absence of solid information, an optimizer usually underestimates the cost of a plan, for example, by making unrealistic independence assumptions. Data-driven techniques, though often less precise, complement query-driven techniques and can ameliorate their shortcomings.
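The underestimation caused by an independence assumption can be illustrated with a minimal sketch. The toy table, the column contents, and the helper names below are hypothetical and are not taken from any cited system; the sketch only shows that multiplying per-column selectivities yields too small a cardinality estimate when the columns are correlated.

```python
# Hypothetical toy table of (model, make) pairs; the two columns are strongly correlated.
rows = [("Camry", "Toyota")] * 4 + [("Corolla", "Toyota")] * 2 + [("Civic", "Honda")] * 4


def selectivity(table, predicate):
    """Fraction of rows in `table` that satisfy `predicate`."""
    return sum(1 for row in table if predicate(row)) / len(table)


def is_camry(row):
    return row[0] == "Camry"


def is_toyota(row):
    return row[1] == "Toyota"


# Independence assumption: the selectivity of the conjunction is the product
# of the individual selectivities.
estimated_rows = len(rows) * selectivity(rows, is_camry) * selectivity(rows, is_toyota)

# True cardinality of the conjunctive predicate.
actual_rows = sum(1 for row in rows if is_camry(row) and is_toyota(row))
```

In this toy table every Camry is a Toyota, so the product of selectivities understates the number of qualifying rows, and any plan cost derived from that estimate is understated as well.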
One useful type of information about relationships in data is the multidimensional distribution of a set of attributes. A variety of data-driven techniques have been developed for producing “synopses” that capture such distributions in a compressed form (see, for example, the following papers/reports and references therein: (a) Barbara et al. in the report entitled, “The New Jersey data reduction report”; (b) Deshpande et al. in the paper entitled, “Independence is good: Dependency-based histogram synopses for high-dimensional data”; (c) Garofalakis et al. in the paper entitled, “Wavelet synopses with error guarantees”; and (d) Poosala et al. in the paper entitled, “Selectivity estimation without the attribute value independence assumption”). These methods are based on a scan or sample of the database, which can be initiated by the user or by the system. The methods have somewhat less of an autonomic feel than query-driven methods, because typically the user must specify which attributes to include in each synopsis. Also, methods for maintaining and exploiting synopses are typically expensive and complicated and therefore are hard to implement in commercial database systems.
A number of researchers have provided methods for maintaining useful statistics on intermediate query results such as partial joins. The LEO learning optimizer, for example, improves cardinality estimates for intermediate results by observing the data returned by user queries (see the paper by Stillger et al. entitled, “LEO—DB2's LEarning Optimizer”). Techniques proposed by Bruno and Chaudhuri (see the paper by Bruno et al. entitled, “Exploiting statistics on query expressions for optimization”) determine the “most important” statistics on intermediate query expressions (SITs) to maintain based on a workload analysis.
The information provided by the foregoing techniques is used by the optimizer to improve the cost estimates of the various access plans under consideration. An alternative set of techniques provides information to the optimizer in the form of rules or constraints. The optimizer can directly use such information to consider alternative access paths. Important types of constraints include functional dependencies, multi-valued dependencies, and semantic integrity constraints.
Two columns a1 and a2 of categorical data obey a functional dependency if the value of a1 determines the value of a2. A typical example of a functional dependency occurs when a1 contains car models and a2 contains car makes. For example, a car model value of Camry implies a car make value of Toyota. A multi-valued dependency is a generalization of a functional dependency that in effect provides a necessary and sufficient condition under which a relation can be decomposed into smaller normalized relations. Mining of functional and multi-valued dependencies is discussed in various papers (see the following papers: (a) the paper by Bell et al. entitled, “Discovery of constraints and data dependencies in databases”; (b) the paper by Huhtala et al. entitled, “TANE: An efficient algorithm for discovering functional and approximate dependencies”; (c) the paper by Petit et al. entitled, “Towards the reverse engineering of denormalized relational databases”; and (d) the paper by Wong et al. entitled, “Automated database schema design using mined data dependencies”).
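The definition of a functional dependency given above can be checked directly on a small table. The following is a minimal sketch under stated assumptions: the function name, the dictionary-based row format, and the sample data are hypothetical, and the sketch is not the method of any of the cited papers, which use considerably more efficient search strategies.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if each value of `determinant` maps to exactly one value of `dependent`."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        # setdefault stores the first value seen for this key and returns the
        # stored value, so a later conflicting value exposes a violation.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample data following the car model / car make example.
cars = [
    {"model": "Camry", "make": "Toyota"},
    {"model": "Corolla", "make": "Toyota"},
    {"model": "Civic", "make": "Honda"},
    {"model": "Camry", "make": "Toyota"},
]

model_determines_make = holds_functional_dependency(cars, "model", "make")
make_determines_model = holds_functional_dependency(cars, "make", "model")
```

In this sample, the model determines the make, but the make does not determine the model, because Toyota appears with both Camry and Corolla.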
Semantic integrity constraints arise in the setting of semantic query optimization. For example, Siegel et al. in the paper entitled, “A method for automatic rule derivation to support semantic query optimization” and Yu et al. in the paper entitled, “Automatic knowledge acquisition and maintenance for semantic query optimization”, consider query-driven approaches for discovering constraints of the form A→B and JC→(A→B), where JC is a join condition, and A→B is a rule such as s.city=chicago→t.weight>200.
The above-mentioned prior art techniques are closely related to techniques used in reverse engineering and discovery of entity-relationship (ER) models for legacy databases (see, for example, the following papers and references therein: the paper by Bell et al. entitled, “Discovery of constraints and data dependencies in databases” and the paper by Petit et al. entitled, “Towards the reverse engineering of denormalized relational databases”). Many of these algorithms rely on information contained in the schema definition—such as primary-key declarations—or in a set of workload queries. Algorithms such as those described in Bell et al. and Petit et al. execute a sequence of queries involving joins and COUNT(DISTINCT) operations to discover inclusion dependencies—an inclusion dependency exists between columns a1 and a2 if every value that appears in a2 also appears in a1.
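The inclusion-dependency test described above, based on a join and COUNT(DISTINCT) operations, can be sketched as follows. This is a minimal illustration only: the function name, the table and column names, and the use of an in-memory SQLite database are assumptions made for the example, and the sketch does not reproduce the specific query sequences of Bell et al. or Petit et al.

```python
import sqlite3


def inclusion_dependency_holds(conn, parent, a1, child, a2):
    """Return True if every non-NULL value of child.a2 also appears in parent.a1.

    Table and column names are interpolated into the SQL text, so this sketch
    must only be called with trusted identifiers.
    """
    # Number of distinct values in the dependent column.
    total = conn.execute(
        f"SELECT COUNT(DISTINCT {a2}) FROM {child}"
    ).fetchone()[0]
    # Number of those distinct values that find a join partner in the parent column.
    matched = conn.execute(
        f"SELECT COUNT(DISTINCT c.{a2}) FROM {child} AS c "
        f"JOIN {parent} AS p ON c.{a2} = p.{a1}"
    ).fetchone()[0]
    return total == matched


# Hypothetical sample schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER);
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1), (1), (3);
    """
)

orders_in_customers = inclusion_dependency_holds(conn, "customers", "id", "orders", "customer_id")
customers_in_orders = inclusion_dependency_holds(conn, "orders", "customer_id", "customers", "id")
```

Here every customer_id in orders appears as an id in customers, so the dependency holds in that direction; it fails in the reverse direction because customer 2 has no orders.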
The following patents/references provide for a general teaching in the area of data mining, but they fail to provide for the limitations of the present invention's method.
The U.S. patent publication to Carlbom et al. (U.S. 2003/0023612) discloses a system performing data mining based upon real-time analysis of sensor data. The performance data mining system combines detailed sensor analysis data with other data sources to discover interesting patterns/rules for performance and utilizes real time sensor analysis to dynamically derive mining results in real time during an event. The system described in Carlbom et al. automatically generates advice/strategy and predictions based on specified criteria.
The U.S. patent publication to Wolff et al. (2002/0198877) provides for a method for mining association rules in a database that is divided into multiple partitions associated with respective computer nodes. The method of Wolff et al. includes transmitting messages among the nodes with respect to local support of an itemset in the respective partitions of the database. Responsive to the messages transmitted by a subset of the nodes, the itemset is determined to be globally frequent in the database before the nodes outside the subset have transmitted the messages with respect to the local support of the itemset in their respective partitions. An association rule is computed with respect to the itemset, responsive to having determined the itemset to be globally frequent.
The U.S. patent to Wang et al. (U.S. Pat. No. 6,415,287) provides for a method and system for mining weighted association rules. Wang et al. extend the traditional association rule problem by allowing a weight to be associated with each item in a transaction to reflect the interest/intensity of each item within the transaction. The weighted association rules are discovered from a set of tuple lists, where each tuple consists of an item and an associated weight and each tuple list consists of multiple tuples.
The U.S. patent to Mitsubishi et al. (U.S. Pat. No. 6,385,608) discloses a method and apparatus for discovering association rules. A candidate-itemset generating unit generates a candidate-itemset composed of at least one candidate item to be included in the left hand side or the right hand side of the association rule. A candidate-itemset verifying unit selects, as large-itemsets, those candidate-itemsets whose frequencies (numbers of appearances in the database) exceed the minimum frequency. A candidate rule generating unit generates candidate association rules based on a large-itemset of length k-1 and a large-itemset of length 1. A chi-square testing unit generates an association rule set based on the candidate association rules.
The U.S. patent to Ozden et al. (U.S. Pat. No. 6,278,998) discloses a system and method for discovering association rules that display regular cyclic variation over time. Such association rules may apply over daily, weekly or monthly (or other) cycles of sales data or the like. A first technique, referred to as the sequential algorithm, treats association rules and cycles relatively independently. Based on the interaction between association rules and time, Ozden employs a technique called cycle pruning, which reduces the amount of time needed to find cyclic association rules. A second technique, referred to as the interleaved algorithm, uses cycle pruning and other optimization techniques for discovering cyclic association rules with reduced overhead.
The U.S. patent to Mahajan et al. (U.S. Pat. No. 6,236,982) discloses a method that uses calendars to describe the variation of association rules over time, where a specific calendar is defined as a collection of time intervals describing the same phenomenon.
The U.S. patent to Aggarwal et al. (U.S. Pat. No. 5,943,667) discloses a computer method for removing simple and strict redundant association rules generated from large collections of data. U.S. Pat. No. 6,061,682 to Agrawal et al. provides for a method and apparatus for mining association rules having item constraints. U.S. Pat. No. 5,842,200, also to Agrawal et al., provides for a system and method for parallel mining of association rules in databases.
The Japanese patent to Shigeru et al. (JP 2001-344259) provides for an incremental mining method which increases the data mining speed at the time of data addition or deletion.
The paper by Czejdo et al., entitled “Materialized views in data mining,” discloses the use of materialized views in the domains of association rules discovery and sequential pattern search.
The paper by Lee et al. entitled, “On Mining General Temporal Association Rules in a Publication Database,” discloses a progressive partition miner, wherein the cumulative information of mining previous partitions is selectively carried over toward the generation of candidate itemsets for the subsequent partitions.
The paper by Bosc et al. entitled, “On some fuzzy extensions of association rules,” discloses the semantics of two fuzzy extensions of the classical concept of an association rule.
The paper by Manning et al. entitled, “Data allocation algorithm for parallel association rule discovery,” discloses an algorithm that uses principal component analysis to improve data distribution prior to fast parallel mining.
The paper by Srikant et al. entitled, “Mining quantitative association rules in large relational tables,” discloses techniques for mining in large relational tables containing both quantitative and categorical attributes.
The paper by Tsai et al. entitled, “Mining quantitative association rules in a large database of sales transactions,” discloses partition algorithms for partitioning data and a scheme to discover all the large itemsets from the partitioned data.
The paper by Godfrey et al. entitled, “Exploiting Constraint-Like Data Characterizations in Query Optimization,” discloses advantages of optimizing queries in a database like DB2 given a set of integrity constraints. The paper by Gryz et al. entitled, “Discovery and Application of Check Constraints in DB2,” discloses advantages of identifying regularities in data stored in a database such as DB2.
Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.