The present invention relates to the use of conditional functional dependencies (CFDs) to characterize the quality of data in relational schema.
Explaining first the notion of a functional dependency, let X and Y be subsets of a relational schema R. For example, the set of data shown in Table 4, called “SALES,” contains purchase records of an international retailer, with the following schema:
                SALES (tid, name, type, vat, country, city)
which means that each record of the schema comprises a 6-tuple having a transaction identifier tid and wherein a product with a given name and type was sold in a given country and city for a given price and charged a given value-added tax (VAT), or vat.
A functional dependency (FD) X→Y asserts that any two tuples that agree on the values of all the attributes in a subset X of the attributes (the antecedent) must agree on the values of all the attributes in a subset Y of the attributes (the consequent). Thus if the attributes in X are “name,” “type” and “country” and the attributes in Y are “price” and “vat,” the functional dependency (FD) X→Y asserts that any given combination of item name, type and country should have the same price and the same vat. That is, all pairs of tuples with the same antecedent combination should have the same price and vat. Thus if (Smith, book, USA) has ($20, $1), then every other (Smith, book, USA) must have ($20, $1). Violations of FDs indicate inconsistencies in the data. Thus FDs are useful for characterizing data quality, with a fundamental issue being how to discern which tuples satisfy the FD and which do not.
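By way of illustration only, the FD check just described can be sketched in Python as follows. The function name fd_violations, the dictionary-based record layout and the sample records are assumptions made for this sketch and are not taken from Table 4; the records also carry a price attribute because the FD refers to it.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Map each antecedent value that violates lhs -> rhs to the set of
    distinct consequent values observed for it (hypothetical helper)."""
    seen = defaultdict(set)
    for row in rows:
        x = tuple(row[a] for a in lhs)   # antecedent values
        y = tuple(row[a] for a in rhs)   # consequent values
        seen[x].add(y)
    # An antecedent combination violates the FD when it maps to more
    # than one consequent combination.
    return {x: ys for x, ys in seen.items() if len(ys) > 1}

# Illustrative records only (not the contents of Table 4).
sales = [
    {"tid": 1, "name": "Smith", "type": "book", "country": "USA", "price": 20, "vat": 1},
    {"tid": 2, "name": "Smith", "type": "book", "country": "USA", "price": 20, "vat": 1},
    {"tid": 3, "name": "Smith", "type": "book", "country": "USA", "price": 25, "vat": 1},
]

violations = fd_violations(sales, ["name", "type", "country"], ["price", "vat"])
```

Here the third record disagrees with the first two on price, so the combination (Smith, book, USA) is reported together with both observed (price, vat) pairs. With only the first two records the result would be empty.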
FDs have traditionally been used in schema design and, as such, specify integrity constraints over entire relations. However, many interesting constraints hold conditionally, that is, on only a subset of the relation.
This brings us to the subject of conditional functional dependencies, or CFDs, which have been proposed as a useful integrity constraint to characterize data quality and identify data inconsistencies.
Suppose that what we are interested in at a given time relative to the integrity of the data is to evaluate the extent to which the records of the table SALES meet both of the following constraints:
                I. The records meet the above-noted FD, [name, type, country]→[price, vat], for that subset of records described by either of
                    1. country=UK and type=clothing
                    2. country=France and type=book
                II. All books purchased in France are charged zero vat per French law.
Looking only at the FD is not going to help us in the desired evaluation of the subset of interest because tuples that agree in X but not in Y are considered violations of the FD, even if they do not match the conditions that we care about—in this case the conditions on country and type. Moreover, so-called “dirty” tuples may go unnoticed. For example, if all books purchased in France have the same value of vat and that value is non-zero, the FD is satisfied but the records in question are problematic because the vat on books sold in France is supposed to be zero. Similar problems occur if a relation integrates data from multiple sources, in which case an FD may hold only on tuples obtained from one particular source.
Conditional functional dependencies (CFDs) address the foregoing. A CFD is composed of an embedded FD X→Y plus a so-called “pattern tableau” that defines those tuples that we care about as obeying the FD. As such, a conditional functional dependency is a construct that augments a functional dependency so as to define—by way of the pattern tableau—a subset of tuples on which the underlying FD should hold.
The pattern tableau is such that for any pattern (row) tp in the pattern tableau, if two tuples have the same values of attributes in X and these values match those in tp, then they must have the same values of attributes in Y and these values must match those in tp. The pattern tableau shown in Table 1, for example, expresses constraints I and II above.
An underscore in the pattern tableau represents a match-all pattern, so that, for example, a standard FD is equivalent to a CFD with a single all-underscores row in the pattern tableau. Constants in the antecedent restrict the scope of the CFD, whereas constants in the consequent fix the values of the corresponding attributes of all matching tuples. In addition, pairs of tuples that do not match any pattern are not considered violations, even if they agree on the antecedent but not the consequent of the embedded FD.
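The pattern-matching semantics described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names WILDCARD, matches and cfd_violations, the sample records, and the two tableau rows (written here to reflect constraints I and II rather than copied from Table 1) are all hypothetical. Flagging every tuple of a disagreeing group is one possible convention for reporting violations.

```python
from collections import defaultdict

WILDCARD = "_"  # match-all pattern symbol

def matches(values, pattern):
    """True if every value equals its pattern entry or the entry is a wildcard."""
    return all(p == WILDCARD or p == v for v, p in zip(values, pattern))

def cfd_violations(rows, lhs, rhs, tableau):
    """Return the tids of tuples that violate the CFD (lhs -> rhs, tableau)."""
    bad = set()
    for lhs_pat, rhs_pat in tableau:
        # Only tuples whose antecedent matches the pattern are in scope.
        groups = defaultdict(list)
        for row in rows:
            x = tuple(row[a] for a in lhs)
            if matches(x, lhs_pat):
                groups[x].append(row)
        for members in groups.values():
            ys = {tuple(r[a] for a in rhs) for r in members}
            for r in members:
                y = tuple(r[a] for a in rhs)
                # Violation: the group disagrees on the consequent, or the
                # consequent does not match the constants in the pattern.
                if len(ys) > 1 or not matches(y, rhs_pat):
                    bad.add(r["tid"])
    return bad

lhs, rhs = ["name", "type", "country"], ["price", "vat"]
tableau = [                                   # hypothetical tableau rows
    (("_", "clothing", "UK"), ("_", "_")),    # constraint I.1
    (("_", "book", "France"), ("_", 0)),      # constraints I.2 and II
]
sales = [                                     # illustrative records only
    {"tid": 1, "name": "Jeans", "type": "clothing", "country": "UK", "price": 30, "vat": 5},
    {"tid": 2, "name": "Jeans", "type": "clothing", "country": "UK", "price": 35, "vat": 5},
    {"tid": 3, "name": "Novel", "type": "book", "country": "France", "price": 10, "vat": 0},
    {"tid": 4, "name": "Atlas", "type": "book", "country": "France", "price": 15, "vat": 2},
    {"tid": 5, "name": "Pen", "type": "stationery", "country": "USA", "price": 2, "vat": 1},
    {"tid": 6, "name": "Pen", "type": "stationery", "country": "USA", "price": 3, "vat": 1},
]

violating_tids = cfd_violations(sales, lhs, rhs, tableau)
plain_fd_tids = cfd_violations(sales, lhs, rhs, [(("_", "_", "_"), ("_", "_"))])
```

In this sketch, tuples 1 and 2 are flagged because they agree on the antecedent but not on price, and tuple 4 is flagged because its vat is non-zero. Tuples 5 and 6 disagree on price but match no pattern, so they are not violations of the CFD, although they are flagged when the tableau is the single all-underscores row that corresponds to the standard FD.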
Existing work on CFDs has considered the problems of validating a given CFD on a relation instance, determining consistency and implications of multiple CFDs, and “repairing” the relation so that the given CFD is satisfied. However, all of this work assumes that a pattern tableau is supplied. What has not been addressed is how to create useful pattern tableaux, something which is needed to realize the full potential of the CFD construct. Indeed, it is not even obvious what design principles should guide the creation of a pattern tableau.
It is also desirable to be able to automate the process of generating the pattern tableau inasmuch as users may not be aware of all the specific constraints that hold over a given relation, this being due, for example, to schema and/or data evolution.
These are among the problems to which the present invention is directed.