In rule discovery in a database, a rule is expressed as a CFD (Conditional Functional Dependency), and, out of the generated CFD rule candidates, the CFD rule(s) matching the contents of the database is (are) output. The following gives a brief survey of the CFD as a basis for understanding the invention.
A CFD is a rule indicating that a functional dependency (abbreviated as “FD”), which expresses a dependency among data attributes, holds for a tuple set specified by a condition. A CFD includes a conditional part and an antecedent (premise) part, which are on the left hand side (LHS) of the rule, and a consequent part on the right hand side (RHS) of the rule, and attributes are specified for each of these parts. The conditional part and the consequent part are also referred to as a conditional clause and a subordinate clause, respectively.
The conditional part specifies a subset of data (tuple set). The conditional part represents that an attribute X takes an attribute value x by using the notation “X=x”, where “x” denotes a specified value. Such a representation of the attribute value is termed “Constant” (that is, a constant value).
The antecedent part of the rule specifies only an attribute. An attribute value that is not fixed to a specified value (that is, a wildcard matching any value) is expressed as “X=_”. Such a representation of the attribute value is termed “Variable” (that is, a variable), and “_” is termed the “unnamed variable”.
There are two sorts of consequent parts:
(A) a consequent part (such as that in the following rule 1) constituted by specifying an attribute and an attribute value; and
(B) a consequent part (such as that in the following rule 2) constituted by specifying only an attribute.
In the case of (A), a consequent part is represented as “A=a”, for example.
In the case of (B), a consequent part is represented as “A=_” or the like, for example. When the consequent part has the attribute value specified, the antecedent (premise) part may be omitted. The antecedent (premise) part and the consequent part may each specify a plurality of attributes and the respective attribute values of those attributes. The following are examples of rules.
rule 1: X1→A (x1∥a)
rule 2: X1, X2→A (x1, _∥_)
The rule 1 states: “If an attribute X1 is of an attribute value x1, the attribute A is of an attribute value a”. If the rule 1 holds, it means that, in the tuple set matched to the conditional part, the consequent part takes the specified value. That is, t[A]=a for every tuple t of the tuple set satisfying the condition X1=x1 (where t[A] indicates the value of the attribute A in the tuple t). Such a rule, in which the consequent part is fixed at a specified value, is termed a constant CFD (Constant CFD).
The rule 2 states: “If an attribute X1 is of an attribute value x1, the attribute A is determined in accordance with the attribute X2”. If the rule 2 holds, it means that, in the tuple set matched to the conditional part, a dependency is present between the attributes specified by the antecedent part and the consequent part. That is, for any pair of tuples t1 and t2 of the tuple set satisfying the condition “X1=x1”, if t1[X2]=t2[X2], then t1[A]=t2[A]. Such a rule, in which the consequent part is not fixed at a specified value but expresses a dependency between attributes, is termed a variable CFD (Variable CFD). In other words, a CFD in which the right hand side of ∥ in the pattern tuple is the unnamed variable ‘_’ (tp[A]=_) is termed a variable CFD.
The symbol ‘∥’ in the pattern tuple (x1∥a) of the rule 1 separates the attribute value of X1 on the left hand side from the attribute value of A on the right hand side. Note that a notation is also used in which “X1→A (x1∥a)” of the rule 1 is denoted as “(X1→A, (x1∥a))”. This notation differs only in the presence of the comma and the outer parentheses, and the two express the same rule. In a similar manner, “X1, X2→A (x1, _∥_)” of the rule 2 may alternatively be denoted as “([X1, X2]→A, (x1, _∥_))”.
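The semantics of the rule 1 and the rule 2 can be illustrated by a minimal sketch in Python. The function names `matches` and `cfd_holds`, the representation of a tuple as a dictionary, and the sample data are assumptions introduced only for illustration and are not taken from the cited literature.

```python
from typing import Dict, List

WILDCARD = "_"  # the unnamed variable
Row = Dict[str, str]  # one tuple: attribute name -> attribute value


def matches(row: Row, attrs: List[str], pattern: List[str]) -> bool:
    """True if the row agrees with every constant of the pattern ('_' matches any value)."""
    return all(p == WILDCARD or row[a] == p for a, p in zip(attrs, pattern))


def cfd_holds(rows: List[Row], lhs: List[str], lhs_pattern: List[str],
              rhs: str, rhs_pattern: str) -> bool:
    """Check a CFD  lhs -> rhs (lhs_pattern || rhs_pattern)  against the rows."""
    matched = [r for r in rows if matches(r, lhs, lhs_pattern)]
    if rhs_pattern != WILDCARD:
        # Constant CFD: every matched tuple must carry the specified value.
        return all(r[rhs] == rhs_pattern for r in matched)
    # Variable CFD: matched tuples agreeing on the LHS attributes must agree on rhs.
    seen: Dict[tuple, str] = {}
    for r in matched:
        key = tuple(r[a] for a in lhs)
        if seen.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True


# Hypothetical sample data.
rows = [
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "q", "A": "b"},
    {"X1": "y1", "X2": "q", "A": "c"},
]

# rule 1: X1 -> A (x1 || a); fails because the third tuple has A=b.
rule1_holds = cfd_holds(rows, ["X1"], ["x1"], "A", "a")
# rule 2: X1, X2 -> A (x1, _ || _); holds because A is determined by X2 where X1=x1.
rule2_holds = cfd_holds(rows, ["X1", "X2"], ["x1", WILDCARD], "A", WILDCARD)
```

In this sketch the same pattern-matching step serves both sorts of rule, and only the check applied to the matched tuples differs between a constant CFD and a variable CFD.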
As indices indicating the degree of effectiveness of a CFD for given data, a support and a confidence, for example, are used. The support is the number of tuples matched by both the conditional part and the antecedent part of a CFD.
The confidence is the ratio of the number of tuples satisfying the CFD rule to the number of tuples matched by both the conditional part and the antecedent part.
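The two indices may be computed, for example, as in the following sketch. The function name `support_and_confidence` and the sample data are hypothetical. For a variable CFD, the text above does not fix how "tuples satisfying the rule" are counted; the sketch assumes that, within each group of tuples sharing the same left-hand-side values, the tuples carrying the most frequent consequent value are counted as satisfying. This is one possible convention, not necessarily that of the cited literature.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

WILDCARD = "_"
Row = Dict[str, str]


def support_and_confidence(rows: List[Row], lhs: List[str], lhs_pattern: List[str],
                           rhs: str, rhs_pattern: str) -> Tuple[int, Optional[float]]:
    """Return (support, confidence); confidence is None when no tuple is matched."""
    matched = [r for r in rows
               if all(p == WILDCARD or r[a] == p for a, p in zip(lhs, lhs_pattern))]
    support = len(matched)
    if support == 0:
        return 0, None
    if rhs_pattern != WILDCARD:
        # Constant consequent: count the tuples carrying the specified value.
        satisfying = sum(1 for r in matched if r[rhs] == rhs_pattern)
    else:
        # Assumed convention for a variable consequent: majority value per LHS group.
        groups: Dict[tuple, Counter] = defaultdict(Counter)
        for r in matched:
            groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
        satisfying = sum(max(c.values()) for c in groups.values())
    return support, satisfying / support


# Hypothetical data: four tuples match X1=x1 and three of them have A=a.
rows = [
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "a"},
    {"X1": "x1", "A": "b"},
    {"X1": "y1", "A": "c"},
]
support, confidence = support_and_confidence(rows, ["X1"], ["x1"], "A", "a")
# support is 4 and confidence is 0.75 (3 of the 4 matched tuples have A=a)
```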
Given a plurality of CFDs, a CFD satisfying both the “left-reduced” condition and the “most-general” condition is termed minimal. The following describes “left-reduced”. Given a plurality of CFDs, a CFD whose left hand side (LHS) attribute set does not include the left hand side attribute set of any other CFD is said to be “left-reduced”.
When the following rules 3 and 4 are given, for example, the left hand side of the rule 4 includes the left hand side of the rule 3 ({X1, Y}⊂{X1, X2, Y}). Thus, the rule 4 is not “left-reduced”. Conversely, the left hand side of the rule 3 does not include the left hand side of the rule 4. Thus, the rule 3 is said to be “left-reduced”. In this case, the rule 4 may be deleted as a redundant CFD with respect to the rule 3.
Rule 3: X1, Y→A (x1, _∥_)
Rule 4: X1, X2, Y→A (x1, x2, _∥_)
The following describes “most-general”. Given a plurality of CFDs, a CFD is said to be “most-general” when no constant attribute value on its left hand side can be replaced by ‘_’ (Variable) to yield another of the given CFDs.
When the following rules 5 and 6 are given, for example, the rule 5 can be obtained by replacing an attribute value x2 of the rule 6 by a variable. Thus, the rule 6 is not “most-general”. Conversely, the rule 5 is “most-general”. In this case, the rule 6 is a redundant CFD with respect to the rule 5 and may be deleted as such.
Rule 5: X1, X2→A (x1, _∥a)
Rule 6: X1, X2→A (x1, x2∥a)
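The removal of redundant CFDs described for the rules 3 to 6 can be sketched as follows. The names `CFD`, `subsumes` and `minimal` are hypothetical. The sketch assumes that one CFD makes another redundant when both have the same consequent, the left-hand-side attributes of the former are contained in those of the latter, and each constant of the former equals the corresponding value of the latter; this single test is an assumption that covers both the "left-reduced" example and the "most-general" example above.

```python
from typing import Dict, List, NamedTuple

WILDCARD = "_"


class CFD(NamedTuple):
    lhs: Dict[str, str]  # LHS attribute -> constant value or "_"
    rhs_attr: str        # consequent attribute
    rhs_value: str       # constant value or "_"


def subsumes(general: CFD, specific: CFD) -> bool:
    """True if `general` makes `specific` redundant under the stated assumption."""
    if general == specific:
        return False
    if (general.rhs_attr, general.rhs_value) != (specific.rhs_attr, specific.rhs_value):
        return False
    for attr, value in general.lhs.items():
        if attr not in specific.lhs:
            return False  # general uses an attribute that specific does not have
        if value != WILDCARD and value != specific.lhs[attr]:
            return False  # general fixes a constant that specific does not share
    return True


def minimal(cfds: List[CFD]) -> List[CFD]:
    """Keep only the CFDs that no other CFD in the list makes redundant."""
    return [c for c in cfds if not any(subsumes(o, c) for o in cfds)]


rule3 = CFD({"X1": "x1", "Y": WILDCARD}, "A", WILDCARD)
rule4 = CFD({"X1": "x1", "X2": "x2", "Y": WILDCARD}, "A", WILDCARD)
rule5 = CFD({"X1": "x1", "X2": WILDCARD}, "A", "a")
rule6 = CFD({"X1": "x1", "X2": "x2"}, "A", "a")

kept = minimal([rule3, rule4, rule5, rule6])
# kept contains rule3 and rule5; rule4 and rule6 are dropped as redundant
```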
The foregoing is the outline of the CFD.
An apparatus that discovers a rule from a database includes a storage means (storage unit), such as a magnetic disc, in which a CFD is stored; a calculation means (calculation unit) that generates a CFD candidate and decides whether or not the CFD candidate matches the contents of the database; and a saving means (saving unit) that stores a CFD decided to match the contents of the database in a storage apparatus or memory. The storage means stores the CFD obtained using the rule discovery algorithm. The calculation means generates a CFD candidate as a subject to be checked, checks whether or not the candidate matches the contents of the database and, if it matches, outputs the CFD as a valid CFD. The saving means stores the valid CFD thus obtained in the storage apparatus.
As techniques to discover a rule in a database, the following techniques are known, as disclosed in, for example, Non-Patent Literature 1:
(1) a technique of generating a candidate of a constant CFD from a free itemset and a corresponding closed itemset;
(2) a technique of generating a CFD candidate by generating a list of attribute-value pairs by a breadth first search, placing one of the terms in a subordinate term (indicated as A) and placing the other term in a conditional part (indicated as X) to obtain an expression:
X→A; and
(3) a technique of setting a free itemset as a conditional term, placing one attribute not included in the free itemset in a subordinate term (consequent part) and searching an attribute added to the conditional term by depth first search to generate a CFD candidate.
An attribute-value pair appearing in a database is termed an item, and a set of items is termed an itemset. A free itemset is an itemset whose frequency strictly increases when any one or more of its items is (are) removed.
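The definition of the free itemset, and its use together with a closed itemset in the technique (1), can be illustrated by the following sketch. The function names and the sample data are hypothetical, and the sketch is not the algorithm of Non-Patent Literature 1. It assumes the usual definition of the closed itemset as the largest itemset contained in exactly the tuples that contain the given itemset. Because the frequency never decreases when items are removed, it suffices to test the removal of each single item.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]  # (attribute, value)
Row = Dict[str, str]


def frequency(rows: List[Row], itemset: FrozenSet[Item]) -> int:
    """Number of tuples containing every item of the itemset."""
    return sum(1 for r in rows if all(r.get(a) == v for a, v in itemset))


def is_free(rows: List[Row], itemset: FrozenSet[Item]) -> bool:
    """True if removing any single item strictly increases the frequency."""
    freq = frequency(rows, itemset)
    return all(frequency(rows, itemset - {item}) > freq for item in itemset)


def closure(rows: List[Row], itemset: FrozenSet[Item]) -> FrozenSet[Item]:
    """Items shared by all tuples that contain the itemset (assumed closed itemset)."""
    matching = [r for r in rows if all(r.get(a) == v for a, v in itemset)]
    if not matching:
        return frozenset(itemset)
    common = set(matching[0].items())
    for r in matching[1:]:
        common &= set(r.items())
    return frozenset(common)


# Hypothetical sample data.
rows = [
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "q", "A": "a"},
    {"X1": "y1", "X2": "p", "A": "b"},
]

free = frozenset({("X1", "x1")})
closed = closure(rows, free)
# Each item of the closure outside the free itemset suggests a constant CFD candidate;
# here the only such item is ("A", "a"), i.e. the candidate X1 -> A (x1 || a).
candidates = sorted(closed - free)
```

In this data, {X1=x1} is free, whereas {X1=x1, A=a} is not, because removing A=a leaves the frequency unchanged at 2.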
As described above, there is a confidence among the indices indicating to what extent contents of a database coincide with a CFD.
Non-Patent Literature 2 discloses a technique for discovering a rule (CFD) which, though not fully coincident with the contents of a database, has a high confidence value. According to this discovery technique, breadth first search is used to discover a CFD having a confidence greater than or equal to a threshold value. Such a CFD is referred to as an “approximate CFD”, meaning that the CFD “substantially holds”.
To check the validity of a rule, a rule base management apparatus is disclosed in, for example, Patent Literature 1. The rule base management apparatus includes a rule base to store a rule including a conditional part and a conclusion part, an instance information database to store instance information concerning results of application of the rule, a correlation unit to correlate a rule with the instance information satisfying the rule, and a validity check unit. The validity check unit causes an instance retrieving unit to retrieve sets of instance information from the instance information database, using the conditional part of the rule to be checked for validity as a key, calculates the ratio of the instance information satisfying the consequent part of the rule in the retrieved set of instance information, and checks the validity of the rule based on the ratio thus found. Patent Literature 2 also discloses a configuration in which a functional dependency (FD) between attributes of a relation is used to perform normalization by relation splitting.