In rule discovery in a database, a rule is expressed as a CFD (Conditional Functional Dependency) and, out of the CFD rule candidates generated, the CFD rule or rules consistent with the contents of the database are output. The following gives a brief survey of CFDs, which is the basis for understanding the invention.
A CFD is a rule indicating that a functional dependency (abbreviated as FD), which expresses a dependency among data attributes, holds for a tuple set specified by a condition. A CFD includes a conditional part and an antecedent (premise) part, which are on the left-hand side (LHS) of the rule, and a consequent part on the right-hand side (RHS) of the rule; attributes are specified for each of these parts. The conditional part is also referred to as a conditional clause, and the dependency or consequent part is also referred to as a subordinate clause.
The conditional part specifies a subset of data (tuple set). It represents that an attribute X has an attribute value x by the notation ‘X=x’, where ‘x’ is a specified value. Such a representation of the attribute value is termed a ‘constant’, meaning a constant value.
The antecedent part of the rule specifies only an attribute. An attribute value that is not a specified value (that is, a wildcard matching any value) is expressed as ‘X=_’. Such a representation of the attribute value is termed a ‘variable’, and ‘_’ is here termed an ‘unnamed variable’.
There are two sorts of consequent part:
    (A) a consequent part that specifies an attribute and an attribute value (see rule 1 below, as an example); and
    (B) a consequent part that specifies only an attribute (see rule 2 below, as an example).
The consequent part is expressed as ‘A=a’, for example, in case (A), and as ‘A=_’, for example, in case (B). In the case where the consequent part includes specification of an attribute value, the antecedent part may be omitted. The antecedent part and the consequent part may also each be composed of a plurality of attributes with respective attribute values specified. The following shows example rules:
    Rule 1: X1→A (x1∥a)
    Rule 2: X1, X2→A (x1, _∥_)
Rule 1 states: ‘If an attribute X1 has an attribute value x1, the attribute A has an attribute value a’. If rule 1 is valid, then for the tuple set matching the conditional part, the consequent part takes the specified value. In short, t[A]=a for every tuple t of the tuple set satisfying the condition X1=x1, where t[A] denotes the value of the attribute A in the tuple t. Such a rule, in which the consequent part is fixed at a specified value, is termed a ‘constant CFD’.
Rule 2 states: ‘If an attribute X1 has an attribute value x1, the attribute A is determined in accordance with the attribute X2’. If rule 2 is valid, then in the tuple set matching the conditional part, a dependency is present between the attributes specified by the antecedent part and the consequent part. That is, for any tuple pair t1 and t2 of the tuple set satisfying the condition ‘X1=x1’, if t1[X2]=t2[X2], then t1[A]=t2[A]. Such a rule, in which the consequent part is not fixed at a specified value but a dependency holds between attributes, is termed a ‘variable CFD’. In other words, a CFD in which the right-hand side of the pattern tuple, after the symbol ‘∥’, is the unnamed variable ‘_’ (Tp[A]=_) is termed a variable CFD.
The symbol ‘∥’ in the pattern tuple (x1∥a) of rule 1 separates the attribute value of X1 on the left-hand side from the attribute value of A on the right-hand side. Note that rule 1, “X1→A (x1∥a)”, is sometimes written as “(X1→A, (x1∥a))”; the two notations differ only in the presence of the comma and the outer parentheses and express the same rule. In a similar manner, X1, X2→A (x1, _∥_) of rule 2 is alternatively written as ([X1, X2]→A, (x1, _∥_)).
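The semantics of rules 1 and 2 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a tuple is a dict from attribute name to value and that a rule is given as a left-hand-side pattern (a dict using '_' for the unnamed variable), a consequent attribute, and a consequent pattern value. The names `matches` and `holds` and the sample data are hypothetical and are not taken from any cited technique.

```python
WILDCARD = "_"  # the unnamed variable


def matches(row, lhs):
    """True if the row agrees with every constant in the left-hand-side pattern."""
    return all(value == WILDCARD or row[attr] == value for attr, value in lhs.items())


def holds(rows, lhs, rhs_attr, rhs_value):
    """True if the CFD (lhs -> rhs_attr with pattern rhs_value) holds exactly on rows."""
    selected = [row for row in rows if matches(row, lhs)]
    if rhs_value != WILDCARD:
        # Constant CFD: every selected tuple must carry the specified value.
        return all(row[rhs_attr] == rhs_value for row in selected)
    # Variable CFD: tuples that agree on the wildcard attributes must agree on rhs_attr.
    variable_attrs = [attr for attr, value in lhs.items() if value == WILDCARD]
    first_seen = {}
    for row in selected:
        key = tuple(row[attr] for attr in variable_attrs)
        if first_seen.setdefault(key, row[rhs_attr]) != row[rhs_attr]:
            return False
    return True


# Hypothetical data set used only to exercise the two rules.
rows = [
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "q", "A": "b"},
    {"X1": "x1", "X2": "q", "A": "b"},
    {"X1": "y1", "X2": "q", "A": "c"},
]
rule1_holds = holds(rows, {"X1": "x1"}, "A", "a")                       # X1→A (x1∥a)
rule2_holds = holds(rows, {"X1": "x1", "X2": WILDCARD}, "A", WILDCARD)  # X1, X2→A (x1, _∥_)
```

On this sample data, rule 2 holds because tuples with X1=x1 that agree on X2 also agree on A, whereas rule 1 does not hold because two tuples with X1=x1 have A=b.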
As indices indicating the degree of effectiveness of a CFD for given data, support and confidence, for example, are used. The support is the number of tuples that match the conditional part and the antecedent part of a CFD.
The confidence is the ratio of the number of tuples satisfying the CFD rule to the number of tuples that match the conditional part and the antecedent part.
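As a rough illustration of these two indices, the following Python sketch computes them for a rule given as a left-hand-side pattern and a consequent. The text does not say how "tuples satisfying the rule" are counted for a variable CFD; the sketch assumes that, within each group of tuples agreeing on the wildcard attributes, the tuples carrying the most frequent consequent value are counted as satisfying. That convention, the function names, and the sample data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

WILDCARD = "_"  # the unnamed variable


def matches(row, lhs):
    """True if the row agrees with every constant in the left-hand-side pattern."""
    return all(value == WILDCARD or row[attr] == value for attr, value in lhs.items())


def support_and_confidence(rows, lhs, rhs_attr, rhs_value):
    """Return (support, confidence) of the CFD on rows."""
    matched = [row for row in rows if matches(row, lhs)]
    support = len(matched)
    if support == 0:
        return 0, 0.0
    if rhs_value != WILDCARD:
        # Constant consequent: count matched tuples carrying the specified value.
        satisfying = sum(1 for row in matched if row[rhs_attr] == rhs_value)
    else:
        # Variable consequent (assumed convention): per group of equal
        # wildcard-attribute values, count the most frequent consequent value.
        variable_attrs = [attr for attr, value in lhs.items() if value == WILDCARD]
        groups = defaultdict(Counter)
        for row in matched:
            groups[tuple(row[attr] for attr in variable_attrs)][row[rhs_attr]] += 1
        satisfying = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return support, satisfying / support


# Hypothetical data: four tuples match X1=x1, three of them are consistent.
rows = [
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "p", "A": "a"},
    {"X1": "x1", "X2": "q", "A": "b"},
    {"X1": "x1", "X2": "q", "A": "a"},
    {"X1": "y1", "X2": "p", "A": "c"},
]
constant_result = support_and_confidence(rows, {"X1": "x1"}, "A", "a")
variable_result = support_and_confidence(rows, {"X1": "x1", "X2": WILDCARD}, "A", WILDCARD)
```

A rule whose confidence is 1.0 is fully consistent with the data; a value below 1.0 indicates that some matched tuples violate the rule.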
Given a plurality of CFDs, a CFD satisfying the two conditions of ‘left-reduced’ and ‘most general’ is termed ‘minimal’.
The following describes the meaning of ‘left-reduced’. Given a plurality of CFDs, a CFD whose left-hand-side attribute set does not include the left-hand-side attribute set of any other CFD is said to be ‘left-reduced’.
For example, given the following rules 3 and 4:
    Rule 3: X1, Y→A (x1, _∥_)
    Rule 4: X1, X2, Y→A (x1, x2, _∥_)
the left-hand side of rule 4 includes the left-hand side of rule 3 ({X1, Y}⊂{X1, X2, Y}), so that rule 4 is not ‘left-reduced’. Conversely, the left-hand side of rule 3 does not include the left-hand side of rule 4. Hence, rule 3 is said to be ‘left-reduced’. In this case, rule 4 is a redundant CFD with respect to rule 3 and hence may be deleted.
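The comparison of rules 3 and 4 can be sketched in Python as a check over a given set of rules. The sketch follows the set-based definition given in the text; it assumes that only rules with the same consequent are compared and that "includes" means a proper superset of attributes, neither of which is stated explicitly. The dict layout and the name `is_left_reduced` are hypothetical, and the pattern for Y in rule 4 is taken to be the unnamed variable.

```python
def is_left_reduced(rule, rules):
    """True if no other rule with the same consequent has a left-hand-side
    attribute set that is a proper subset of this rule's attribute set."""
    lhs_attrs = set(rule["lhs"])
    return not any(
        other is not rule
        and other["rhs"] == rule["rhs"]
        and set(other["lhs"]) < lhs_attrs  # proper subset of attributes
        for other in rules
    )


# Rules 3 and 4, encoded as attribute -> pattern value.
rule3 = {"lhs": {"X1": "x1", "Y": "_"}, "rhs": ("A", "_")}
rule4 = {"lhs": {"X1": "x1", "X2": "x2", "Y": "_"}, "rhs": ("A", "_")}
rules = [rule3, rule4]

# Keep only the left-reduced rules; rule 4 is dropped as redundant.
kept = [rule for rule in rules if is_left_reduced(rule, rules)]
```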
The following describes the meaning of ‘most general’. Given a plurality of CFDs, a CFD is said to be ‘most-general’ if none of the constant attribute values on its left-hand side can be replaced by ‘_’ (a variable) to obtain another of the given CFDs.
For example, given the following rules 5 and 6:
    Rule 5: X1, X2→A (x1, _∥a)
    Rule 6: X1, X2→A (x1, x2∥a)
rule 5 may be obtained by replacing the attribute value x2 of rule 6 by a variable. Hence, rule 6 is not ‘most-general’. Conversely, rule 5 is said to be ‘most-general’. In this case, rule 6 is a redundant CFD with respect to rule 5 and hence may be deleted.
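The relationship between rules 5 and 6 can likewise be sketched in Python. The sketch assumes that a rule is not most-general when the given set contains another rule over the same attributes and the same consequent whose pattern is obtained by replacing one or more constants with '_'. The dict layout and the names `generalizes` and `is_most_general` are hypothetical.

```python
WILDCARD = "_"  # the unnamed variable


def generalizes(general, specific):
    """True if general's pattern equals specific's pattern with at least one
    left-hand-side constant replaced by the unnamed variable."""
    if general["rhs"] != specific["rhs"]:
        return False
    if general["lhs"].keys() != specific["lhs"].keys():
        return False
    replaced = 0
    for attr, value in specific["lhs"].items():
        general_value = general["lhs"][attr]
        if general_value == value:
            continue
        if general_value != WILDCARD:
            return False  # the two patterns use different constants
        replaced += 1
    return replaced > 0


def is_most_general(rule, rules):
    """True if no rule in the given set generalizes this rule."""
    return not any(generalizes(other, rule) for other in rules)


# Rules 5 and 6, encoded as attribute -> pattern value.
rule5 = {"lhs": {"X1": "x1", "X2": "_"}, "rhs": ("A", "a")}
rule6 = {"lhs": {"X1": "x1", "X2": "x2"}, "rhs": ("A", "a")}
rules = [rule5, rule6]

# Keep only the most-general rules; rule 6 is dropped as redundant.
kept = [rule for rule in rules if is_most_general(rule, rules)]
```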
The foregoing is the outline of the CFD.
An apparatus that discovers a rule from a database includes a storage means (storage unit), such as a magnetic disc, in which a CFD is stored; a calculation means (calculation unit) that generates a CFD candidate and decides whether or not the CFD candidate matches the contents of the database; and a saving means (saving unit) that stores a CFD decided to match the contents of the database in a storage device or memory. The storage means stores the CFD obtained using the rule discovery algorithm. The calculation means generates a CFD candidate as a subject to be checked, checks whether or not it matches the contents of the database and, if it does, outputs the CFD as a valid CFD. The saving means stores the valid CFD thus obtained in the storage device.
As techniques to discover a rule in a database, the following techniques are known, as indicated in, e.g., Non-Patent Literature 1:
    (1) a technique of generating a candidate of a constant CFD from a free itemset and a corresponding closed itemset;
    (2) a technique of generating a CFD candidate by generating a list of attribute-value pairs by a breadth first search, placing one of the terms in a subordinate term (indicated as A) and placing the other terms in a conditional part (indicated as X) to obtain an expression: X→A; and
    (3) a technique of setting a free itemset as a conditional term (conditional part), placing one attribute not included in the free itemset in a subordinate term (consequent part) and searching for an attribute to be added to the conditional term by depth first search to generate a CFD candidate.
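The general shape of technique (2) can be illustrated with a small Python sketch that enumerates lists of attribute-value pairs level by level (smaller lists first) and, for each list, places each term in turn in the subordinate term. This is only a sketch of the enumeration order under simplifying assumptions; it performs no pruning or validity checking and is not the algorithm of Non-Patent Literature 1. The function name and data layout are hypothetical.

```python
from itertools import combinations


def candidates_breadth_first(rows, max_size):
    """Yield (conditional_part, subordinate_term) candidates, smallest lists first.

    Each item is an (attribute, value) pair observed in rows; each candidate
    list uses any attribute at most once.
    """
    items = sorted({(attr, value) for row in rows for attr, value in row.items()})
    for size in range(2, max_size + 1):  # level-wise: one level per list length
        for combo in combinations(items, size):
            if len({attr for attr, _ in combo}) < size:
                continue  # skip lists that repeat an attribute
            for rhs in combo:
                lhs = tuple(item for item in combo if item != rhs)
                yield lhs, rhs  # candidate of the form X→A


# Hypothetical two-attribute data set.
rows = [
    {"X": "x", "A": "a"},
    {"X": "y", "A": "a"},
]
candidates = list(candidates_breadth_first(rows, max_size=2))
```

Each generated candidate would then be checked against the contents of the database before being output as a valid CFD.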
A pair of an attribute and an attribute value appearing in a database is termed an item, and a set of items is termed an itemset. A free itemset is an itemset whose frequency strictly increases when any one or more of its items are removed.
As discussed in the foregoing, the confidence is among the indices indicating the extent to which the contents of a database coincide with a CFD.
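The definition of a free itemset can be checked directly, as in the following Python sketch. It relies on the fact that removing items never decreases the frequency, so it is enough to test the removal of each single item. The function names and the sample data are hypothetical.

```python
def frequency(rows, itemset):
    """Number of tuples that contain every (attribute, value) item of the itemset."""
    return sum(1 for row in rows if all(row.get(attr) == value for attr, value in itemset))


def is_free(rows, itemset):
    """True if removing any single item strictly increases the frequency.

    Because frequency never decreases when items are removed, this also
    covers the removal of more than one item.
    """
    items = frozenset(itemset)
    base = frequency(rows, items)
    return all(frequency(rows, items - {item}) > base for item in items)


# Hypothetical data: X2=q occurs only together with X1=x1.
rows = [
    {"X1": "x1", "X2": "p"},
    {"X1": "x1", "X2": "q"},
    {"X1": "y1", "X2": "p"},
]
free_example = is_free(rows, {("X1", "x1"), ("X2", "p")})
non_free_example = is_free(rows, {("X1", "x1"), ("X2", "q")})
```

In the second example the itemset is not free, since removing X1=x1 leaves the frequency unchanged at one tuple.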
Non-Patent Literature 2 discloses a technique for discovering a rule (CFD) which, though not fully coincident with the contents of a database, has a high confidence value. According to this discovery technique, breadth first search is used to discover a CFD having a confidence greater than or equal to a threshold value. Such a CFD is referred to below as an ‘approximate CFD’, meaning that the CFD ‘substantially holds’.
To check the validity of a rule, a rule base management apparatus is disclosed in, for example, Patent Literature 1. The rule base management apparatus includes a rule base that stores a rule composed of a conditional part and a conclusion part, an instance information database that stores instance information concerning the results of application of the rule, a correlation part that correlates a rule with the instance information satisfying the rule, and a validity check part. The validity check part causes an instance retrieving part to retrieve sets of instance information from the instance information database, using the conditional part of the rule to be checked for validity as a key, calculates the ratio of the instance information satisfying the consequent part of the rule within the retrieved set, and checks the validity of the rule based on the ratio thus found. Patent Literature 2 also discloses a configuration in which a functional dependency (FD) between attributes of a relation is used to effectuate normalization by relation splitting.
Patent Literature 1:
    International Publication No. WO2004/036496A1
Patent Literature 2:
    JP Patent Kokai Publication No. JP-H06-110749A
Non-Patent Literature 1:
    Wenfei Fan et al., “Discovering Conditional Functional Dependencies”, pp. 1231-1234, IEEE International Conference on Data Engineering, 2009, retrieved on Apr. 9, 2012, Internet URL http://homepages.inf.ed.ac.uk/fgeerts/pdf/icde09.pdf
Non-Patent Literature 2:
    Chiang et al., “Discovering Data Quality Rules”, in VLDB, 2008, retrieved on Apr. 9, 2012, Internet URL http://dblab.cs.toronto.edu/~fchiang/docs/vldb08.pdf