The present invention relates to a method and a system for data analysis of a database and a data warehouse. In particular, the invention relates to data mining, i.e. analysis of hidden associations among data in a database.
In data mining, useful information buried in an enormous amount of data is extracted by analyzing the data; in particular, an association rule, which indicates an association among data, is discovered. For instance, by mining association rules from receipt data, i.e. records of customer purchases, in retail business, it is possible to learn the purchase pattern of each customer. When a rule is mined which means "a customer who purchased commodity A and commodity B at the same time is very likely to buy commodity C and commodity D at the same time", it is found that commodities A and B are related to the sale of commodities C and D, which is useful in determining sales policy, e.g. the display and arrangement of commodities or the selection of special-price commodities.
According to the conventional type of association rule, the number of appearances of a specific item is counted in a group of data comprising a plurality of items, and at least one rule of association between the items is obtained. Examples are described below. When a database comprising 5 items A1, A2, A3, A4 and A5 is given as shown in FIG. 2, an association rule between the items as shown in equation 5 is obtained:
(A1=yes)∩(A3=yes) ⇒ (A5=yes)    (5)
This rule means: "In the data where items A1 and A3 appear, item A5 is also likely to appear very often." Such a rule is called an "association rule". The association rule is expressed as:
(Assumption) ⇒ (Conclusion)    (6)
Each of the assumption and the conclusion comprises at least one item. In the case of equation 5, the assumption is (A1=yes)∩(A3=yes), and the conclusion is (A5=yes).
The association rule has two indices: support and confidence. Support is the proportion of data to which an association rule is applicable, i.e. the ratio of records in which the combination of items in the association rule appears. The number of appearances of the combination of items is called the "support count". Confidence is an accuracy index of the association rule, i.e. the ratio at which the conclusion is satisfied when the assumption of the association rule is satisfied.
For instance, for the association rule of equation 5 mined from the database of FIG. 2, the total number of records is 5, and the number of records supporting the association rule of equation 5 is 2. Thus, the support count is 2 and the support is 0.4 (40%). With regard to confidence, the number of records where all items in the association rule of equation 5 appear is 2, and the number of records where the items of the assumption appear is 3. Thus, the confidence is about 0.67 (67%).
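The arithmetic of this example can be illustrated with a short sketch. The five records below are hypothetical (they are not the actual contents of FIG. 2) and were chosen only so that the counts agree with the figures quoted above; the function names are likewise illustrative.

```python
# Hypothetical 5-record database (NOT the actual contents of FIG. 2),
# chosen so that the counts match the worked example in the text.
records = [
    {"A1", "A3", "A5"},
    {"A1", "A2", "A3", "A5"},
    {"A1", "A3"},
    {"A2", "A4"},
    {"A4", "A5"},
]

def support_count(itemset, records):
    """Number of records that contain every item of the itemset."""
    return sum(1 for r in records if itemset <= r)

def support(itemset, records):
    """Support count divided by the total number of records."""
    return support_count(itemset, records) / len(records)

def confidence(assumption, conclusion, records):
    """Ratio of records satisfying the whole rule to records satisfying the assumption."""
    return support_count(assumption | conclusion, records) / support_count(assumption, records)

assumption = {"A1", "A3"}
conclusion = {"A5"}
print(support_count(assumption | conclusion, records))         # 2
print(support(assumption | conclusion, records))               # 0.4
print(round(confidence(assumption, conclusion, records), 2))   # 0.67
```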
Studies on mining association rules have been performed in the field of data mining. For instance, such studies are described in the following references:
(1) R. Agrawal, T. Imielinski, and A. Swami: "Mining association rules between sets of items in large databases"; Proceedings of the ACM SIGMOD Conference on Management of Data, May 1993. (Reference 1)
(2) R. Agrawal and R. Srikant: "Fast algorithms for mining association rules"; Proceedings of the 20th VLDB Conference, 1994. (Reference 2)
According to these methods, all association rules that satisfy minimum support and minimum confidence, i.e. minimum values of support and confidence set by the user, are mined from a database comprising a plurality of items. These methods are based on the property that the support of a product set of items X (hereinafter referred to as an "itemset") cannot be smaller than the support of an itemset Y (Y⊃X) that contains X, and candidates for association rules are pruned accordingly. This method can be achieved according to the block diagram of FIG. 3. First, the database is scanned, the number of appearances of each item (support count) is counted, and every item that satisfies minimum support is picked up. These are called "frequent items". Next, pairs of frequent items are combined to prepare itemsets for which neither support count nor support is yet known. Such an itemset is called a "candidate itemset". The database is scanned again, the support count of each candidate itemset is counted, and the itemsets that satisfy minimum support are obtained. An itemset satisfying the minimum support is called a "frequent itemset". Then, from the frequent itemsets containing (k-1) items, itemsets whose leading (k-2) items are equal to each other are combined, and a candidate itemset comprising "k" items is prepared. In this case, based on the above-mentioned property of itemsets, the candidate itemsets are pruned. Specifically, when a combination of (k-1) items that is not present among the frequent itemsets comprising (k-1) items is contained in a candidate itemset comprising "k" items prepared above, the candidate itemset is deleted. Next, the support count of each remaining candidate itemset is counted by scanning the database. The itemsets that satisfy minimum support are picked up, and the frequent itemsets are obtained.
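The join and pruning of candidate itemsets described above can be sketched as follows. This is only an illustration of the general technique of References 1 and 2 under simplifying assumptions: itemsets are kept as sorted tuples, and the function name and the sample itemsets are hypothetical.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Build candidate k-itemsets from frequent (k-1)-itemsets.

    Join: combine two (k-1)-itemsets whose leading (k-2) items are equal.
    Prune: drop a candidate if any of its (k-1)-item subsets is not frequent.
    """
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] != b[:k - 2]:
                continue
            # a < b with an equal prefix, so a[-1] < b[-1] and cand stays sorted
            cand = a + (b[-1],)
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates

# Hypothetical frequent 2-itemsets
frequent_2 = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
print(generate_candidates(frequent_2, 3))  # [('A', 'B', 'C')]
```

In this sample, the joined itemset BCD is pruned because its subset CD is not among the frequent 2-itemsets, so its support count never needs to be counted against the database.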
This process is repeated for each number of items, and is continued until the set of newly picked frequent itemsets becomes empty. When the process to obtain the frequent itemsets has been completed, all frequent itemsets that satisfy minimum support have been picked up. Next, using the frequent itemsets thus obtained, all association rules are derived. From a frequent itemset in which the number of items "k" is 2 or more, at least one association rule is derived which has "n" items in the assumption and (k-n) items in the conclusion. Here, "n" is a positive integer less than "k". For instance, from a frequent itemset ABC having 3 items, 6 association rules can be derived: {A, B}⇒{C}, {A, C}⇒{B}, {B, C}⇒{A}, {A}⇒{B, C}, {B}⇒{A, C}, and {C}⇒{A, B}. As each association rule is derived, its confidence is calculated from the support of the itemset of the whole rule and the support of the itemset in the assumption, and only the association rules that satisfy the minimum confidence set by the user are mined. For instance, in case the support of the itemset {A, B, C} is 0.3 and the support of the itemset {A, C} is 0.4, the confidence of the association rule {A, C}⇒{B} is calculated as 0.75 (75%).
A method for efficiently mining frequent itemsets is described, for instance, in: J. Han, J. Pei, and Y. Yin: "Mining frequent patterns without candidate generation"; Proceedings of ACM SIGMOD International Conference on Management of Data, 2000 (Reference 3). In Reference 3, the database is first scanned, and the frequent items satisfying a minimum support value are found. The database is then scanned again, a tree structure containing only the frequent items in the records of the database is constructed, and this is stored in main memory. By analyzing only the tree structure in main memory, all frequent itemsets containing 2 or more items are mined.
The association rules mined by the methods of References 1, 2 and 3 are only affirmative rules. For instance, from the database shown in FIG. 2, the association rule {A1, A3}⇒{A5} is mined by the above method. This means: (A1=yes)∩(A3=yes) ⇒ (A5=yes). That is, when A1 and A3 are contained in the data, A5 is also contained. Here, the fact that a certain item A is contained in the data, i.e. an item expressed as (A=yes), is called an "affirmative item". Also, an expression that the item A is not contained in the data, i.e. an item which means the reverse of the affirmative item and is expressed as (A=no), is called a "negative item".
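The derivation of rules from a frequent itemset can be sketched as below. The support counts are hypothetical and assume a database of 20 records, chosen so that the supports of {A, B, C} and {A, C} are 0.3 and 0.4 as in the example; counts are used instead of ratios only to keep the division exact.

```python
from itertools import combinations

def derive_rules(itemset, supp_count, min_conf):
    """Enumerate rules (assumption => conclusion) from one frequent itemset.

    Every non-empty proper subset of the itemset is tried as the assumption;
    the remaining items form the conclusion.
    """
    items = sorted(itemset)
    whole = supp_count[frozenset(items)]
    rules = []
    for n in range(1, len(items)):
        for assumption in combinations(items, n):
            conclusion = tuple(i for i in items if i not in assumption)
            conf = whole / supp_count[frozenset(assumption)]
            if conf >= min_conf:
                rules.append((assumption, conclusion, conf))
    return rules

# Hypothetical support counts out of 20 records
supp_count = {
    frozenset("A"): 12, frozenset("B"): 10, frozenset("C"): 10,
    frozenset("AB"): 7, frozenset("AC"): 8, frozenset("BC"): 8,
    frozenset("ABC"): 6,
}

print(len(derive_rules("ABC", supp_count, 0.0)))  # 6 rules in total
for assumption, conclusion, conf in derive_rules("ABC", supp_count, 0.7):
    print(assumption, "=>", conclusion, round(conf, 2))
```

With a minimum confidence of 0.7, only the three rules with two items in the assumption survive in this sample, including {A, C}⇒{B} with confidence 0.75.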
The method for mining association rules with negative items is described, for instance, in:
(1) A. Savasere, E. Omiecinski, and S. Navathe: "Mining for strong negative associations in a large database of customer transactions"; Proceedings of International Conference on Data Engineering, 1998. (Reference 4)
(2) Japanese Patent Publication No. 11-328186: "Method for generating association rule and system for generating association rule". (Reference 5)
According to these references, association rules are mined which contain only affirmative items in the assumption and only negative items in the conclusion, e.g. an association rule (A=yes)∩(B=yes) ⇒ (D=no), meaning that "when phenomena A and B occur, phenomenon D does not occur".
In Reference 4, a classification hierarchy of items is used. An expected value of the support count of the itemset in the conclusion is obtained. If the support count in the database is smaller than the expected value, a negative association rule is mined. For instance, when the item A is an upper hierarchy of the items B and C and the support of an itemset AX is 10% for an item X in a different hierarchy, the expected value of the support of each of the itemsets BX and CX is calculated as 5%. Here, if the actual support of the itemset BX is 2%, it is smaller than the expected value of the support. Thus, from the itemset, the following association rule is derived:
X ⇒ ~B    (7)
According to Reference 5, confidence is calculated, and a χ² test (chi-square test) is performed using the support of the itemset in the assumption and the support of the itemset in the conclusion, and an association rule that is statistically significant (although low in confidence) is mined.
As described above, according to the conventional methods, an association rule containing only affirmative items, or an association rule containing only affirmative items in the assumption and only negative items in the conclusion, is mined, and no consideration is given to an association rule that contains negative items in the assumption. Such an association rule may offer important information which is not indicated by the conventional simple association rules. For instance, when only affirmative items are considered in manufacturing data, a rule is mined such as: "When materials A and B are used as raw materials for a product, the non-defective percentage is 70%." When negative items are also considered, a rule may be mined such as: "When materials A and B are used but material X is not used, the non-defective percentage is 95%." By taking negation into consideration, it is possible to mine more accurate knowledge. In the field of gene analysis, regarding data such as expression pattern ratios, SNPs, diseases, drug effects, blood sugar values, etc., side effects of a drug can be prevented if the following knowledge is obtained: "In case the 1000-th base is A (adenine), the 1100-th base is T (thymine), and the 1550-th base is not G (guanine) in a certain human gene, drugs X and Y have effects, but drug Z has no effect", or "In case a marker gene X is homo and a marker gene Y is not homo, disease Z is more likely to develop." It is also possible to administer drugs or perform treatments with consideration of individual differences depending on human genotypes.
In the analysis of problems where a great number of phenomena are related to each other in a complicated manner, it is necessary to consider not only affirmative items but also negative items, and to mine the correlations among them. However, no method has been proposed so far for efficiently mining such association rules. As for association rules containing negative items, a method has been proposed to mine association rules which have negative items only in the conclusion. However, by this method, it is not possible to find an association rule where affirmative items and negative items are mixed together. It is also possible, in principle, to handle negative items with the conventional method for mining association rules, which considers only affirmative items. However, an enormous number of data combinations must then be checked as candidates for association rules, which is practically impossible under the prior art scheme. According to the conventional method, the candidate itemsets to be retrieved are pruned by utilizing the property of support between itemsets. When only affirmative items are considered, this pruning method is effective. The pruning method is also applicable to itemsets containing negative items. However, it has no effect for itemsets which have both affirmative and negative items. For instance, suppose that an item B is contained in the database and the support of the item B does not satisfy the minimum support condition set by the user. Here, it is assumed that the minimum support is less than 50%. When only affirmative items are considered, every combination containing the item B is pruned and is not contained in the candidate itemsets. However, the support of the negation of B, i.e. the negative item (B=no), is then more than 50% and thus satisfies the minimum support.
Therefore, when negative items are considered, the item B is present as a negative item, so that it is not pruned and is contained in the candidate itemsets. Thus, the number of candidate itemsets to be retrieved is not reduced. As a result, the candidates to be retrieved will be enormous. The method of Reference 3 cannot be used for mining frequent itemsets containing negative items unless the number of types of items in the database to be processed is extremely small. According to the method of Reference 3, only the frequent items are considered, and it is assumed that the number of frequent items is small and that the database can be converted to a small tree structure which can be held in main memory. As described above, when negative items are considered, the number of items to be processed cannot be reduced, and the assumption of Reference 3 does not hold.
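The effect described above can be seen in a small sketch. The item counts, the number of records and the minimum support below are hypothetical; the only fact relied upon is that the support count of a negative item equals the number of records minus the support count of the corresponding affirmative item.

```python
def frequent_single_items(item_counts, n_records, min_support):
    """Return the frequent affirmative items and the frequent negative items."""
    affirmative = sorted(
        i for i, c in item_counts.items() if c / n_records >= min_support
    )
    # A negative item ~i appears in exactly the records where i is absent.
    negative = sorted(
        "~" + i for i, c in item_counts.items()
        if (n_records - c) / n_records >= min_support
    )
    return affirmative, negative

# Hypothetical support counts in a database of 10 records
counts = {"A": 6, "B": 1, "C": 4}
affirmative, negative = frequent_single_items(counts, 10, 0.3)
print(affirmative)  # ['A', 'C']
print(negative)     # ['~A', '~B', '~C']
```

Item B is infrequent and would be pruned when only affirmative items are considered, yet its negation ~B is frequent and therefore remains among the items from which candidate itemsets are built.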
It is a first object of the present invention to provide a method and a system for mining association rules containing negative items.
It is a second object of the present invention to provide a system which decreases the number of candidate itemsets, i.e. combinations of the data to be retrieved.
One of the features of the method for mining association rules according to the present invention is that an association rule is mined from a database comprising data with discrete values, the association rule satisfying a minimum support, a minimum confidence, and a minimum confidence increment designated by a user, the minimum confidence increment being a minimum value of the increase in confidence obtained by adding negative items, said association rule containing at least one negative item and at least one affirmative item in an assumption, and at least one affirmative item in a conclusion.
According to one aspect of the invention, the method for mining at least one association rule based upon data in a database containing attributes comprises: defining each of the attributes with a respective attribute value as an item, defining the database as a plurality of sets of items, and defining each of the sets of items in the database as a record; defining an item appearing in one of the sets as an affirmative item in the set, and defining an item absent from the set as a negative item in the set; and mining the association rule by applying at least one logical AND operation on an assumption of one association rule containing at least one negative item and at least one affirmative item so as to obtain a conclusion of the association rule containing at least one affirmative item.
The above-mentioned method may further comprise the steps of: applying at least one logical AND operation on at least two items to generate an itemset; providing a plurality of sets of itemsets to be included in the database; defining a support count of one item or one itemset as the number of different records of the database in which the item or the itemset appears; defining a support count of one association rule as the support count of the itemset in (the assumption AND the conclusion); calculating a support of the item, the itemset or the association rule by dividing the support count of the item, the itemset or the association rule by the total number of records in the database; calculating a confidence of the association rule by dividing the support count of the itemset in (the assumption AND the conclusion) by the support count of the item or the itemset in the assumption; and for the association rule containing at least one negative item in the assumption, calculating a confidence increment by calculating, as an evaluation value, a confidence of a simplified association rule, which is obtained by omitting the negative item from the assumption of the association rule, and dividing the confidence of the association rule by the evaluation value.
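As a minimal sketch of the confidence increment defined above, the following code computes the confidence of a rule with a negative item in the assumption and divides it by the confidence of the simplified rule. The records and item names are hypothetical, and the representation of an assumption as a pair of affirmative and negative item sets is an assumption made for illustration.

```python
def count(records, pos, neg=frozenset()):
    """Records containing every item in pos and no item in neg."""
    return sum(1 for r in records if pos <= r and not (neg & r))

def confidence(records, pos, neg, conclusion):
    """Confidence of the rule (pos AND NOT neg) => conclusion."""
    return count(records, pos | conclusion, neg) / count(records, pos, neg)

def confidence_increment(records, pos, neg, conclusion):
    """Confidence of the rule divided by that of the simplified rule
    obtained by omitting the negative items from the assumption."""
    full = confidence(records, pos, neg, conclusion)
    simplified = confidence(records, pos, frozenset(), conclusion)
    return full / simplified

# Hypothetical manufacturing records: materials A, B, X and an "OK" outcome
records = [
    {"A", "B", "OK"}, {"A", "B", "OK"}, {"A", "B", "OK"}, {"A", "B"},
    {"A", "B", "X"}, {"A", "B", "X"}, {"A", "OK"}, {"B", "X"},
]
pos, neg, conclusion = {"A", "B"}, {"X"}, {"OK"}
print(confidence(records, pos, frozenset(), conclusion))   # 0.5
print(confidence(records, pos, neg, conclusion))           # 0.75
print(confidence_increment(records, pos, neg, conclusion)) # 1.5
```

A confidence increment above 1 indicates that adding the negative item to the assumption raised the confidence, which is what the minimum confidence increment is meant to test.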
According to a first particular aspect of the invention, the method may further comprise the steps of: dividing the database into a plurality of partitions in possible ways; generating a bit vector for each affirmative item in the database based upon whether the item appears in each of the partitions, by assigning 1 to the p-th bit if the item appears in the p-th partition or 0 if the item is absent from the p-th partition; and disregarding a candidate itemset if none of the partitions has 1 in the bit vectors of all items contained in the candidate itemset.
According to a second particular aspect of the invention, the method may further comprise the steps of: dividing the database into a plurality of partitions in possible ways; and disregarding a candidate itemset if none of the partitions contains all items contained in the candidate itemset. According to a third particular aspect of the invention, the method may further comprise a first set of additional steps including: inputting a minimum value of support, a minimum value of confidence, and a minimum value of confidence increment of the association rule, whereby, in the mining step, frequent itemsets, which satisfy the minimum value of support, are used to sort out association rules which consist of frequent itemsets and satisfy the minimum value of confidence as well as the minimum value of confidence increment.
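The bit-vector test of this aspect might be sketched as follows, assuming that each item's bit vector is held as a Python integer whose p-th bit corresponds to the p-th partition. The partitions, item names and function names are hypothetical.

```python
def item_bit_vectors(partitions):
    """Map each item to an integer whose p-th bit is 1 if the item
    appears in at least one record of the p-th partition."""
    vectors = {}
    for p, partition in enumerate(partitions):
        for record in partition:
            for item in record:
                vectors[item] = vectors.get(item, 0) | (1 << p)
    return vectors

def can_disregard(candidate, vectors):
    """True if no partition has bit 1 for all items of the candidate."""
    combined = -1  # all bits set
    for item in candidate:
        combined &= vectors.get(item, 0)
    return combined == 0

# Hypothetical database split into two partitions
partitions = [
    [{"A", "B"}, {"A"}],   # partition 0
    [{"C"}, {"B", "C"}],   # partition 1
]
vectors = item_bit_vectors(partitions)
print(can_disregard({"A", "C"}, vectors))  # True: A and C never share a partition
print(can_disregard({"B", "C"}, vectors))  # False: both appear in partition 1
```

The test is only a filter: items that share a partition do not necessarily share a record, so a candidate that passes still needs its support count to be counted.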
Alternatively, the method may further comprise a second set of additional steps including: inputting a minimum value of support, a minimum value of confidence, and a minimum value of confidence increment of the association rule; providing an itemset X containing at least two affirmative items, to which a logical AND operation is applied in conjunction with a negative item ~a to generate a candidate itemset X~a, so as to provide a plurality of candidate itemsets; calculating an upper bound of confidence of a candidate itemset X~a by estimating a confidence value for each of the association rules which the candidate itemset X~a can generate, by calculating a support count of each of the subset itemsets X' of the frequent itemset X ("Supp(X')"), a support count of an inverted item a of the negative item ~a in X~a ("Supp(a)"), and a support count of X ("Supp(X)"), selecting the smallest value in {Supp(X')}, and selecting the bigger value in {(the smallest value of Supp(X'))-Supp(a), Supp(X)}; sorting out candidate itemsets whose upper bound of confidence is bigger than the minimum value of confidence, thereby compiling the frequent itemsets; and sorting out frequent itemsets, which satisfy the minimum value of support, from the plurality of candidate itemsets, thereby compiling association rules which consist of frequent itemsets.
The method may further comprise a third set of additional steps including: inputting a minimum value of support, a minimum value of confidence, and a minimum value of confidence increment of the association rule; providing an itemset X containing at least two affirmative items, to which a logical AND operation is applied in conjunction with an itemset ~A containing at least two negative items to generate a candidate itemset X~A, so as to provide a plurality of candidate itemsets; calculating an upper bound of confidence of a candidate itemset X~A by estimating the confidence value for each of the association rules which the candidate itemset X~A can generate, by calculating a support count of each of the subset itemsets X' of the frequent itemset X ("Supp(X')"), the sum of support counts of the inverted items in the itemset ~A ("Sum(A)"), and a support count of X ("Supp(X)"), selecting the smallest value in {Supp(X')}, and selecting the bigger value in {(the smallest value of Supp(X'))-Sum(A), Supp(X)}; sorting out candidate itemsets whose upper bound of confidence is bigger than the minimum value of confidence, thereby compiling the frequent itemsets; and sorting out frequent itemsets, which satisfy the minimum value of support, from the plurality of candidate itemsets, thereby compiling association rules which consist of frequent itemsets.
Systems or apparatus for implementing the afore-mentioned methods are intended to be included in the invention.
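The upper bound of confidence used in these steps can be sketched as below, following Equation 1 and Equation 2 of this description. The sketch assumes that the subsets X' range over the non-empty proper subsets of X, which is an interpretation rather than something stated explicitly; the support counts and function names are hypothetical.

```python
from itertools import combinations

def min_subset_support(X, supp):
    """Smallest support count over the non-empty proper subsets X' of X
    (assumed reading of "subset itemsets X'"; X needs at least two items)."""
    items = sorted(X)
    return min(
        supp[frozenset(c)]
        for r in range(1, len(items))
        for c in combinations(items, r)
    )

def max_conf(X, negated, supp):
    """Upper bound of confidence for candidate X~a (one negated item)
    or X~A (several negated items, using the sum of their supports)."""
    neg_sum = sum(supp[frozenset([a])] for a in negated)
    sx = supp[frozenset(X)]
    return sx / max(min_subset_support(X, supp) - neg_sum, sx)

def theta(X, supp, min_conf):
    """Threshold of Equation 2: a candidate X~a is deleted when the
    support count of a is less than this value."""
    return min_subset_support(X, supp) - supp[frozenset(X)] / min_conf

# Hypothetical support counts
supp = {
    frozenset("A"): 50, frozenset("B"): 40, frozenset("AB"): 10,
    frozenset("C"): 5, frozenset("D"): 35,
}
X = {"A", "B"}
print(round(max_conf(X, ["C"], supp), 3))  # 0.286, below a minimum confidence of 0.5
print(max_conf(X, ["D"], supp))            # 1.0, candidate AB~D is kept
print(theta(X, supp, 0.5))                 # 20.0
```

In this sample the two tests agree: the support count of C (5) is below the threshold 20, and the upper bound for AB~C is below the minimum confidence, so the candidate AB~C would be deleted by either test.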
Another feature of the present method comprises the following steps:
(1) a step for calculating a support count of the itemset X~Y, which combines an itemset X and an itemset ~Y, the itemsets X and Y containing only affirmative items and the itemset ~Y containing only negative items, as follows:

$$\mathrm{Supp}(x_1 \cap \dots \cap x_m \cap {\sim}y_1 \cap \dots \cap {\sim}y_n) = \mathrm{Supp}(X{\sim}Y) = \mathrm{Supp}(X) + \sum_{i=1}^{n} \sum_{Z \subset Y,\ |Z| = i} \left\{ (-1)^{i} \times \mathrm{Supp}(XZ) \right\} \qquad \text{(Equation 8)}$$
where a set of attribute values is defined as a record, and a database is defined as a set of records, which also serves as a list of items. An item appearing in a record is defined as an affirmative item in the record, and an item not appearing in the record is defined as a negative item in the record. A product set of a plurality of items is defined as an itemset. A support count of one item or one itemset is the number of different records of the database in which the item or the itemset appears. A support count of one association rule is the support count of the itemset in (X~Y). A support of the item, the itemset or the association rule is obtained by dividing the support count of the item, the itemset or the association rule by the total number of records in the database.
(2) a step for deriving, from the above itemset X~Y, an association rule (X-A)~Y ⇒ A containing at least one affirmative item and at least one negative item in an assumption, and at least one affirmative item in a conclusion, and further, an association rule (X-A)(~Y-~B) ⇒ A~B containing negative items also in the conclusion;
(3) a step for preparing a candidate itemset X~a with support count unknown by combining a negative item ~a with a frequent itemset X comprising only affirmative items, where an itemset satisfying a minimum support is defined as a frequent itemset, and for calculating a condition where the upper bound of confidence of the association rule derivable from the candidate itemset and containing the negative item ~a in the assumption does not satisfy minimum confidence, using

$$\mathrm{MaxConf} = \frac{\mathrm{Supp}(X)}{\max\left\{ \min\{\mathrm{Supp}(X') \mid X' \subset X\} - \mathrm{Supp}(a),\ \mathrm{Supp}(X) \right\}} \qquad \text{(Equation 1)}$$
where
and for deleting a candidate itemset where the calculated value does not satisfy minimum confidence; or
for calculating a condition on the support count of an item "a" constituting the negative item ~a in the association rule containing the negative item ~a in the assumption, using:

$$\theta(a) = \min\{\mathrm{Supp}(X') \mid X' \subset X\} - \frac{\mathrm{Supp}(X)}{\text{(Minimum confidence)}} \qquad \text{(Equation 2)}$$
where
and for deleting a candidate itemset where the support count of the item "a" is less than the calculated value;
(4) a step for preparing, from among frequent itemsets each comprising "n" negative items and "k" affirmative items and itemsets deleted by the values calculated using equations 3 or 4, where n is an integer larger than 0 and k is an integer larger than 1, a product set of two itemsets having all affirmative items and (n-1) negative items in common, thereby preparing an itemset containing "k" affirmative items and (n+1) negative items, and for setting the itemset as a candidate itemset if all subsets prepared from the itemset are contained in the frequent itemsets or in the itemsets deleted by the values calculated using equations 3 or 4; and
(5) a step for preparing a plurality of partitions by dividing the database, for preparing a bit string where an affirmative item contained in the records of the partition is set to 1 for each partition, for finding a combination of items where bit of the same partition is not turned to 1 in the bit string, and for deleting the candidate itemset containing said combination of items.
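As an illustration of step (1), the following sketch evaluates Equation 8 by inclusion and exclusion over the subsets Z of Y and checks the result against a direct count. It assumes that Z ranges over all non-empty subsets of Y, including Y itself, since the outer sum runs up to i = n; the records and function names are hypothetical.

```python
from itertools import combinations

def supp(records, itemset):
    """Support count of an itemset of affirmative items."""
    return sum(1 for r in records if itemset <= r)

def supp_with_negation(records, X, Y):
    """Support count of X~Y computed from affirmative-only support counts,
    i.e. Supp(X) plus the signed sum of Supp(XZ) over non-empty Z in Y."""
    total = supp(records, X)
    ys = sorted(Y)
    for i in range(1, len(ys) + 1):
        for Z in combinations(ys, i):
            total += (-1) ** i * supp(records, X | set(Z))
    return total

def supp_direct(records, X, Y):
    """Direct count: records containing all of X and none of Y."""
    return sum(1 for r in records if X <= r and not (Y & r))

# Hypothetical records
records = [
    {"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"},
    {"a"}, {"b", "d"}, {"a", "b", "d"},
]
X, Y = {"a"}, {"c", "d"}
print(supp_with_negation(records, X, Y))  # 2  (= 5 - 2 - 2 + 1)
print(supp_direct(records, X, Y))         # 2
```

The point of the formula is that the support count of an itemset with negative items can be obtained from support counts of affirmative-only itemsets, so the negative items themselves never need to be counted against the database.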
Other and further objects, features and advantages of the invention will appear more fully from the following description.