In data mining, in which vast amounts of data are analyzed to extract useful information buried therein, an association rule showing an association (linkage) among the data is known. As an example, data mining of supermarket basket data will be considered. A supermarket carries multiple items, and each customer purchases a combination of some of them. The combination of items purchased by a customer is recorded as basket data. When a large amount of basket data is analyzed, it is desirable to extract significant itemsets, that is, patterns which appear in the purchases of many customers. Such a pattern is referred to as a “frequent itemset” (large itemset). If an association rule such as “a customer who has simultaneously purchased an item A and an item B also often simultaneously purchases an item C and an item D” is extracted, it is found that sales of the items C and D are related to sales of the items A and B, which can help in making sales policies such as arrangement of the items, selection of bargain goods and pricing.
Studies of association rule extraction have been performed in the field of data mining. For example, there are methods described in Patent Document 1, Patent Document 2 and Non-Patent Document 1. In these conventional approaches, a database consisting of a set of records including multiple binary attributes is analyzed. Combinations of attributes whose values are true, and whose support value is equal to or more than a minimum threshold (minimum support) previously set by a user, are extracted from the database. From the extracted combinations, association rules whose confidence value is equal to or more than a minimum threshold (minimum confidence) previously set by the user are derived. In each record, a pair of an attribute and an attribute value is referred to as an “item”. The support value is the ratio of records including a combination of items to all records in the database. A combination of items whose support value is equal to or more than the minimum threshold, extracted by these methods, is referred to as a “frequent itemset”. An association rule is derived from subsets of the items included in a frequent itemset. In these conventional arts, the analysis object is an ideal database not including missing values, and a database including missing values is not considered.
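The support and confidence thresholds described above can be illustrated with a minimal sketch. It is not the method of any of the cited documents: the basket data, the function names and the brute-force enumeration are assumptions made only for illustration, and the confidence value is computed by its customary definition (support of the whole rule divided by support of its antecedent), which the text above does not spell out.

```python
from itertools import combinations

# Hypothetical basket data: each record is the set of items bought together.
baskets = [
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"B", "D"},
    {"A", "C"},
]


def support(itemset, records):
    """Ratio of records that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(antecedent, consequent, records):
    """Customary definition: support(antecedent + consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, records) / support(antecedent, records)


def frequent_itemsets(records, min_support):
    """Brute-force enumeration of all itemsets (illustration only, not efficient)."""
    items = sorted(set().union(*records))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            value = support(combo, records)
            if value >= min_support:
                result[frozenset(combo)] = value
    return result


frequent = frequent_itemsets(baskets, min_support=0.6)
rule_confidence = confidence({"A", "B"}, {"C", "D"}, baskets)
```

With this sample data, the itemset {A, B} appears in three of five baskets and {A, B, C, D} in two of five, so the rule “A and B implies C and D” has a confidence of two thirds.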
However, missing values may exist in the database to be analyzed. For example, in the case of gene analysis data in the medical field, there are loci at which a genotype cannot be analyzed, depending on the state of the specimen, the gene sequences around the locus to be analyzed, and the state of the analysis device. The loci at which the genotype cannot be analyzed differ from patient to patient, and loci at which the genotype can be analyzed and loci at which it cannot are mixed in each patient. Useful information can be obtained by analyzing, for multiple patients, the gene data and case data at the loci where the genotype has been successfully analyzed. By extracting association rules from the gene data and the case data as analysis objects, it is possible to know, for example, the relationship between a gene and a drug effect. For example, if an association rule such as “a patient with a genotype Y at an X-th locus of a gene A develops an allergic reaction to a drug C” is extracted, examining the type of the X-th locus of the gene A of a patient can help in determining whether or not to prescribe the drug C, and medication appropriate for each patient can be provided. If the conventional art is applied to such data including missing values, the support value of an itemset becomes an incorrect value, and a correct association rule cannot be extracted.
Another example will be shown. In the case of supermarket basket data, individual stores may sell different items. A trend in the sales of items within a controlled area can be known by analyzing the basket data of multiple stores within that area. In order to examine the relevance between the item A and the item B, only the basket data of stores which sell both the item A and the item B should be used. If the basket data of a store which does not sell the item A or the item B is used in the analysis, an incorrect result is obtained.
A method of extracting association rules from a database including missing values is described in Non-Patent Document 2. In the method of Non-Patent Document 2, association rules which are equal to or more than a minimum threshold of the support value and a minimum threshold of a representativity, both previously set by the user, are extracted from a database in a so-called tabular form of rows and columns, including multiple records having multiple discrete-value attributes. Here, a pair of an attribute and an attribute value is referred to as an “item”, and a combination of items is referred to as an “itemset”. The number of records in the database in which an itemset appears is referred to as the “support count”. The ratio of records including the combination of items to the records in which none of the attributes constituting the items is a missing value is referred to as the “support value”. The ratio of the number of records in which none of the attributes included in an association rule is a missing value to the number of all records in the database is referred to as the “representativity”.
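The definitions above can be made concrete with a small worked sketch. The record layout (one dictionary per record), the attribute names `locus1` and `allergy`, the use of `None` as the missing-value marker and all function names are assumptions introduced here for illustration; they do not come from Non-Patent Document 2.

```python
MISSING = None  # assumed marker for a missing attribute value

# Hypothetical records: "locus1" could not be analyzed for two patients.
records = [
    {"locus1": "Y", "allergy": "yes"},
    {"locus1": "Y", "allergy": "yes"},
    {"locus1": MISSING, "allergy": "yes"},
    {"locus1": "Z", "allergy": "no"},
    {"locus1": MISSING, "allergy": "no"},
    {"locus1": "Y", "allergy": "no"},
]

# An itemset is a list of (attribute, attribute value) pairs.
itemset = [("locus1", "Y"), ("allergy", "yes")]


def support_count(itemset, records):
    """Number of records in which every item of the itemset appears."""
    return sum(all(r.get(a) == v for a, v in itemset) for r in records)


def valid_count(itemset, records):
    """Number of records in which none of the itemset's attributes is missing."""
    return sum(all(r.get(a) is not MISSING for a, _ in itemset) for r in records)


def support_value(itemset, records):
    """Support count divided by the number of records without missing values."""
    valid = valid_count(itemset, records)
    return support_count(itemset, records) / valid if valid else 0.0


def representativity(itemset, records):
    """Share of all records in which the itemset's attributes are not missing."""
    return valid_count(itemset, records) / len(records)


def naive_support(itemset, records):
    """Support computed as if missing values did not exist (for comparison)."""
    return support_count(itemset, records) / len(records)
```

For this sample, the itemset appears in two records and four records have no missing value in either attribute, so the support value is 2/4 and the representativity is 4/6. Dividing by all six records instead would give 2/6, which understates the support.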
A procedure for extracting association rules in the method of Non-Patent Document 2 will be described. At the first step, the records in the database are scanned, and for each item, the number of records in which the item appears is counted and the IDs of records in which the attribute constituting the item is a missing value are obtained. The number of records in which one item X appears is referred to as the “support count”, and the list of IDs of records in which the attribute constituting the item X is a missing value is referred to as the “missing record list”. When the counting has been completed for all records, the support value of each item is calculated, and the items whose support value is equal to or more than the minimum threshold are extracted. An item whose support value is equal to or more than the minimum threshold is referred to as a “frequent item”. Here, the support value of one item X is obtained by dividing the support count of the item X by the value obtained by subtracting the number of IDs in the missing record list of the item X from the number of records in the entire database. At the next step, two frequent items are combined to generate an itemset consisting of two items. An itemset whose support count is not yet known is referred to as a “potential itemset”. For each potential itemset, the union of the IDs in the missing record lists of the items constituting the potential itemset is the missing record list of that potential itemset. The records in the database are scanned again, and the support count is counted for each potential itemset. When the counting has been completed for all records, the support value of each potential itemset is calculated, and the potential itemsets whose support value is equal to or more than the minimum threshold are extracted.
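The first two steps described above can be sketched as follows. This is only an illustrative reading of the description, not the implementation of Non-Patent Document 2: the record layout, the `None` missing-value marker, the function names and the sample data are hypothetical, and skipping pairs of items on the same attribute is an assumption added here because such pairs can never appear in one record.

```python
from itertools import combinations

MISSING = None  # assumed marker for a missing attribute value

# Hypothetical records; the record ID is the position in the list.
records = [
    {"locus1": "Y", "allergy": "yes"},
    {"locus1": "Y", "allergy": "yes"},
    {"locus1": MISSING, "allergy": "yes"},
    {"locus1": "Z", "allergy": "no"},
    {"locus1": MISSING, "allergy": "no"},
    {"locus1": "Y", "allergy": "no"},
]


def first_pass(records):
    """One scan: support count per item, missing record list per attribute."""
    counts, missing = {}, {}
    for rid, rec in enumerate(records):
        for attr, val in rec.items():
            if val is MISSING:
                missing.setdefault(attr, set()).add(rid)
            else:
                counts[(attr, val)] = counts.get((attr, val), 0) + 1
    return counts, missing


def support_value(count, missing_ids, n_records):
    """Support count / (all records - records in the missing record list)."""
    valid = n_records - len(missing_ids)
    return count / valid if valid else 0.0


def frequent_items(records, min_support):
    """Map each frequent item to its missing record list."""
    counts, missing = first_pass(records)
    n = len(records)
    result = {}
    for item, count in counts.items():
        ids = missing.get(item[0], set())
        if support_value(count, ids, n) >= min_support:
            result[item] = ids
    return result


def frequent_pairs(records, min_support):
    """Second step: combine frequent items into two-item potential itemsets."""
    items = frequent_items(records, min_support)
    potential = {}
    for x, y in combinations(sorted(items), 2):
        if x[0] != y[0]:  # assumption: skip two values of the same attribute
            potential[(x, y)] = items[x] | items[y]  # union of missing lists
    counts = dict.fromkeys(potential, 0)
    for rec in records:  # second scan of the database
        for x, y in potential:
            if rec.get(x[0]) == x[1] and rec.get(y[0]) == y[1]:
                counts[(x, y)] += 1
    n = len(records)
    result = {}
    for pair, ids in potential.items():
        value = support_value(counts[pair], ids, n)
        if value >= min_support:
            result[pair] = value
    return result


pairs = frequent_pairs(records, min_support=0.4)
```

With a minimum support of 0.4, the item `locus1 = Z` is dropped after the first scan (support value 1/4), and of the two remaining potential itemsets only `allergy = yes` with `locus1 = Y` survives the second scan (support value 2/4).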
A potential itemset whose support value is equal to or more than the minimum threshold is referred to as a “frequent itemset”. At subsequent steps, for frequent itemsets each consisting of k items, the following steps are repeated: combining frequent itemsets having (k−1) items in common to generate potential itemsets consisting of (k+1) items, obtaining the missing record lists, scanning the records in the database, counting the support count of each potential itemset, calculating the support value, and extracting the frequent itemsets. When all frequent itemsets have been extracted, for each frequent itemset consisting of k items, association rules are generated from sub-itemsets of that frequent itemset.
Patent Document 1: JP Patent Publication (Kokai) No. 8-287106 A (1996)
Patent Document 2: U.S. Pat. No. 5,794,209
Non-Patent Document 1: G. Liu, H. Lu, Y. Xu, J. Yu, “Ascending frequency ordered prefix-tree: efficient mining of frequent itemsets”, in Proceedings of International Conference on Database Systems for Advanced Applications, 2003
Non-Patent Document 2: A. Ragel, B. Cremilleux, “Treatment of missing values for association rules”, in Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1998