1. Field of the Invention
The present invention relates to a technique for extracting only meaningful frequent itemsets from a database in which a plurality of records each containing an itemset consisting of one or more items is stored. In particular, it relates to a technique for efficiently extracting such itemsets with a reasonable number of frequency calculations and with a reasonable amount of memory usage.
2. Description of Related Art
Various data mining techniques for extracting useful knowledge from an enormous amount of accumulated data have been studied. Among them, a technique for detecting a group of items (e.g., products) that frequently occur together in a target plurality of records (e.g., a history of issued receipts) is called frequent pattern mining, and many methods for it have been proposed. Frequent pattern mining defines an itemset that satisfies the condition “frequency of the itemset ≥ predetermined threshold (called the “minimum support”)” as a frequent itemset and extracts a set of frequent itemsets.
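The definition above can be illustrated by a minimal brute-force sketch. It is not the method of any cited reference; the function names `support` and `frequent_itemsets` and the sample receipts are hypothetical and serve only to make the minimum-support condition concrete.

```python
from itertools import combinations

def support(itemset, records):
    """Count the records that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for record in records if target <= record)

def frequent_itemsets(records, min_support):
    """Return {itemset: frequency} for itemsets with frequency >= min_support."""
    records = [frozenset(r) for r in records]
    items = sorted(set().union(*records))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = support(candidate, records)
            if count >= min_support:
                result[frozenset(candidate)] = count
                found = True
        if not found:
            # No frequent itemset of this size exists, so no larger one can:
            # a superset never occurs more often than its subsets.
            break
    return result

# Hypothetical receipts, for illustration only.
receipts = [
    {"toothpaste", "bread", "beer"},
    {"toothpaste", "bread"},
    {"bread", "beer"},
    {"toothpaste", "bread", "beer"},
]
frequent = frequent_itemsets(receipts, min_support=3)
```

With a minimum support of 3, {toothpaste, bread} is frequent in this sample, whereas {toothpaste, beer}, which occurs only twice, is not. Exhaustive enumeration of this kind grows exponentially with the number of items and is shown only for clarity.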
However, a high frequency of occurrence of an itemset does not always mean a strong relationship between its items. For example, an itemset consisting of highly frequent items is highly likely to occur frequently even if the items are not related to each other. When the items are not related to each other, such a frequent itemset is meaningless. In addition, it is known that, for a practical minimum support value provided by a user, an enormous number of frequent itemsets are typically generated.
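The point that unrelated but individually frequent items co-occur often can be checked with simple arithmetic. The sketch below assumes statistical independence between items; the function name `expected_cooccurrence` and the numbers are hypothetical.

```python
def expected_cooccurrence(item_frequencies, num_records):
    """Expected frequency of an itemset if its items occurred independently."""
    expected = float(num_records)
    for frequency in item_frequencies:
        # Multiply by the fraction of records that contain this item.
        expected = expected * frequency / num_records
    return expected

# Hypothetical figures: two unrelated items occurring in 6,000 and 5,000
# of 10,000 records, respectively.
expected = expected_cooccurrence([6000, 5000], num_records=10000)
```

Under these assumed figures, the pair is expected to occur in roughly 3,000 records, which would clear many practical minimum support values even though the items are unrelated.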
There exists a traditional technique that introduces and defines the idea of a closed frequent itemset and extracts closed frequent itemsets that satisfy that definition (see M. Boley et al., “Efficient Discovery of Interesting Patterns Based on Strong Closedness,” Statistical Analysis and Data Mining, Volume 2, Issue 5-6, Pages 346-360, December 2009). Here, an itemset Y is closed frequent if the condition that “Y frequently occurs and, for any Y⊂Y′ and Y≠Y′, frequency of Y>frequency of Y′” is satisfied. There also exists a traditional technique in which the above definition of a closed frequent itemset is extended such that an itemset is extracted if it satisfies, for a given threshold value δ>0, the condition that “Y frequently occurs and, for any Y⊂Y′ and Y≠Y′, (frequency of Y×δ)>frequency of Y′” (see J. Cheng et al., “Sigma-tolerance closed frequent itemsets,” ICDM, Proceedings of the Sixth International Conference on Data Mining, pages 139-148, 2006).
When closed frequent itemsets are extracted, a subset of an itemset that has the same frequency of occurrence as the whole itemset is not redundantly extracted as a frequent itemset. For example, consider a case where, in POS data of a supermarket, an itemset Y1={toothpaste, bread} occurs 500 times and an itemset Y2={toothpaste, bread, beer} also occurs 500 times. In this case, only the itemset Y2 is extracted as a closed frequent itemset. However, closedness does not take into account whether the items in the itemset are highly related to each other. Thus, even with the idea of a closed frequent itemset, only meaningful frequent itemsets cannot be extracted.
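The closedness condition and its δ-extended form, as stated in this description, can be sketched as follows using the supermarket example. The function name `is_closed_frequent`, the minimum support of 100, the frequency of 520 for {bread}, and the value δ=0.9 are hypothetical, and the sketch assumes that the frequency table lists every superset that needs to be compared.

```python
def is_closed_frequent(itemset, frequencies, min_support, delta=1.0):
    """Check that Y is frequent and that frequency(Y) * delta > frequency(Y')
    for every proper superset Y' listed in `frequencies`.

    delta=1.0 gives the plain closedness condition.
    """
    y = frozenset(itemset)
    freq_y = frequencies.get(y, 0)
    if freq_y < min_support:
        return False
    return all(
        freq_y * delta > freq_super
        for superset, freq_super in frequencies.items()
        if y < superset  # proper superset test on frozensets
    )

Y1 = frozenset({"toothpaste", "bread"})
Y2 = frozenset({"toothpaste", "bread", "beer"})
Y3 = frozenset({"bread"})  # hypothetical additional itemset

# Frequencies of Y1 and Y2 are from the example; the value for Y3 is assumed.
frequencies = {Y1: 500, Y2: 500, Y3: 520}

y1_closed = is_closed_frequent(Y1, frequencies, min_support=100)
y2_closed = is_closed_frequent(Y2, frequencies, min_support=100)
y3_closed = is_closed_frequent(Y3, frequencies, min_support=100)
y3_delta = is_closed_frequent(Y3, frequencies, min_support=100, delta=0.9)
```

In this sketch Y1 is rejected because its superset Y2 has the same frequency, while Y2 is accepted. The assumed itemset {bread} satisfies the plain condition but not the extended one with δ=0.9, because 520×0.9 does not exceed 500. Neither check examines whether the items of the accepted itemset are related to each other.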
To address this, a technique for extracting an itemset whose items are highly related to each other is necessary. Such traditional techniques are described in Y. Ke et al., “Mining quantitative correlated patterns using an information-theoretic approach,” Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pages 227-236, August 2006 and X. Zhang et al., “Mining Non-Redundant High Order Correlations in Binary Data,” Proceedings of the VLDB Endowment, Volume 1, Issue 1, pages 1178-1188, August 2008. The techniques described in Ke et al. and Zhang et al. aim to extract an itemset having three or more correlated items by an approach based on mutual information and entropy.
However, the techniques proposed in Ke et al. and Zhang et al. are based on pairwise comparison. Thus each of these techniques ensures merely a high correlation between any two items in an extracted itemset.
For example, consider a case where the above-described technique is applied to a call log of a call center and an itemset {operating system A, browser B, abnormal termination} is extracted. In this case, high correlations between operating system A and browser B, between browser B and abnormal termination, and between operating system A and abnormal termination are ensured. However, the condition that “an abnormal termination occurs specifically when browser B, rather than another browser, is used on operating system A, rather than on another operating system” is not ensured. Japanese Unexamined Patent Application Publication No. 8-287106 discloses a correlation rule extraction technique.
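The gap between pairwise correlation and correlation of the whole itemset in the call-center example can be illustrated with a small sketch. It does not reproduce the mutual-information measures of Ke et al. or Zhang et al.; it substitutes lift (observed frequency divided by the frequency expected under independence) as a simple stand-in, and the call log below is hypothetical data constructed for illustration.

```python
from itertools import combinations

def lift(itemset, records):
    """Observed frequency of `itemset` divided by the frequency expected
    if its items occurred independently of each other."""
    n = len(records)
    observed = sum(1 for r in records if itemset <= r)
    expected = float(n)
    for item in itemset:
        expected *= sum(1 for r in records if item in r) / n
    return observed / expected

A, B, T = "operating system A", "browser B", "abnormal termination"

# Hypothetical call log: every pair co-occurs in 30 calls, but no call
# contains all three items; 210 calls concern unrelated matters.
records = (
    [frozenset({A, B})] * 30
    + [frozenset({B, T})] * 30
    + [frozenset({A, T})] * 30
    + [frozenset({"other"})] * 210
)

pair_lifts = {
    pair: lift(frozenset(pair), records) for pair in combinations((A, B, T), 2)
}
triple_lift = lift(frozenset({A, B, T}), records)
```

In this constructed log, each of the three pairs occurs about 2.5 times as often as independence would predict, yet the three items never occur together. A check based only on pairs would therefore accept the itemset even though the combined condition never holds.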