In recent years, studies on data mining, which discovers useful or interesting patterns as knowledge from massive amounts of data, have been increasing. What counts as useful varies from one person to another and is thus difficult to define; in general, however, knowledge that explains many cases is considered to be useful (see Non-Patent Reference 6, for example). Ever since the Apriori algorithm, whereby frequent item sets are enumerated from data including plural item sets, was proposed in 1994 (see Non-Patent Reference 1, for example), frequent pattern enumeration algorithms have been proposed for various kinds of data structures. Recently, high-speed methods of enumerating frequent substructure patterns that appear in complex structures such as graphs have been proposed (see Non-Patent Reference 9, for example).
FIGS. 14 to 16 are diagrams for explaining one example of a method of enumerating frequent item sets using the Apriori algorithm. By using the Apriori algorithm, data combinations frequently appearing in plural data sets can be extracted at high speed, for example.
Consideration is given to the case where the data combinations which appear at least twice are to be extracted from four data sets, which are {R, Y, P}, {B, Y, G}, {R, B, Y, G}, and {B, G}, as shown in FIG. 14. These data sets include five kinds of data pieces, which are R, B, Y, P, and G. Thus, as the data combinations, there are: five kinds of data combinations each including one piece of data (=5C1); ten kinds of data combinations each including two pieces of data (=5C2); ten kinds of data combinations each including three pieces of data (=5C3); five kinds of data combinations each including four pieces of data (=5C4); and one kind of data combination including five pieces of data (=5C5). In total, there are 31 (=2^5−1) kinds of data combinations.
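The count of candidate data combinations described above can be confirmed with a short sketch (a hypothetical illustration in Python; the variable names are ours and are not part of any reference):

```python
from itertools import combinations

# The four data sets of FIG. 14.
datasets = [{"R", "Y", "P"}, {"B", "Y", "G"}, {"R", "B", "Y", "G"}, {"B", "G"}]

# The five kinds of data pieces that occur across the data sets.
items = sorted(set().union(*datasets))  # ['B', 'G', 'P', 'R', 'Y']

# Enumerate every non-empty data combination: 5C1 + 5C2 + ... + 5C5 = 31.
all_combinations = [c for k in range(1, len(items) + 1)
                    for c in combinations(items, k)]
print(len(all_combinations))  # 31
```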
FIG. 15 is a diagram showing a search tree in which a vertex corresponds to a data combination. A vertex label shown in this diagram denotes the data combination as well as the number of data sets that include the present combination. For example, there are two data sets in which the data combination {R, Y} appears (namely, {R, Y, P} and {R, B, Y, G}). Thus, “RY2” is described as the vertex label. In the diagram, the nearer a vertex is to the root, the greater the number of data sets including its combination; the nearer a vertex is to the leaves, the fewer the number of such data sets. Regarding the vertices connected with edges, the number of data pieces included in the data combination of a child vertex is larger by one than the number of data pieces included in the data combination of its parent vertex. In the case where a search is performed in the search tree according to an exhaustive search algorithm, the number of appearances needs to be calculated for each of the 31 data combinations.
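A vertex label such as “RY2” can be checked directly against the data sets (a hypothetical sketch; the data values follow FIG. 14 and the variable names are ours):

```python
# Verify the vertex label "RY2" of FIG. 15: the combination {R, Y}
# appears in exactly two of the four data sets of FIG. 14.
datasets = [{"R", "Y", "P"}, {"B", "Y", "G"}, {"R", "B", "Y", "G"}, {"B", "G"}]
combo = {"R", "Y"}

# Count the data sets that include every data piece of the combination.
count = sum(1 for d in datasets if combo <= d)
print(count)  # 2
```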
FIG. 16 is a diagram for explaining a method of extracting a data combination which appears at least twice, according to the Apriori algorithm. First, the above-mentioned numbers of appearances are calculated for the combinations each including only one piece of data (namely, {R}, {B}, {Y}, {P}, and {G}). The results are twice, three times, three times, once, and three times, respectively. Since the data combination {P} appears only once, every other data combination including the data combination {P} necessarily appears less than twice. On account of this, the search does not need to be performed for the other data combinations including the data combination {P} (i.e., for descendant vertices of the vertex with the label P1 in the search tree). Accordingly, the calculation of the numbers of appearances for these combinations is terminated. Similarly, out of the data combinations each including two pieces of data, the data combinations {R, B} and {R, G} appear only once. Therefore, the calculation of the numbers of appearances for the other data combinations including these data combinations is terminated as well. Thus, the data combinations which appear at least twice can be obtained at high speed. As described so far, according to the Apriori algorithm, the search is terminated for any branch that cannot yield a frequent pattern, and therefore a search for frequent patterns can be made at high speed.
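The level-wise pruning described above can be sketched as follows (a minimal illustration in Python applied to the data sets of FIG. 14 with a minimum number of appearances of 2; the variable and function names are our own assumptions, not part of the Apriori algorithm as published):

```python
from itertools import combinations

# The four data sets of FIG. 14, and the minimum number of appearances.
datasets = [{"R", "Y", "P"}, {"B", "Y", "G"}, {"R", "B", "Y", "G"}, {"B", "G"}]
min_support = 2

def support(combo):
    """Number of data sets that include every data piece of the combination."""
    return sum(1 for d in datasets if combo <= d)

items = sorted(set().union(*datasets))
frequent = []  # all data combinations appearing at least min_support times
level = [frozenset([i]) for i in items]
while level:
    # Calculate the number of appearances only for this level's candidates.
    survivors = [c for c in level if support(c) >= min_support]
    frequent.extend(survivors)
    frequent_set = set(frequent)
    # Generate (k+1)-piece candidates whose k-piece subsets are all frequent;
    # any candidate including an infrequent combination such as {P} is
    # never generated, so its subtree is never searched (pruning).
    next_level = set()
    for a in survivors:
        for b in survivors:
            cand = a | b
            if len(cand) == len(a) + 1 and all(
                    frozenset(s) in frequent_set
                    for s in combinations(cand, len(a))):
                next_level.add(cand)
    level = sorted(next_level, key=sorted)

print(sorted(sorted(c) for c in frequent))
```

For the four data sets above, this yields nine frequent combinations, including {B, G, Y}, while no superset of {P}, {R, B}, or {R, G} is ever examined.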
Targets of graph mining have mainly been graphs which do not change over time.

Non-Patent Reference 1: R. Agrawal & R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of Very Large Data Bases, pp. 487-499, 1994.
Non-Patent Reference 2: A. Inokuchi et al., An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data, Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery, pp. 13-23, 2000.
Non-Patent Reference 3: A. Inokuchi, T. Washio, Y. Nishimura & H. Motoda, A Fast Algorithm for Mining Frequent Connected Subgraphs, IBM Research Report, RT0448, February 2002.
Non-Patent Reference 4: M. Kuramochi & G. Karypis, Frequent Subgraph Discovery, Proceedings of International Conference on Data Mining, pp. 313-320, 2001.
Non-Patent Reference 5: M. Kuramochi & G. Karypis, Finding Frequent Patterns in a Large Sparse Graph, Proceedings of SIAM Data Mining, 2004.
Non-Patent Reference 6: H. Motoda, Fascinated by Explicit Understanding, Journal of the Japanese Society for Artificial Intelligence, pp. 615-625, 1999.
Non-Patent Reference 7: S. Nijssen & J. Kok, A Quickstart in Frequent Structure Mining Can Make a Difference, Proceedings of International Conference on Knowledge Discovery and Data Mining, pp. 647-652, 2004.
Non-Patent Reference 8: J. Pei et al., PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, Proceedings of International Conference on Data Engineering, pp. 215-224, 2001.
Non-Patent Reference 9: T. Washio & H. Motoda, State of the Art of Graph-based Data Mining, SIGKDD Explorations, Vol. 5, No. 1, pp. 59-68, 2003.
Non-Patent Reference 10: X. Yan & J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proceedings of International Conference on Data Mining, pp. 721-724, 2002.