Field of the Invention
The present invention relates to a pattern extraction apparatus suitable for extracting a frequent pattern from time-series data, and a control method for the same.
Description of the Related Art
There is a need for a method of analyzing an enormous amount of data arranged in time series so as to extract useful patterns embedded in the data. For example, with basket analysis, a customer purchasing pattern such as “a customer who purchased Product A and then purchased Product B will subsequently purchase Product C” can be identified from POS data and customer information. This pattern can be utilized for creating a product sales strategy. Likewise, a typical file operation pattern of a given user can be identified from a file operation log at an office, and this can be utilized for recommending file operations, for example.
Sequential pattern mining is known as a mining technique for time-series data. Exemplary methods of sequential pattern mining are described in: Japanese Patent No. 3373716; R. Agrawal, R. Srikant, “Mining Sequential Patterns: Generalizations and Performance Improvements”, in Proceedings of the International Conference on Extending Database Technology, 1996; and J. Pei, J. Han, A. Behzad, H. Pinto, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth”, in Proceedings of the International Conference on Data Engineering, 2001. These conventional methods take as input a database comprising items and time stamps (times) or identifiers indicating the order of occurrence, and extract time-series patterns whose support is greater than or equal to a minimum value (minimum support) set by a user in advance. The support of a given time-series pattern is the proportion of data containing that time-series pattern in the entire database. A time-series pattern having a support greater than or equal to the minimum support is called a frequent time-series pattern. For the extraction of frequent time-series patterns, many methods have been proposed that repeat two steps: the creation of time-series patterns serving as candidates (candidate time-series patterns), and the counting, by database scanning, of the frequency with which the candidate time-series patterns appear in the database. Such methods are called apriori-based methods. These conventional techniques extract time-series patterns in which the order of occurrence of the data in the database is directly captured.
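The support definition above can be illustrated with a minimal sketch. It assumes, as a simplification not taken from the cited methods, that each piece of data is a list of single items already sorted by time, and that a pattern is "contained" when its items appear in the same order (not necessarily contiguously). The function names and the sample database are hypothetical.

```python
from typing import Hashable, Sequence


def contains_pattern(sequence: Sequence[Hashable], pattern: Sequence[Hashable]) -> bool:
    """Return True if `pattern` occurs in `sequence` as an order-preserving subsequence."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later pattern items are searched only after earlier ones.
    return all(item in remaining for item in pattern)


def support(database: Sequence[Sequence[Hashable]], pattern: Sequence[Hashable]) -> float:
    """Proportion of sequences in `database` that contain `pattern`."""
    if not database:
        return 0.0
    hits = sum(1 for seq in database if contains_pattern(seq, pattern))
    return hits / len(database)


def is_frequent(database, pattern, min_support: float) -> bool:
    """A pattern is frequent when its support reaches the user-set minimum support."""
    return support(database, pattern) >= min_support


# Hypothetical purchase histories, one list per customer, ordered by time.
database = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "D", "B", "C"],
    ["C", "A"],
]

print(support(database, ["A", "B", "C"]))
print(is_frequent(database, ["A", "B", "C"], min_support=0.5))
```

In this sample, the pattern A, B, C is contained in the first and third histories only, so its support is 2 out of 4.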
However, the time-series patterns contained in actual data include not only fully ordered time-series patterns, in which the order of occurrence is directly captured, but also many time-series patterns containing a partially ordered relation, in which some items have no order between them. Further, in sequential pattern mining, only a plurality of pieces of time-series data can be subjected to analysis. That is, in the above-described example of basket analysis, a characteristic pattern observed for some of a plurality of persons can be extracted from the purchase data of these persons, but a characteristic pattern appearing several times in the purchase data of a single person cannot be extracted. In that case, the purchase data of the single person needs to be divided into a plurality of data pieces in some way for analysis.
In view of this problem of sequential pattern mining, a technique called episode mining has been proposed. In episode mining, the type of data is called an event, and an event sequence in which events are arranged in order of their times of occurrence serves as an input. The goal of episode mining is to extract a frequent partial event sequence, which is called an episode, from this event sequence. Episodes can be roughly classified into a serial episode, in which the order of events is fully decided, a parallel episode, in which there is no order between events, and a general episode, which is a combination of the serial episode and the parallel episode. In the case of an episode containing events A, B, and C, the parallel episode can be denoted as (A, B, C), the serial episode can be denoted as A→B→C, and the general episode can be denoted as (A, B)→C, for example. This episode mining technique was proposed by H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovery of frequent episodes in event sequences”, Data Mining and Knowledge Discovery, 1(3): 259-289, 1997. Many other episode mining techniques have since been proposed. Many of the proposed techniques, however, can extract only serial episodes or parallel episodes. General episodes form a broader, general-purpose class that includes serial episodes and parallel episodes, and thus there is a need for methods for extracting such general episodes as practically useful patterns.
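The three episode types can be illustrated with a minimal sketch. It assumes, for illustration only, that an episode is represented as a set of distinct events plus a set of (before, after) pairs, and that an occurrence requires strictly increasing times along each pair. The `occurs` function and the sample window are hypothetical and use brute force, so they suit small examples only.

```python
from itertools import product


def occurs(events, order, window):
    """Return True if the episode (events, order) occurs in `window`.

    events: set of distinct event types in the episode.
    order:  set of (before, after) pairs; empty for a parallel episode.
    window: list of (event, time) tuples from the input event sequence.
    """
    names = sorted(events)
    # Candidate occurrence times for each event type of the episode.
    times = [[t for ev, t in window if ev == name] for name in names]
    # Try every assignment of one time per event and test the order pairs.
    for choice in product(*times):
        at = dict(zip(names, choice))
        if all(at[a] < at[b] for a, b in order):
            return True
    return False


abc = {"A", "B", "C"}
parallel = (abc, set())                      # (A, B, C)
serial = (abc, {("A", "B"), ("B", "C")})     # A -> B -> C
general = (abc, {("A", "C"), ("B", "C")})    # (A, B) -> C

# Hypothetical window in which B happens before A, and C happens last.
window = [("B", 1), ("A", 2), ("C", 3)]

print(occurs(*parallel, window))  # all three events are present
print(occurs(*serial, window))    # fails: A does not precede B
print(occurs(*general, window))   # holds: A and B both precede C
```

The window contains the parallel and general episodes but not the serial one, which shows how a general episode captures a partially ordered relation that a serial episode would miss.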
One method for extracting the above-described general episodes is described by Avinash Achar, Srivatsan Laxman, Raajay Viswanathan, P. S. Sastry, “Discovering injective episodes with general partial orders”, Data Mining and Knowledge Discovery, Volume 25, Issue 1, pp. 67-108, July 2012. This document proposes an apriori-based method, similar to the above-described techniques of sequential pattern mining. The key point of this method is the creation of general episodes serving as candidates. According to this document, all episode pairs that satisfy certain conditions are fetched from a set of frequent general episodes each having a size of n, and general episodes are created by merging these pairs. Three candidate general episodes each having a size of n+1 are generated for each pair, and finally, those satisfying constraints are output as a set of candidate general episodes each having a size of n+1.
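The merging step can be sketched roughly as follows. This is not the procedure of the cited document, whose exact pairing conditions and constraints are not given here. The sketch assumes that a pair qualifies when the two episodes share all but one event and agree on the order among the shared events, that the three candidates differ only in how the two unshared events are ordered (no order, a before b, b before a), and that the constraint is that the result is a valid strict partial order. All names are hypothetical.

```python
def is_strict_partial_order(order):
    """Check asymmetry and transitivity of a set of (before, after) pairs."""
    asymmetric = all((b, a) not in order for a, b in order)
    transitive = all(
        (a, d) in order for a, b in order for c, d in order if b == c
    )
    return asymmetric and transitive


def merge_pair(ep1, ep2):
    """Merge two size-n episodes into up to three size-(n+1) candidates.

    Each episode is (frozenset of events, frozenset of (before, after) pairs).
    Assumed pairing condition: exactly one unshared event on each side and
    the same order among the shared events.
    """
    events1, order1 = ep1
    events2, order2 = ep2
    only1, only2 = events1 - events2, events2 - events1
    if len(only1) != 1 or len(only2) != 1:
        return []
    (a,), (b,) = only1, only2
    shared = events1 & events2

    def restrict(order):
        return {(x, y) for x, y in order if x in shared and y in shared}

    if restrict(order1) != restrict(order2):
        return []

    base = set(order1) | set(order2)
    candidates = []
    # Three ways to relate the unshared events: unordered, a->b, b->a.
    for extra in (set(), {(a, b)}, {(b, a)}):
        order = base | extra
        if is_strict_partial_order(order):  # assumed constraint
            candidates.append((events1 | events2, frozenset(order)))
    return candidates


ab = (frozenset("AB"), frozenset({("A", "B")}))   # A -> B
ac = (frozenset("AC"), frozenset({("A", "C")}))   # A -> C
ca = (frozenset("AC"), frozenset({("C", "A")}))   # C -> A

print(len(merge_pair(ab, ac)))
print(len(merge_pair(ab, ca)))
```

Under these assumptions, merging A→B with A→C keeps all three candidates, while merging A→B with C→A keeps only the one in which C also precedes B, because the other two are not transitive.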
A major problem of the method described in this document is that, depending on the number of event types, the length of the input event sequence, and the minimum support, the number of potential general episodes increases enormously due to combinatorial explosion, and thus it takes a significant time to perform the frequency calculation by database scanning. For example, when there are ten types of events, the number of potential candidate episodes having a length of 3 is 120 for the parallel episode, 720 for the serial episode, and 2280 for the general episode. For actual data, it is rarely the case that there are only 10 event types, and it is more often the case that there are 100 or more event types. In that case, combinatorial explosion makes it difficult to perform pattern extraction within a realistic time period.
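The figures above can be reproduced with a short calculation. The derivation is not stated in the text, so the following is one reading that matches the numbers: it assumes episodes contain three distinct event types, that a parallel episode is an unordered choice of three types, that a serial episode is an ordered choice, and that a general episode is an unordered choice combined with any strict partial order on the three chosen events. The function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, perm


def count_strict_partial_orders(n: int) -> int:
    """Count strict partial orders on n labeled elements by brute force."""
    pairs = list(permutations(range(n), 2))
    count = 0
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            order = set(subset)
            asymmetric = all((b, a) not in order for a, b in order)
            transitive = all(
                (a, d) in order for a, b in order for c, d in order if b == c
            )
            if asymmetric and transitive:
                count += 1
    return count


def candidate_counts(event_types: int, length: int = 3):
    """Return (parallel, serial, general) candidate counts under the stated assumptions."""
    parallel = comb(event_types, length)
    serial = perm(event_types, length)
    general = parallel * count_strict_partial_orders(length)
    return parallel, serial, general


print(candidate_counts(10))
print(candidate_counts(100))
```

Under the same assumptions, 100 event types give 161,700 parallel, 970,200 serial, and 3,072,300 general candidates of length 3, which illustrates how quickly the search space grows.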