The present invention generally relates to data mining and, more particularly, to identifying partial periodic patterns in an event sequence, wherein patterns that are hierarchical in nature can be represented in the form of a meta-pattern.
Periodicity detection on time series data is a challenging problem of great importance in many real applications. Periodicity is usually represented as repeated occurrences of a list of events in a certain order at some frequency. Because system behavior changes, a pattern may be notable only within a portion of the entire data sequence, and different patterns may present themselves at different places. The evolution among patterns may also follow some regularity. Such regularity, if any, would be of great value in understanding the nature of the system that generated the events and in building a prediction model.
Consider the application of an Internet user profile. The sequence of web pages that a user accesses is often used to construct the user profile. An accurate profile is important in many application domains, including personalization and recommendation systems. During a period of time, a user may access some web sites repetitively. Such behavior may be represented by a periodic pattern that can be put into the user's profile. Moreover, a user's Internet access pattern may change over time. For instance, during a normal business day, one may surf mostly financial web sites when the stock market is open and may switch to sports-oriented web sites for the rest of the day. At a coarser level, we may also find that such a pattern holds during weekdays, whereas a totally different pattern presents itself during weekends.
However, most previous research in this area has focused on mining patterns that take only basic events as their components and may not always recognize the above higher-level pattern due to the presence of random noise. In general, some tolerable noise is usually allowed within a series of pattern repetitions to accommodate a certain degree of imperfection. As a result, two portions (of a data sequence) where a pattern is notable may have a different layout of pattern occurrences. There may not exist any common representation in terms of raw events. For example, two patterns (a, b, *) and (b, c) alternately appear in the sequence shown in FIG. 1. Here, a pattern may be only partially filled, and "*" is used to substitute the "don't care" position(s). The length of each portion where (a, b, *) is notable is 19, and each portion where (b, c) is notable contains 6 symbols. In addition, each gap between notable portions of (a, b, *) and (b, c) consists of 2 positions, while a three-position gap presents itself after each notable portion of (b, c). All of these can be represented by a higher-level pattern of four components ((a, b, *):[1,19], *:[20,21], (b, c):[22,27], *:[28,30]). The numbers in the brackets indicate the offset of the component within the pattern. Let us take a closer look at the two portions where the pattern (a, b, *) is notable: one is from position 1 to 19 and the other is from position 31 to 49. Note that both portions contain some noise that impairs the perfect repetition of (a, b, *). Neither of them can match a single basic pattern format (i.e., (a, b, *, a, b, *, a, b, *, a, b, *, a, b, *, a, b, *)). Since the locations and durations of the noise are different in these two portions, (a, b, *, a, b, *, a, b, *, *, *, *, *, a, b, *, a, b, *) and (a, b, *, a, b, *, *, a, b, *, a, b, *, a, b, *, a, b, *) do not match each other.
In general, the noise could occur anywhere, be of various duration, and even occur multiple times within the portion where a pattern is notable as long as the noise is below some threshold. Even though the allowance of noise plays a positive role in characterizing system behavior in a noisy environment, it prevents such a higher level pattern from being represented in the form of an equivalent basic pattern.
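By way of illustration only, the four-component higher-level pattern discussed above may be written down as a small data structure. The following Python sketch is not part of the described method; the names `Component`, `period`, and `span_in_repetition` are hypothetical, and the code merely checks the offsets quoted in the text (a period of 30 positions, with the second notable portion of (a, b, *) falling at positions 31 to 49).

```python
from dataclasses import dataclass
from typing import Tuple, Union

# A component is either the don't-care symbol "*" or a basic pattern
# given as a tuple of symbols. Offsets are 1-based and inclusive, as in the text.
@dataclass(frozen=True)
class Component:
    pattern: Union[str, Tuple[str, ...]]
    first: int
    last: int

    @property
    def length(self) -> int:
        return self.last - self.first + 1


# ((a, b, *):[1,19], *:[20,21], (b, c):[22,27], *:[28,30])
META_PATTERN = (
    Component(("a", "b", "*"), 1, 19),
    Component("*", 20, 21),
    Component(("b", "c"), 22, 27),
    Component("*", 28, 30),
)


def period(components) -> int:
    """Total number of positions covered by one repetition of the pattern."""
    return components[-1].last


def span_in_repetition(component: Component, pattern_period: int, k: int):
    """Positions occupied by a component in the k-th repetition (k = 0, 1, ...)."""
    shift = k * pattern_period
    return (component.first + shift, component.last + shift)


if __name__ == "__main__":
    p = period(META_PATTERN)
    print(p)                                            # 30
    print([c.length for c in META_PATTERN])             # [19, 2, 6, 3]
    print(span_in_repetition(META_PATTERN[0], p, 1))    # (31, 49)
```

Note that the structure records only where each component is notable, not the exact layout of repetitions and noise inside it, which is why the two differently laid out portions of (a, b, *) can share one representation.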
The present invention addresses the above and other issues by providing pattern mining methods and systems that employ a "meta-pattern" model, which provides a more powerful mechanism for periodicity representation. In contrast to existing periodicity models, each component of a meta-pattern according to the invention is allowed to be either a simple event or a pattern (or lower-level meta-pattern). We refer to those patterns that contain only simple events as their components as "basic patterns."
It is to be appreciated that the recursive nature of a meta-pattern according to the invention not only can provide a more compact representation of complicated patterns but also can capture the regularities of pattern evolutions, which may not be expressible by existing models. In order to accommodate a certain degree of noise, a meta-pattern is said to be "valid" in a symbol sequence if there exists, in the symbol sequence, a list of segments of perfect repetitions of the meta-pattern where the number of repetitions in each segment is at least a prespecified threshold (min_rep) and the distance between any two consecutive segments is at most a prespecified threshold (max_dis).
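The validity condition may be illustrated, for the simple case of a basic pattern, by the following Python sketch. It is an illustration under stated assumptions rather than the algorithm of the invention: the function names are hypothetical, segments are found by a simple left-to-right greedy scan, positions are 0-based with exclusive end offsets, and the "distance" between two consecutive segments is assumed to be the number of positions lying between them. The test sequence is likewise made up, with "x" filling the don't-care position and "z" standing for noise.

```python
def matches_at(seq, pattern, start):
    """True if one occurrence of pattern ('*' = don't care) begins at start."""
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False
    return all(p == "*" or p == s for p, s in zip(pattern, window))


def repetition_segments(seq, pattern, min_rep):
    """Greedy scan for segments of perfect repetitions with >= min_rep repetitions.

    Returns a list of (start, repetitions) pairs.
    """
    if min_rep < 1:
        raise ValueError("min_rep must be at least 1")
    n = len(pattern)
    segments = []
    i = 0
    while i + n <= len(seq):
        reps = 0
        while matches_at(seq, pattern, i + reps * n):
            reps += 1
        if reps >= min_rep:
            segments.append((i, reps))
            i += reps * n          # continue after the segment
        else:
            i += 1                 # too few repetitions here; try the next position
    return segments


def notable_portions(seq, pattern, min_rep, max_dis):
    """Chain consecutive segments whose gap is <= max_dis into (start, end) portions."""
    n = len(pattern)
    portions = []
    for start, reps in repetition_segments(seq, pattern, min_rep):
        end = start + reps * n
        if portions and start - portions[-1][1] <= max_dis:
            portions[-1] = (portions[-1][0], end)
        else:
            portions.append((start, end))
    return portions


if __name__ == "__main__":
    seq = list("abxabxabxzzzzabxabx")   # hypothetical data, 19 symbols
    print(repetition_segments(seq, ("a", "b", "*"), 2))   # [(0, 3), (13, 2)]
    print(notable_portions(seq, ("a", "b", "*"), 2, 4))   # [(0, 19)]
    print(notable_portions(seq, ("a", "b", "*"), 2, 3))   # [(0, 9), (13, 19)]
```

With min_rep = 2 and max_dis = 4, the two segments separated by four noise positions are joined into a single 19-position portion; tightening max_dis to 3 splits them apart.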
However, the flexibility of a meta-pattern may pose challenges in the discovery process, which may not be encountered in mining basic patterns, for instance:
(i) While a basic pattern has two degrees of freedom, namely the period (i.e., the number of components in the pattern) and the choice of symbol/event for each component, a meta-pattern has an additional degree of freedom: the length of each component in the pattern. This arises because a component may occupy multiple positions. This extra degree of freedom would increase the number of potential meta-pattern candidates dramatically.
(ii) Many patterns/meta-patterns may collocate or overlap for any given portion of a sequence. For example, both (a, b, a, *) and (a, *) are valid within the sequence a b a c a b a b a b a a a d a d a b a d a c a d b b a d. As a result, during the meta-pattern mining process, there could be a large number of candidates for each component of a higher-level meta-pattern. This further increases the complexity of mining.
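The overlap noted in item (ii) can be observed directly on the example sequence. The following Python sketch, whose helper names are hypothetical, finds for each of the two patterns the longest run of consecutive perfect repetitions, using 0-based start positions. It does not apply min_rep or max_dis, since no threshold values are given for this example.

```python
SEQUENCE = "a b a c a b a b a b a a a d a d a b a d a c a d b b a d".split()


def repetitions_from(seq, pattern, start):
    """Count consecutive perfect repetitions of pattern beginning at start."""
    n = len(pattern)
    reps = 0
    while True:
        lo = start + reps * n
        window = seq[lo:lo + n]
        mismatch = any(p != "*" and p != s for p, s in zip(pattern, window))
        if len(window) < n or mismatch:
            return reps
        reps += 1


def longest_run(seq, pattern):
    """Return (start, repetitions) of the earliest longest run of the pattern."""
    best = (0, 0)
    for start in range(len(seq)):
        reps = repetitions_from(seq, pattern, start)
        if reps > best[1]:
            best = (start, reps)
    return best


if __name__ == "__main__":
    print(longest_run(SEQUENCE, ("a", "b", "a", "*")))   # (0, 3)
    print(longest_run(SEQUENCE, ("a", "*")))             # (0, 12)
```

Both runs begin at the first symbol, so the first twelve positions of the sequence are covered by three repetitions of (a, b, a, *) and, at the same time, by six of the twelve repetitions of (a, *).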
Therefore, identifying the "proper" candidate meta-patterns is crucial to the overall efficiency of the mining process. To address this issue, the present invention employs a "component property," in addition to the traditionally used "a priori property," to prune the search space. This is inspired by the observation that a pattern may participate in a meta-pattern only if its notable portions exhibit a certain cyclic behavior. Thus, in accordance with the invention, a "segment-based" algorithm is provided to identify the potential period of a meta-pattern and, for each component of a possible period, the potential pattern candidate(s) and its length within the meta-pattern. The set of all meta-patterns can be categorized according to their structures and is evaluated in a designed order so that the pruning power provided by both properties can be fully utilized.
Accordingly, as will be explained in further detail below, the present invention provides the following advantageous features that serve to greatly improve pattern discovery in time series data such as event data:
(i) A meta-pattern model to capture the cyclic relationship among discovered periodic patterns and to enable a recursive construction of exhibited cyclic regularities.
(ii) A component property to provide further pruning power, in addition to the traditional a priori property.
(iii) A segment-based algorithm to identify potential meta-pattern candidates.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.