1. Technical Field
The present invention generally relates to data mining and, more particularly, to identifying partial periodic patterns in an event sequence. Imperfect patterns may be transformed into perfect patterns through random replacement of events.
2. Background Description
Periodic pattern discovery is an important problem in mining time series data. A periodic pattern is a list of ordered events which repeats itself in an event sequence. Periodic pattern discovery is useful in characterizing the cyclic behavior of time series data. However, in practice, not every portion in the time series data contributes to the repetitiveness of a given pattern(s) in the time series. For example, the stock of a particular company may gain a few points at the beginning of each trading session but may not have much regularity at other times. This type of looser repetitiveness is often referred to as xe2x80x9cpartial periodicyxe2x80x9d.
Moreover, due to random noise, a pattern may not always be perfectly repeated. Thus, the event sequence can be viewed as a series of perfect pattern repetitions with a few xe2x80x9crandom replacementsxe2x80x9d. That is, the event sequence would become a series of perfect repetitions of some pattern after a few replacements (of some of the events in the sequence). If the amount of xe2x80x9creplacementxe2x80x9d is below some reasonable threshold, the pattern may be considered to be xe2x80x9cexhibitedxe2x80x9d in the event sequence.
The mining of time series data is a newly developing research area. Most of the previous work on mining time series data includes creating a mapping to the association rule mining technique, i.e., each event corresponds to an item and every certain number of consecutive events are treated as a transaction. Metrics referred to as xe2x80x9csupportxe2x80x9d and xe2x80x9cconfidencexe2x80x9d are used to identify significant patterns from the remaining data. The support of a pattern is the number of occurrences of that pattern within a sequence of events. The confidence of a pattern is defined as the fraction specified by the number of occurrences of the pattern within the sequence of events over the length of the sequence. Most association rule mining methods favor frequently occurring events. However, patterns involving infrequent events may also be as significant as (or even more significant than) frequent events in an event sequence. This issue becomes more critical when different events occur at divergent frequencies. As examples, consider the following three illustrative applications.
The first application corresponds to the workload on a cluster of web servers. In the example, the cluster consists of 5 servers, with the workload on each server being measured by the following 4 ranks: low; relatively low; relatively high; and high. Thus, there are 45 (1024) different events, one for each possible combination of server states. Examination of the cluster states over time may show that the fluctuations of the server states comply with some periodic behavior. Obtaining such knowledge would be very beneficial for understanding the behavior of the cluster and for improving its performance.
The second application corresponds to earthquakes. In general, earthquakes are classified by their magnitude and type. Scientists may be interested in knowing whether there exists any inherent seismic period so that future earthquake predictions can be made. Note that patterns involving big earthquakes are much more valuable than patterns involving small earthquakes, even though big earthquakes occur at a much lower frequency than small earthquakes.
The third application is the stock market. Stock investors would like to understand the movement of the stock market so that they can take advantage of the movement to maximize their profits. In the example, the movement of the market on any given day can be categorized by the following descriptions: significant increase; moderate increase; unchanged; moderate decrease; and significant decrease. Even though movements such as significant increase and significant decrease occur less frequently than the other types of movements, the importance of capturing the former types of movements is readily apparent.
The above examples share a common characteristic; that is, different events may exhibit different rates of occurrence. Further, the above examples also show that patterns involving infrequent events are also of great importance. Thus, it would be desirable and highly advantageous to identify any strong tendency incurred by an event (or combination of events). Consider the following event sequence, a1 a1 a1 a1 a1 a2 a1 a1 a2 a1 a3 a2 a3 a1 a3 a3 a1 a3 a1 a1. The event a1 occurs 12 times while the event a2 occurs only 3 times. Despite its low frequency, the occurrence of a2 follows a very strong pattern. In particular, an occurrence of a2 is followed by another occurrence of a2 three positions later. This type of tendency or pattern in the event sequence may be very crucial in some application domains.
A closer look will now be taken at the following two patterns (a1, *, *) and (a2, *, *), wherein * represents an xe2x80x9ceternal eventxe2x80x9d. An eternal event is a virtual event that matches any event. Thus, an eternal event can be used to represent the xe2x80x9cdon""t carexe2x80x9d position in a pattern. An eternal event may also be referred to as xe2x80x9cnon-distinctxe2x80x9d, in which case an event such as a2 may also be referred to as xe2x80x9cdistinctxe2x80x9d. Even though the support of (a1, *, *) is greater than that of (a2, *, *) (due to the fact that al occurs much more frequently than a2), no strong tendency is preserved by the occurrence of a1. In the support/confidence model, in order to find pattern (a2, *, *), the support threshold has to be set lower than 3. In general, this would lead to the examination of a larger number of patterns which do not exhibit a strong tendency (e.g., (a1, *, *)). This indicates that support is not necessarily a proper measurement in evaluating periodic patterns. Thus, a new metric to measure the significance of a pattern needs to be designed to meet this goal. The support/confidence thresholds can be viewed as xe2x80x9cabsolutexe2x80x9d bounds applied to every event irrespective of their occurrence frequencies.
Thus, it would be desirable and highly advantageous to have a new metric for use in data mining which overcomes the deficiencies of the metrics of the prior art. Moreover, it would be desirable and highly advantageous to have a method for identifying significant patterns in subsequences of an event sequence.
The problems stated above, as well as other related problems of the prior art, are solved by the present invention, a method for identifying partial periodic patterns in event sequences.
According to a first aspect of the invention, there is provided a method for mining partial periodic patterns in an event sequence, wherein each pattern includes a list of events from the event sequence. At least one subsequence of the event sequence and at least one pattern in the at least one subsequence are identified, such that an information gain of the at least one subsequence with respect to the at least one pattern exceeds a predefined threshold. At least one of the at least one pattern and the at least one subsequence is stored.
According to a second aspect of the invention, each of the events in a given pattern is one of distinct and non-distinct, wherein each non-distinct event matches any of the distinct events.
According to a third aspect of the invention, the method further includes the step of determining the information gain of the at least one subsequence with respect to the at least one pattern. The determining step includes the step of dividing the at least one subsequence into contiguous segments, wherein each contiguous segment is a same length as the at least one pattern. For each contiguous segment, the information gain of the contiguous segment is calculated, when all of the events of the contiguous segment match the corresponding events of the at least one pattern. The information loss of the contiguous segment is calculated, when at least one of the events of the contiguous segment does not match the corresponding event of the at least one pattern. The information gain of the at least one subsequence si calculated, by summing the information gain and the information loss of each of the contiguous segments comprised therein.
According to a fourth aspect of the invention, the step of calculating the information gain of each of the contiguous segments includes the steps of assigning an information value to the at least one pattern, and assigning an information loss to at least one deviated pattern from the at least one pattern.
According to a fifth aspect of the invention, the step of assigning an information value to the at least one pattern includes the steps of determining an information value of each of the events in the at least one pattern, and summing the information values of each of the events in the at least one pattern.
According to a sixth aspect of the invention, the step of determining the information value of each of the events in the pattern includes the step of determining a probability of occurrence in the event sequence of each of the events in the pattern.
According to a seventh aspect of the invention, the method further includes the step of replacing at least one of the events in the at least one pattern, to obtain a higher degree of repetitiveness of the events in the at least one pattern with respect to the at least one subsequence.
According to an eight aspect of the invention, the at least one subsequence maximizes the information gain of the at least one pattern with respect to any other subsequences of the event sequence.
According to a ninth aspect of the invention, the step of identifying the at least one subsequence includes the step of identifying a subsequence of the event sequence that maximizes the information gain with respect to a pattern contained therein.
According to a tenth aspect of the invention, there is provided a method for mining partial periodic patterns in an event sequence, wherein each pattern comprising a list of events from the event sequence. Singular patterns are identified in the event sequence, wherein each singular pattern has only one position filled by one of the events in the event sequence. Complex patterns are identified in the event sequence, wherein each complex pattern has more than one position filled by one of the events in the event sequence. The complex patterns and subsequences of the event sequence that comprise the complex patterns are identified, such that an information gain of each of the subsequences with respect to the complex pattern comprised therein exceeds a predefined threshold. At least one of the complex patterns and the subsequences that contain the complex patterns are stored.
These and other aspects, features, and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.