In known network management systems referred to as episode mining systems, platforms have been arranged which track or log the activities of users across various sites and resources of the network. Known platforms can, for example, be configured to log user activity such as visiting various Web sites, sending or receiving email, initiating instant message (IM) or other chat or messaging sessions, accessing a telephone support line, or others. Network event logs are becoming an important data source in business applications. A typical example of an event log is a Web server access log, which maintains a history of page requests from users. Enterprises need to analyze such Web server logs to discover valuable information, including Web site traffic patterns and user activity patterns by time of day, time of week, or time of year.
For instance, Human Resource portals (e.g., in benefits enrollment and maintenance) or Financial Service portals (e.g., student loans) often need to monitor and analyze various user events, such as life events (e.g., marriage, having a baby, etc.) and financial choices (e.g., 401K, insurance, etc.), obtained from multiple applications or services including telephone calls, web logs, and chats. Using the results (i.e., sequential patterns of user events), a systems administrator can precisely target the future events (e.g., future phone calls) of each user. A core part of episode mining systems is discovering episodes from large event logs, where an episode is defined as a collection of events occurring frequently together within a time interval (the terms “episode” and “frequent pattern” are used interchangeably herein). An event log is divided into a set of event sequences by a certain time interval (e.g., a web session in a web log).
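The division of an event log into event sequences can be illustrated with a minimal sketch. It assumes, purely for illustration, that each log record is a (user, timestamp, event type) tuple and that a new sequence starts whenever a user is inactive for longer than a chosen gap; the function name `split_into_sequences` and the `max_gap` parameter are hypothetical and are not taken from any known platform.

```python
from collections import defaultdict

def split_into_sequences(events, max_gap):
    """Group (user, timestamp, event_type) records into per-user event sequences.

    Assumption: a new sequence (e.g., a web session) begins when the time
    between two consecutive events of the same user exceeds max_gap.
    """
    by_user = defaultdict(list)
    for user, ts, etype in events:
        by_user[user].append((ts, etype))

    sequences = []
    for user in sorted(by_user):
        current, last_ts = [], None
        for ts, etype in sorted(by_user[user]):
            # Close the running sequence when the inactivity gap is exceeded.
            if last_ts is not None and ts - last_ts > max_gap:
                sequences.append(current)
                current = []
            current.append(etype)
            last_ts = ts
        if current:
            sequences.append(current)
    return sequences

# Hypothetical log: timestamps in seconds, events labeled by single letters.
log = [("u1", 0, "A"), ("u1", 10, "B"), ("u2", 5, "A"),
       ("u1", 2000, "C"), ("u2", 20, "C")]
sequences = split_into_sequences(log, max_gap=1800)
```

Other division rules, such as fixed calendar windows, would fit the same description; the inactivity gap is only one plausible choice.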
Although a number of episode mining (also referred to as frequent sequential pattern mining) platforms and techniques have been developed, their mining performance is limited in that they require from a few hours to a few days to execute, especially when very large event logs are processed as input. Therefore, classical episode mining techniques have typically been implemented as batch processes only. Two of the known episode mining approaches are the classical apriori algorithm and its more general version, sometimes referred to as the GSP algorithm. Generalized algorithms such as GSP have also been introduced to integrate some domain-specific time constraints into sequential patterns. The basic operation of the known apriori algorithm is as follows:
While k-length candidate patterns exist:
    1) For each k-length candidate pattern
        i. For each event sequence in event log
            ii. If the candidate pattern exists in the event sequence
                1. Increase the count of the occurrence of the candidate pattern;
    2) Record a set of frequent k-length patterns, based on their counts;
    3) Building a set of new (k+1)-length candidate patterns from the prior frequent patterns;
    4) Increase the pattern length (i.e., k=k+1).
The length of a pattern indicates the number of events in the pattern or sequence. For example, a 3-length pattern looks like “ABC”, where each character represents an event. As shown in the steps above, the apriori approach iteratively scans all event sequences to identify frequent patterns. Specifically, at each step in the iteration (i.e., the outer while loop), it builds new candidates from the previous frequent patterns (step 3), increases the pattern length by one (step 4), and discovers new frequent patterns by scanning all given event sequences (step 1). The iteration terminates when the process cannot build further candidates.
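The apriori steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: a pattern is taken to "exist" in an event sequence if its events appear in order (not necessarily adjacent), a pattern is frequent if at least `min_support` sequences contain it, and new candidates are formed by joining two frequent patterns that overlap in all but one event. The names `contains`, `apriori_episodes`, and `min_support` are hypothetical.

```python
from collections import Counter

def contains(sequence, pattern):
    """True if the events of pattern appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(event in it for event in pattern)

def apriori_episodes(sequences, min_support):
    """Return {pattern: count} for all patterns found in >= min_support sequences."""
    # Initial candidates: every distinct event as a 1-length pattern.
    candidates = sorted({(e,) for seq in sequences for e in seq})
    frequent = {}
    while candidates:
        # Step 1: scan every event sequence for every candidate.
        counts = Counter()
        for cand in candidates:
            for seq in sequences:
                if contains(seq, cand):
                    counts[cand] += 1
        # Step 2: record the frequent k-length patterns.
        level = {p: c for p, c in counts.items() if c >= min_support}
        frequent.update(level)
        # Steps 3 and 4: join frequent k-length patterns into (k+1)-length candidates.
        candidates = sorted({p + q[-1:] for p in level for q in level
                             if p[1:] == q[:-1]})
    return frequent

# Each character represents an event, as in the "ABC" example.
result = apriori_episodes(["ABC", "ABD", "ACB", "ABC"], min_support=3)
```

Note that the inner loops visit every event sequence once per candidate at every pattern length, which is the repeated scanning that the text identifies as the source of the performance issue.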
The bottleneck of the apriori algorithm is mainly caused by blindly scanning all given event sequences at each step against a number of candidates. The apriori technique must perform this type of scanning multiple times to generate frequent patterns of all different lengths. Although this algorithm is direct and can be extended to integrate various domain specifics, it is known that its applications can be limited by this performance issue when analyzing large event logs.
Known pattern-growth algorithms have been introduced in an attempt to speed up episode mining. This class of algorithm does not build candidate episodes. Instead, it recursively constructs new projected sub-sets of event sequences and then restricts the pattern search to each projected set. For example, an X-projected set is the sub-set of event sequences that start with the pattern prefix “X”. The size of this X-projected set indicates the frequency of the pattern “X”. At each step, the algorithm extends the prefix (e.g., “XY”), and then constructs a new projected set (e.g., XY-projected set) from the X-projected set.
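The projection idea can be illustrated with a small recursive sketch. It assumes one common way of realizing a projected set: for each event sequence that contains the next event, keep only the remainder of the sequence after the first occurrence of that event. The names `project`, `pattern_growth`, and `min_support` are hypothetical, and the sketch is not meant to reproduce any specific published algorithm.

```python
def project(sequences, event):
    """Build the projected set for `event`: for each sequence containing it,
    keep the suffix that follows its first occurrence."""
    projected = []
    for seq in sequences:
        for i, e in enumerate(seq):
            if e == event:
                projected.append(seq[i + 1:])
                break
    return projected

def pattern_growth(sequences, min_support, prefix=()):
    """Return {pattern: count}, growing the prefix one event at a time."""
    frequent = {}
    events = sorted({e for seq in sequences for e in seq})
    for event in events:
        projected = project(sequences, event)
        # The size of the projected set is the frequency of the extended prefix.
        if len(projected) >= min_support:
            pattern = prefix + (event,)
            frequent[pattern] = len(projected)
            # Search for longer patterns only inside the projected set.
            frequent.update(pattern_growth(projected, min_support, pattern))
    return frequent

result = pattern_growth(["ABC", "ABD", "ACB", "ABC"], min_support=3)
```

No candidate list is built here; the search space shrinks as each projected set contains only the remaining suffixes, although constructing those sets at every recursion level carries its own cost.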
It is known that approaches using pattern-growth algorithms usually outperform apriori-based approaches because they avoid building candidates and reduce the search space as the mining progresses. However, constructing projected sets can likewise be computationally expensive. Additionally, under a high pattern density condition, where identical patterns are densely packed into event sequences, the reduction rate of the search space over the complete mining process can be lower than expected due to large overlaps between projected sets. This condition can occur in many event logs, such as web logs, because the number of event types (e.g., requested URLs) is usually small, and event types can be densely distributed over event sequences (i.e., patterns can be found very frequently in an event sequence). Moreover, since these approaches have not considered mining conditions such as time interval constraints and time windows to flexibly define patterns, as GSP does, it is difficult to apply domain-specific time constraints to these approaches.
The pattern-growth algorithm can therefore be computationally expensive for large event logs. Moreover, the possible speed-up can be limited, especially under the noted high pattern density conditions. The pattern-growth method also has not considered domain-specific time constraints, such as time interval, time gap, and moving windows. As an alternative, statistical approaches can also be used, but only for relatively small datasets, and only by assuming some specific distributions for the event sequence history.
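To make the notion of a time gap constraint concrete, the following sketch checks whether a pattern occurs in a timestamped event sequence such that consecutive matched events are no more than a chosen gap apart. The exact form of the constraint is an assumption made for illustration, and the names `contains_with_max_gap` and `max_gap` are hypothetical. The search backtracks because the earliest match of an event is not always the one that satisfies the gap.

```python
def contains_with_max_gap(timed_sequence, pattern, max_gap):
    """timed_sequence: list of (timestamp, event) pairs sorted by timestamp.

    True if pattern occurs in order with at most max_gap time units between
    each pair of consecutive matched events (assumed form of the constraint).
    """
    def match(start, k, prev_ts):
        if k == len(pattern):
            return True
        for i in range(start, len(timed_sequence)):
            ts, event = timed_sequence[i]
            # Timestamps are sorted, so every later event is also too far away.
            if prev_ts is not None and ts - prev_ts > max_gap:
                break
            if event == pattern[k] and match(i + 1, k + 1, ts):
                return True
        return False

    return match(0, 0, None)

timed = [(0, "A"), (50, "A"), (60, "B")]
found = contains_with_max_gap(timed, "AB", max_gap=20)    # second "A" is used
missed = contains_with_max_gap(timed, "AB", max_gap=5)
```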
Due to these constraints and other factors, all of the above known methods have been limited to implementation as batch processes only, given the limited computing resources available to developers, administrators, and other users. Potentially, they do not discover or generate all highly frequent patterns, particularly if they are forced to complete within a short runtime threshold.
Moreover, the classical episode mining methods outlined above are limited in their application under some domain-specific constraints, again mainly due to the performance issue. Those constraints include the requirement for “agile mining”, or the ability to see highly frequent patterns within a short time, as well as detection accuracy and the completeness of generated episodes, and the tradeoff between them. Accuracy of detection means that the episode mining technique does not miss any highly frequent patterns, and the completeness criterion means that the episode mining process will generate all possible frequent episodes. To achieve reasonable accuracy and completeness, an episode mining algorithm must explore all possible candidate episodes in apriori-based approaches, or all possible projected sub-sets of event sequences in pattern-growth-based approaches. Meanwhile, to meet the requirement of agile mining, especially within a time threshold that may be given by a user, known episode mining algorithms potentially sacrifice accuracy and completeness by interrupting the process before it is completed. Known classical methods therefore do not guarantee accuracy when producing time-constrained intermediate results.
This is the case even though, in many practical situations, administrators or other users often want to quickly capture highly frequent patterns to see high-level pattern trends, or to target certain problems in an agile manner, instead of waiting hours or days until a batch process generates a complete set of episodes. They may want to flexibly provide an upper-bound time threshold to the system to obtain the highly frequent and meaningful patterns first, and discover the rest of the patterns when time permits, possibly by later batch processing.
It may therefore be desirable to provide methods and systems for self-adaptive episode mining under a time threshold using delay estimation and temporal division, in which data mining operations can be conducted under a given time threshold or limit, and which may be achieved by quantifying and incorporating analysis of the tradeoff between the completeness and the runtime budget or other time threshold, limit, and/or constraint of the desired episode mining.