As it becomes feasible to collect large volumes of data, businesses are increasingly looking for ways to capitalize on this data, especially market data. Thus, such businesses turn toward data mining techniques. As is known, data mining seeks to discover interesting and previously unknown patterns or information from a large amount of historical data often stored in a database, e.g., market data. Specifically, one key aspect of data mining is to search for significant patterns embedded in the data. Here, a pattern refers to a set of items denoted as pat = {i_1, i_2, i_3, ..., i_k}, where i_k is the k-th item.
Existing approaches have focused on discovering one special form of pattern called a frequent association pattern, referred to as an “fa-pattern.” Fa-patterns are patterns whose support (i.e., number of occurrences) in the data is above a predefined minimum support threshold called minsup. Several applications of the fa-pattern have been studied. The most popular one is “market basket” analysis, in which an algorithm is applied to mine transaction data consisting of a set of transactions. A transaction is a set of items purchased by a customer. For example, a customer may buy milk, bread, and beer together. The corresponding transaction is thus trans = {a, b, c}, where a, b, and c represent milk, bread, and beer, respectively. The association discovery problem can be formally stated as: find all patterns (i.e., sets of items) whose number of co-occurrences in D is above a predefined threshold called minimum support (minsup), where D is a set of N transactions {trans_1, ..., trans_N}. We note that “item” here is a generic name; an item is mapped to an original data object by a certain mapping scheme. For example, an original data object of transaction data may have multiple attributes such as the type of goods (e.g., milk, beer), its brand, quantity, and purchase time. One commonly used mapping is to map the values of the type attribute into items. For example, milk is represented by “a.”
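The support definition above can be illustrated with a minimal sketch. The data set D, the helper names support and is_fa_pattern, and the reading of “above minsup” as “greater than or equal to minsup” are assumptions made for illustration only.

```python
def support(pattern, transactions):
    """Count the transactions that contain every item of `pattern`."""
    items = set(pattern)
    return sum(1 for trans in transactions if items <= set(trans))


def is_fa_pattern(pattern, transactions, minsup):
    """Assumption: 'above minsup' is read as support >= minsup."""
    return support(pattern, transactions) >= minsup


# Hypothetical transactions; a, b, and c stand for milk, bread, and beer.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]

print(support({"a", "b"}, D))            # 2
print(is_fa_pattern({"a", "b"}, D, 2))   # True
print(is_fa_pattern({"b", "c"}, D, 2))   # False
```

Here {a, b} co-occurs in two of the four transactions and so qualifies at minsup = 2, whereas {b, c} co-occurs only once.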
Fa-patterns can be generalized to handle temporal events. Here, the temporal event data is an ordered sequence with length N: {(i_1, t_1), (i_2, t_2), ..., (i_N, t_N)}, where t_i ≤ t_j if i ≤ j. The temporal association discovery problem can be stated as: find all patterns whose number of co-occurrences within a time window w is above minsup. Here, the time window is introduced to essentially segment an ordered time sequence into transactions.
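One way to picture the segmentation is sketched below. The text does not say whether windows slide or are disjoint; this sketch assumes fixed, non-overlapping windows of width w that start at the first timestamp, and the function name segment_by_window and the sample events are hypothetical.

```python
def segment_by_window(events, w):
    """Group time-ordered (item, time) pairs into one item set per window.

    Assumption: windows are disjoint, have width w, and start at the
    timestamp of the first event. Empty windows are skipped.
    """
    if not events:
        return []
    start = events[0][1]
    buckets = {}
    for item, t in events:
        index = int((t - start) // w)  # which window this event falls into
        buckets.setdefault(index, set()).add(item)
    return [buckets[index] for index in sorted(buckets)]


# Hypothetical event sequence, ordered by time.
events = [("a", 0), ("b", 1), ("c", 5), ("a", 6), ("b", 12)]
transactions = segment_by_window(events, w=5)
print(len(transactions))  # 3
```

Each resulting item set can then be treated as a transaction and counted exactly as in the non-temporal case.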
Finding all fa-patterns is not a trivial task because the pattern space is exponentially large, to be precise, n^k, where n is the number of distinct items, and k is the maximum length of a pattern. Brute-force iteration is computationally intractable. Recently, Agrawal et al. (as described in R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proc. of VLDB, pp. 207–216, 1993, the disclosure of which is incorporated by reference herein) developed an algorithm called “Apriori” to discover all fa-patterns. This algorithm searches the pattern space in a level-wise manner by the following four-step process:
1. Initialization. The data is scanned to find all fa-patterns with only one item. k is set to 2.
2. Construct candidate patterns with length k. This is typically done by a join operation on the fa-patterns found in the previous level, followed by a pruning operation.
3. Count the candidate patterns. Data is scanned in order to count the occurrences of candidate patterns.
4. Find fa-patterns at the k-th level. Fa-patterns are those candidate patterns whose count (i.e., number of occurrences) is above minsup.
This procedure proceeds level by level until no more patterns can be found. The key idea of this algorithm is to search the pattern space in a level-wise manner. The fa-patterns found at the current level are used to prune the search space for the next level. In this way, the number of patterns to be searched is reduced, and the number of data scans is on the order of the maximum length of the fa-patterns. Since the introduction of the “Apriori” algorithm, work has been done to improve the algorithm so as to reduce the number of data scans, reduce the memory requirement, and improve efficiency through different search strategies.
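The four steps can be sketched as follows. This is a simplified illustration of the level-wise search rather than the published implementation; the function name apriori, the sample data, and the reading of “above minsup” as “greater than or equal to minsup” are assumptions.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return all item sets whose count is >= minsup (assumed reading)."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data for the current set of candidates.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Step 1: find frequent patterns with only one item.
    singletons = {frozenset([i]) for t in transactions for i in t}
    level = {c for c, n in count(singletons).items() if n >= minsup}
    frequent = set(level)
    k = 2
    while level:
        # Step 2: join patterns from the previous level, then prune any
        # candidate that has an infrequent subset of length k - 1.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Steps 3 and 4: count the candidates and keep the frequent ones.
        level = {c for c, n in count(candidates).items() if n >= minsup}
        frequent |= level
        k += 1
    return frequent


# Hypothetical transactions.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(D, minsup=2)
print(sorted("".join(sorted(p)) for p in result))
# ['a', 'ab', 'ac', 'b', 'c']
```

In this run the candidate {a, b, c} is never counted: its subset {b, c} is not frequent at level 2, so the pruning step discards it before the data is scanned.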
However, in applications such as detecting anomalies in computer networks and identifying security intrusions, there is much more interest in patterns that predict undesirable situations, such as service disruptions. Such patterns are often infrequent (at least in well managed systems) and are characterized by statistical dependency rather than their frequency. Unfortunately, the statistical dependency based on a conventional dependency test yields neither upward nor downward closure, and hence efficient discovery algorithms cannot be constructed.
The present invention is motivated by issues that have been encountered in discovering patterns of events in computer networks. First, as indicated above, the present invention is concerned with how to discover infrequent, but dependent item sets. In computer networks, dependent temporal event sequences provide knowledge to predict later events, which is of particular interest if these events are related to malfunctions and/or service disruptions. Unfortunately, existing mining techniques require the support thresholds to be set very low in order to discover infrequent patterns. This results in a large number of unimportant patterns mixed in with a few patterns of interest.
Second, an application may have to deal with data collected in a noisy environment. In networks, data may be lost because of severed communication lines or router buffer overflows. In help desk systems, data may be corrupted because of human errors. Some valid patterns will be missed due to the presence of noise. To illustrate, suppose there is a 15-item pattern with a true frequency of 15% and the minimal support is set to be 10%. Assume that the data is received through a transmission channel in which each item could be lost with a probability of 5%. Due to the missing information, the observed frequency will be 15% × (0.95)^15, or approximately 7%, which is less than the minimal support. With a bit more calculation, it can be seen that only subpatterns with lengths no greater than 7 would satisfy a minimal support of 10%. Consequently, instead of reporting one long pattern with length 15, over 6435 subpatterns with length 7 or less are found and reported as the maximal frequent item sets. Clearly, the problem is due to the fixed minimum support threshold that favors short item sets over long ones.
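The arithmetic in this example can be checked with a short sketch. It assumes that items are lost independently and that a length-k subpattern is observed with frequency 15% × (0.95)^k; the variable and function names are chosen for illustration.

```python
from math import comb

true_freq = 0.15   # true frequency of the 15-item pattern
p_loss = 0.05      # probability that any single item is lost
length = 15        # length of the full pattern
minsup = 0.10      # minimal support


def observed(k):
    """Expected observed frequency of a length-k subpattern,
    assuming each item is lost independently."""
    return true_freq * (1 - p_loss) ** k


# Longest subpattern length that still meets the minimal support.
longest = max(k for k in range(1, length + 1) if observed(k) >= minsup)

print(f"{observed(length):.3f}")  # 0.069
print(longest)                    # 7
print(comb(length, longest))      # 6435
```

The count 6435 is the number of ways to choose 7 items out of 15, that is, the number of distinct length-7 subpatterns of the original pattern.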
Third, one may be concerned with skewed distributions of items. It has been found, through experience with alarms in computer systems, that 90% of the events are often generated by only 10% or 20% of the hosts. In B. Liu et al., “Mining Association Rules with Multiple Minimum Supports,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, the disclosure of which is incorporated by reference herein, it has been argued that this is a major obstacle to applying traditional association mining in which a minimum support is fixed. This motivated B. Liu et al. to use multiple minimum support thresholds that take into account the distribution of items and the length of an item set. However, this introduces extra parameters, which complicates pattern discovery.