The present invention relates generally to data processing techniques and, more particularly, to methods and apparatus for discovering mutual dependence patterns in data.
Data mining seeks to discover interesting and previously unknown patterns or information from a large amount of historical data often stored in a database. Specifically, one key aspect of data mining is to search for significant patterns embedded in the data. Here, a pattern refers to a set of items denoted as pat={i1, i2, i3, . . . , ik}, where ij is the j-th item.
Existing approaches have focused on discovering one special form of pattern called a frequent association pattern, referred to as an "fa-pattern." Fa-patterns are patterns whose support (or occurrences) in data is above a predefined minimum support threshold called minsup. Several applications of the fa-pattern have been studied. The most popular one is the "market basket" analysis, in which an algorithm is applied to mine transaction data consisting of a set of transactions. A transaction is a set of items purchased by a customer. For example, a customer may buy milk, bread, and beer together. The corresponding transaction is thus trans={a, b, c}, where a, b, and c represent milk, bread, and beer, respectively. The association discovery problem can be formally stated as: find all patterns (i.e., sets of items) whose number of co-occurrences in D is above a predefined threshold called minimum support (minsup), where D is a set of N transactions {trans1, . . . , transN}. We note that an item here is a generic name. It is mapped to an original data object by a certain mapping scheme. For example, an original data object of transaction data may have multiple attributes such as the type of goods (e.g., milk, beer), its brand, quantity, and purchase time. One commonly-used mapping is to map the values of the type into items. For example, milk is represented by "a."
Fa-patterns can be generalized to handle temporal events. Here, the temporal event data is an ordered sequence with length N: {(i1, t1), (i2, t2), . . . , (iN, tN)}, where time ti ≤ tj if i ≤ j. The temporal association discovery problem can be stated as: find all patterns whose number of co-occurrences within a time window w is above minsup. Here, the time window is introduced to essentially segment an ordered time sequence into transactions.
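As a rough illustration of how a time window can segment an ordered event sequence into transactions, the following Python sketch groups events into consecutive, non-overlapping windows of width w. The function name `segment_into_transactions`, the sample events, and the choice of non-overlapping windows are assumptions made for illustration only; the text above does not specify how the windows are placed.

```python
from collections import defaultdict

def segment_into_transactions(events, w):
    """Group (item, time) pairs into item sets, one per window of width w.

    Assumption: windows are consecutive and non-overlapping, i.e. an event
    at time t falls into window number floor(t / w).
    """
    buckets = defaultdict(set)
    for item, t in events:
        buckets[int(t // w)].add(item)
    # Return the non-empty windows in time order.
    return [buckets[k] for k in sorted(buckets)]

# Hypothetical event sequence, ordered by time.
events = [("a", 0.5), ("b", 1.0), ("c", 4.2), ("a", 5.0), ("b", 5.9)]
transactions = segment_into_transactions(events, w=2.0)
```

With these sample events, the first two events share one window and the last three share another, so two transactions result. A sliding-window variant would be an equally plausible reading.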
Finding all fa-patterns is not a trivial task because the pattern space is exponentially large: on the order of n^k, where n is the number of distinct items, and k is the maximum length of a pattern. Brute-force iteration is computationally intractable. Recently, Agrawal et al. (as described in R. Agrawal et al., "Mining Association Rules Between Sets of Items in Large Databases," Proc. of VLDB, pp. 207-216, 1993, the disclosure of which is incorporated by reference herein) developed an algorithm called "Apriori" to discover all fa-patterns. This algorithm searches the pattern space in a level-wise manner by the following four-step process:
1. Initialization. The data is scanned to find all fa-patterns with only one item. k is set to 2.
2. Construct candidate patterns with length k. This is typically done by a join operation on the fa-patterns found in the previous level, followed by a pruning operation.
3. Count the candidate patterns. Data is scanned in order to count the occurrences of candidate patterns.
4. Find fa-patterns at the k-th level. Fa-patterns are those candidate patterns whose count (or number of occurrences) is above minsup.
This procedure proceeds level by level until no more patterns can be found. The key idea of this algorithm is to search the pattern space in a level-wise manner. The fa-patterns found at the current level are used to prune the search space for the next level. In this way, the number of patterns to be searched is reduced, and the number of data scans is the maximum length of the fa-patterns. Since the introduction of the "Apriori" algorithm, work has been done to improve the algorithm so as to reduce the number of data scans, reduce the memory requirement, and improve efficiency through different search strategies.
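The four steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than the algorithm as published: the function name `apriori`, the sample transactions, and the use of "count >= minsup" as the threshold test are assumptions made for this sketch.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {pattern: count} for every pattern whose count is >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    # Step 1: scan the data once to find frequent single-item patterns.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {p: c for p, c in counts.items() if c >= minsup}
    frequent = dict(current)

    k = 2
    while current:
        # Step 2: join patterns of length k-1, then prune any candidate
        # that has a (k-1)-item subset which is not itself frequent.
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(s) in current for s in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Step 3: scan the data to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}

        # Step 4: keep the candidates that meet the threshold.
        current = {p: c for p, c in counts.items() if c >= minsup}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical data: five transactions over items a, b, c.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(data, minsup=3)
```

In this sample, every single item and every pair meets the threshold, while {a, b, c} occurs only twice and is dropped at the third level, after which the loop stops.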
Association discovery has been widely employed for applications such as market basket analysis. The technique's success is partly because frequent patterns capture patterns that are popular, and thus provide valuable information for directly marketing to relevant portions of a large population, rather than to the entire population. This maximizes the advertisement return.
However, frequent patterns may not be of equal interest in other tasks such as, for example, problem detection in computer system management, intrusion detection in computer systems, and credit card fraud detection. There are several fundamental reasons for this situation:
1. Frequent patterns are not always of interest. In the aforementioned applications, the normal operations or behaviors are usually massive in quantity. It is not a surprise that a large number of frequent patterns can be found in these applications. However, most of them relate to normal behaviors, and thus are usually not very informative. This is simply because normal operations are usually known by domain expertise, and are actionable. Furthermore, even if a frequent pattern relates to a problem, it is usually known already through other means, since unknown problematic situations are, in general, infrequent in these applications.
2. Infrequent patterns may be of interest. For example, a system management application may be required to discover problematic situations that are expected to be rare. Applying existing algorithms for discovering association patterns will not result in finding infrequent patterns, unless the minimum support threshold is set to be extremely low. This, however, results in far too many uninteresting patterns.
3. Co-occurrence does not necessarily reflect the dependence of items in a pattern. By definition, the number of occurrences of an fa-pattern is at least minsup. However, this does not guarantee any real dependence among items in an fa-pattern. In an extreme case, a set of independent items may qualify as an fa-pattern because the frequent association does not take into consideration the distribution of each item. This is further explained by the following example. Assume that items a and b each occur independently and randomly in 50% of all transactions. In this case, the expected frequency of the co-occurrence of {a,b} is 25%, which is still substantial, and may well be above minsup.
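The arithmetic behind this example can be checked with a few lines of Python. Under independence, the probability of a and b occurring together is the product of their individual probabilities. The helper name `expected_cooccurrence`, the data set size of 1000 transactions, and the minsup of 200 are hypothetical values chosen only to illustrate the point.

```python
from fractions import Fraction

def expected_cooccurrence(p_a, p_b, n_transactions):
    """Expected support and count of {a, b} when a and b are independent."""
    support = p_a * p_b  # independence: P(a and b) = P(a) * P(b)
    return support, support * n_transactions

# Items a and b each appear in 50% of transactions (from the example above).
support, expected_count = expected_cooccurrence(Fraction(1, 2), Fraction(1, 2), 1000)

# Hypothetical threshold: a pattern needs at least 200 occurrences.
hypothetical_minsup = 200
passes_minsup = expected_count >= hypothetical_minsup
```

Here the expected support is 1/4, so two unrelated items would be expected to co-occur in about 250 of 1000 transactions and would clear the hypothetical threshold despite having no dependence on each other.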
The present invention provides techniques for mining or discovering infrequent patterns that cannot be found effectively and efficiently using existing pattern mining techniques and which may be valuable for a variety of applications. Specifically, the invention defines a mutual dependence pattern or "m-pattern." An m-pattern captures a set of items that often occur together regardless of the number of occurrences. Thus, infrequent, but interesting, patterns may be found.
In one aspect of the invention, a technique for mining one or more patterns in an input data set of items comprises identifying one or more sets of items in the input data set as one or more patterns based on respective comparisons of conditional probability values associated with each of the one or more sets of items to a predetermined threshold value. The one or more identified patterns are output based on results of the comparisons. The input data set may comprise such data as event data and/or transaction data.
In one embodiment, the identifying operation may comprise identifying a set of items in the input data set, which includes at least two subsets of at least one item, as a pattern when the set of items has a conditional probability value computed therefor that is not less than a predetermined threshold value, wherein the conditional probability value is indicative of a probability that both of the at least two subsets of at least one item will occur given that one of the at least two subsets of at least one item has occurred.
In another embodiment, the identifying operation may comprise identifying a set of items in the input data set as a pattern when the set of items has a conditional probability value computed for the set of items minus a particular item of the set, given the particular item of the set, that is not less than a predetermined threshold value.
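One way to picture the per-item test described in the preceding paragraph is the Python sketch below. It rests on several assumptions that are not stated in the text: the conditional probability is estimated as the count of transactions containing the whole set divided by the count of transactions containing the given item, the test is applied to every item of the set, and the names `count`, `is_m_pattern`, and `minp` (the threshold) are hypothetical.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= frozenset(t))

def is_m_pattern(transactions, pattern, minp):
    """Check P(pattern minus {a} | a) >= minp for every item a in the pattern.

    Assumption: the conditional probability is estimated as
    count(pattern) / count({a}).
    """
    pattern = frozenset(pattern)
    joint = count(transactions, pattern)
    if joint == 0:
        return False  # the set never occurs, so there is nothing to test
    # joint > 0 implies every single-item count is > 0, so no division by zero.
    return all(joint / count(transactions, [a]) >= minp for a in pattern)

# Hypothetical data: {x, y} is rare but x and y always appear together,
# whereas a and b are common but rarely appear together.
data = [
    {"x", "y"}, {"x", "y"}, {"a", "b"}, {"a"}, {"a"},
    {"b"}, {"b"}, {"a"}, {"b"}, {"c"},
]
rare_but_dependent = is_m_pattern(data, {"x", "y"}, minp=0.8)
common_but_loose = is_m_pattern(data, {"a", "b"}, minp=0.8)
```

In this sample, {x, y} occurs in only two of ten transactions yet passes the test, because each item implies the other every time it appears. The set {a, b} fails, since a and b co-occur in only one of the four transactions that contain each of them.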
In another aspect of the invention, a technique for mining one or more patterns in an input data set of items comprises: obtaining an input data set of items; searching the input data set of items to identify one or more sets of items in the input data set as one or more patterns based on respective comparisons of conditional probability values associated with each of the one or more sets of items to a predetermined threshold value; and outputting the one or more identified patterns based on results of the comparisons.
Prior to the searching operation, the input data set may be normalized so that the data is not application-dependent. The outputting operation may convert the one or more identified patterns into a human readable format. The searching operation may comprise performing a level-wise scan based on a set length to determine candidate sets of items in the input data set that have conditional probability values respectively computed therefor that are not less than the predetermined threshold value. The search step may also comprise pruning candidate sets based on an upper bound property. In one embodiment, the upper bound property specifies that only candidate sets are considered where the conditional probability of a set of items minus a particular subset of items given the particular subset of items is not greater than the number of occurrences of the set of items minus the particular subset of items divided by the number of occurrences of the subset of items.
Discovering such m-patterns may benefit many applications. For example, such an m-pattern in an event management application indicates that a set of events, if they occur, occur together with high probability. This implies strong correlation among events. Thus, event correlation rules can be developed for the purpose of either event compression or on-line monitoring. In another example, an m-pattern in a customer transaction analysis indicates that a set of items, say milk and bread, are likely to be bought together. By knowing such information, a store manager may better arrange items in the store (e.g., putting milk and bread in nearby locations). The store manager may also develop better promotion strategies (e.g., lowering the price of milk, but increasing the price of bread). Of course, in accordance with the principles of the invention taught herein, one of ordinary skill in the art will realize many other applications of m-patterns.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.