The present invention relates generally to data processing techniques and, more particularly, to methods and apparatus for performing pairwise analysis of data.
As it becomes feasible to collect large volumes of data, businesses are increasingly looking for ways to capitalize on this data, especially market data. Thus, such businesses turn toward data mining techniques. As is known, data mining seeks to discover interesting and previously unknown patterns or information from a large amount of historical data often stored in a database, e.g., market data. Specifically, one key aspect of data mining involves searching for significant patterns embedded in the data. A pattern generally refers to a set of items denoted as pat={i1, i2, i3, . . . , ik}, where ik is the k-th item. In accordance with the present invention, we focus on pairwise patterns, i.e., patterns with two items.
To discover item sets that are of interest, two issues need to be addressed and, very often, these two issues also need to be compromised. The first issue relates to defining what patterns we intend to discover, and determining the level of interest associated with a particular pattern. To this end, many measurements have been developed: (1) frequency of item sets; (2) association rules; (3) dependency test; (4) mutual dependence; and so on. One measurement or a combination of measurements is needed to describe significant patterns for a specific application. For example, a marketing application may target the general population in order to optimize a profile. Thus, in this case, the frequency of patterns is the key measurement, see, e.g., R. Agrawal et al., xe2x80x9cMining Association Rules Between Sets of Items in Large Databases,xe2x80x9d Proceeding of VLDB, pp. 207-216, 1993, and also see, e.g., an extension of R. Agrawal et al. in H. Mannila et al., xe2x80x9cDiscovery of Frequent Episodes in Event Sequences,xe2x80x9d Data Mining and Knowledge Discovery, Vol. 1, No. 3, 1997, the disclosures of which are incorporated by reference herein.
In system management or intrusion detection tasks, discovery of infrequent, but significant, patterns may be required. Such patterns have been described in U.S. patent applications identified as: Ser. No. 09/918,253, entitled xe2x80x9cSystems and Methods for Discovering Mutual Dependence Patterns,xe2x80x9d and filed on Jul. 30, 2001; and Ser. No. 09/929,883, entitled xe2x80x9cSystems and Methods for Discovering Fully Dependent Patterns,xe2x80x9d and filed on Aug. 15, 2001, the disclosures of which are incorporated by reference herein. In both applications, patterns of interest are ones that occur non-randomly. It is known that a statistical test can be used to test whether a pattern occurs randomly. As a result, it is desirable to use the statistical test together with other measurements.
The second issue associated with pattern discovery that needs to be addressed involves determining mechanisms for efficiently mining a large amount of data, wherein the data may not be fully loaded into main memory. In one extreme, each possible pattern may be exhaustively examined in order to find all patterns satisfying a defined qualification function. However, since there are a huge number of possible item sets (to be precise, O(nk), where n is the number of distinct items and k is the length of the item sets), such a brute-force approach is not reasonable, even for a moderate number of distinct items.
R. Agrawal et al. (in the above-referenced xe2x80x9cMining Association Rules Between Sets of Items in Large Databasesxe2x80x9d article) devised the so-called xe2x80x9cApriorixe2x80x9d algorithm, and demonstrated that it is possible to discover all frequent item sets which are above a predefined minimum support (minsup). The Apriori algorithm searches item set space in a level-wise manner. At level k, the algorithm constructs candidate patterns with a length k based on qualified patterns in the previous level, and scans data to find all qualified patterns. This effectively prunes the search space. The reason that such an algorithm can be surprisingly effective in pruning the search space lies in the fact that the frequency (or support) has the downward closure property or, equivalently, the anti-monitone property. That is, if an item set is frequent above a threshold, then all its subsets are frequent above the threshold. Clearly, such an algorithm can be generalized to other qualification functions that are downward closed. Unfortunately, very few desired statistical qualification functions have such a downward closure property. For example, the confidence measure (e.g., predictive power) and the commonly-used dependency test do not satisfy this property.
So far, most previous work focused on discovering all patterns that satisfy a qualification condition that is downward closed. For example, frequent association patterns use the minsup as a qualification criterion. Other previous work defined mutual dependent patterns (m-patterns) and fully dependent patterns (d-patterns). An m-pattern requires that the mutual dependence of a pattern be above a threshold; while a d-pattern requires that all subsets of a pattern satisfy a dependency test.
However, the previous work has certain limitations. First, although a level-wise approach can significantly eliminate the pattern space, it is still computationally expensive to discover all patterns with any length, especially for long patterns. This is because all subsets of a pattern are qualified patterns. As a result, the algorithm needs to enumerate all the subsets. For example, a pattern with length 20 has more than 1030 qualified subpatterns, requiring enormous storage space and unrealistic computational time for a level-wise algorithm to discover such a pattern.
Second, to apply an efficient algorithm, a qualification measurement needs to have the downward closure property. Consequently, arbitrary conditions that are best for an application can not be designed. Many valuable measurements, such as predictive power and statistical testing, can not be applied in the mining process. Unfortunately, this results in missing many important patterns and discovering many false patterns.
Lastly, and most importantly, pattern discovery is an interactive and iterative process in which patterns found should be validated by domain experts before used in production. However, in practice, it is very difficult for humans to make sense of a long pattern because of the exponentially possible number of relationships of items. For example, in a pattern with three items, i.e., {a, b, c}, there can be many types of relationships: three items may be cascaded (e.g., a causes b and b causes c); two of the three items may trigger the third one (e.g., a and b cause c); and one item may trigger the other two (e.g., a causes b and c). Clearly, each type has many possible permutations. As a result, how to interpret a pattern with three or more items becomes a very difficult task for a domain expert.
The present invention provides techniques for mining or discovering two-item or pairwise patterns. Preferably, the pairwise patterns are mined or discovered from a large amount of temporal historical events.
In a first aspect of the invention, a technique for mining or discovering one or more patterns in an input data set, wherein the input data set is characterized by attributes, comprises the following steps. First, the technique includes mapping attributes of the input data set to mapping values. Then, one or more candidate patterns are formed as groupings of two mapping values that occur within a predefined time period. Next, for each of the one or more candidate patterns, a qualification function is computed and a result of the qualification function is compared with at least one predefined threshold value. The one or more candidate patterns whose qualification function results are greater than or equal to the predefined threshold value are identified as one or more qualified patterns.
In a second aspect of the invention, the mapping step comprises mapping attributes of at least two types associated with the input data set to item values, and forming one or more candidate patterns as groupings of two item values that occur within a predefined time period. When the input data set comprises event data, the at least two attribute types are an event type and an event source identifier. This mapping corresponds to inter-host mapping.
In a third aspect of the invention, the mapping step comprises mapping attributes of at least one type associated with the input data set to item values, mapping attributes of at least another type associated with the input data set to key values, and forming one or more candidate patterns as groupings of two item values that are associated with the same key value and occur within a predefined time period. When the input data set comprises event data, the at least one attribute type is an event type and the at least another attribute type is an event source identifier. This mapping corresponds to intra-host mapping.
In a fourth aspect of the present invention, the qualification function may comprise one or more significance measurements such as an occurrence count, a predictive power measure, a mutual dependence measure, a correlation score, and a chi-squared score. Still further, the mapping values (e.g., item values and key values), qualification function, and the threshold value are preferably specified by a user.
In a fifth aspect of the present invention, the one or more qualified patterns may be visualized. The visualizations may comprise lists and/or graphical representations. By way of one graphical representation, a qualified pattern may be visually presented to a user by nodes representing at least one of the attributes of the input data set and a link connecting the nodes representing that the nodes form a qualified pattern. The link may be visually encoded to represent the result of the qualification function. Also, the node may be visually encoded to represent a measure e.g., an occurrence count, associated with a corresponding attribute.
In a sixth aspect of the present invention, the qualified patterns may be summarized to reveal multi-item relationships. The summarizing operation may comprise grouping qualified patterns having at least one identical mapping value (e.g., item value) into a group. Alternatively, the summarizing operation may comprise merging qualified patterns into one or more higher order patterns.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.