1. Field of the Invention
The present invention relates to data mining, and more particularly to mining long patterns in a large data set.
2. Description of the Related Art
Finding patterns in data has been an important but difficult task in many industries. Pattern discovery is of increasing importance in many applications including but not limited to biology study, consumer behavior analysis, system performance analysis, etc. A pattern discovery problem may be formulated as follows. Let D={d1, d2, . . . , dm} be a set of literals. A pattern P is a subset of items in D. A pattern consisting of k items is usually referred to as a k-pattern. Given two patterns P1 and P2, P1 is called a sub-pattern of P2 if P1 can be generated by removing some item(s) from P2. In such a case, P2 is called a super-pattern of P1. For example, d1d2 is a sub-pattern of d1d2d3. The sub-pattern/super-pattern relationship defines a lattice among all patterns. Given a data set T and a pattern P, let s(P) denote the significance of P in T. The goal is to discover all patterns {P|s(P)xe2x89xa7min_sig}, where min_sig is a user-specified threshold. A pattern is also referred to as a significant pattern P if s(P)xe2x89xa7min_sig. Otherwise, it is called an insignificant pattern. Serious challenges are posed to the design of a mining algorithm because the data set is typically very large (i.e., only a small fraction of the entire data set can be held in memory at once) and patterns may be substantially long (i.e., including a large number of items or events). Even with the help of the well-known Apriori property, the traditional level-wise algorithm becomes very slow.
The Apriori property states that the significance of a pattern is always less than or equal to the significance of any of its sub-patterns. This leads to a level-wise iterative evaluation of significance of patterns: at the kth level, all candidate k-patterns are evaluated via a scan of the data and the set of significant k-patterns are identified and used to generate candidate (k+1)-patterns according to the Apriori property. The process ends when no new candidate pattern can be generated. For example, if d1d2, d1d3, d1d4, d1d5, . . . , are significant 2-patterns, then d1d2d3, d1d2d4, d1d3d4, d1d2d5, d1d3d5, d1d4d5, . . . , are candidate 3-patterns in the level-wise search. It is easy to see that k scans of the data are required if a significant pattern may consist of up to is k items.
Some effort has been made to further improve performance of finding significant patterns, especially to address the inefficiency incurred by a relatively large value of k. MAXMINER (available commercially from IBM) introduced a xe2x80x9clook-aheadxe2x80x9d strategy in addition to the traditional level-wise approach. During the generation of candidate (k+1)-patterns from significant k-patterns, some additional candidate patterns of higher levels are also generated and evaluated together with the candidate (k+1)-patterns. In the above example, d1d2d3d4d5 will also be generated as a candidate pattern. Note that if d1d2d3d4d5 is significant, then any of its sub-patterns is significant without further test. Though this approach can reduce the number of scans of the data to some extent, such reduction may be insufficient and not guaranteed in many cases, especially when significant patterns include a large number of items. As a result, the MAXMINER is only suitable to mining patterns that include items in the range of dozens.
Another approach includes the use of sampling. In this approach, a set of samples are first gathered, and the significant patterns in the sample data are computed. F represents the set of significant patterns in the sample data and their immediate super-patterns. The significances of patterns in F are then computed based on the entire dataset and serve as the (advanced) starting position of a level-wise search that eventually identifies all significant patterns. This strategy is efficient if the number of significant patterns that fail to be recognized from the sample is small, which is typically the case under the assumption of a reasonably large sample size and a relatively short pattern length. However, the number of candidate patterns may be considerably large and may not fit into the main memory all at once. In turn, multiple scans of the entire data set may be required.
Therefore, a need exists for a system and method which mines patterns from a large data set. A further need exists for an efficient method which mines data from a large data set only a few scans.
A system and method are provided for discovering significant patterns from a list of records in a dataset. Each record includes a set of items, and each significant pattern includes a subset of items such that a significance of the pattern exceeds a significance level. A significance is computed for each item in the list of records to determine significant items. The records are randomly sampled to select a sample portion of the records. Ambiguous patterns are identified against the sample portion of the records and verified against the entire list of records in the dataset. This provides an efficient method for data mining.
In other embodiments, a significance computation may be performed for each item in a single scan of the list of records. The step of sampling the records may include choosing a sample size according to a size of available memory space for storing sampled records.
The step of identifying ambiguous patterns may include an iterative loop, wherein a kth iteration identifies all ambiguous k-patterns. The method may include the steps of generating candidate k-patterns, computing the significance of each candidate k-pattern in the sample portion and labeling candidate k-patterns as significant, ambiguous, or insignificant according to the significance of each candidate k-pattern in the sample portion.
A Chernoff Bound may be used to label candidate k-patterns. A domain of each candidate k-pattern may be computed from the significance of each involved item from the list of records of the dataset. Ambiguous patterns may be verified against the entire list of records. This may include pruning ambiguous patterns by employing an ordered pruning method. The ambiguous patterns may be pruned to fit together in a same memory space.
The ordered pruning method may include an iterative process, wherein each iteration may include computing a set of halfway patterns in a space of ambiguous patterns, determining significances of the halfway patterns in the entire list of records of the dataset and reducing the space of ambiguous patterns.