Finding frequent patterns in databases is a fundamental operation behind several common data-mining tasks, including association-rule and sequential-pattern mining. An example of association-rule data mining is the Apriori method described by Agrawal et al. in U.S. patent application for "System and Method for Quickly Mining Association Rules In A Database," Ser. No. 08/415,006. Sequential-pattern data mining is described, for example, by Srikant et al. in U.S. Pat. No. 5,742,811 for "Method for Mining Generalized Sequential Patterns In A Large Database." For the most part, frequent-pattern mining methods have been developed to operate on databases in which the longest frequent patterns are relatively short, e.g., those with fewer than 10 items. A prototypical application of frequent-pattern mining is market-basket analysis, where the goal is to identify store items that are frequently purchased together. It is unusual for the frequent patterns in these databases to contain more than 10 items because most people purchase relatively few items at a time, and due to the variety of items available, shopping habits can be quite diverse. However, there is a wealth of data that does not fit the mold of retail data, yet remains ripe for exploration through frequent-pattern mining techniques.
Two recent papers investigated the application of association-rule miners to such data. In the first paper entitled "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. of the 1997 SIGMOD Conf. on the Management of Data, pp. 255-264, Brin et al. applied their association-rule miner to a database compiled from census records. In the second paper, "Brute-Force Mining of High-Confidence Classification Rules," Proc. of the Third Int'l Conf. on Knowledge Discovery and Data Mining, pp. 123-126, Bayardo investigated the use of an association rule miner to mine classification rules from commonly used classification benchmarks in the Irvine Machine Learning Database Repository (available on the World Wide Web at http://www.ics.uci.edu/~mlearn/MLRepository.html).
A common finding of these papers is that previously developed methods for mining associations from retail data are inadequate on complex data sets. Brin et al. had to remove all items from their data set appearing in over 80% of the transactions, and even then could only mine efficiently at high SUPPORT levels. The support of an item is defined as the number of data sequences in the database that contain the item. Bayardo had to apply several additional pruning strategies beyond those in typical association rule miners, several of which rendered the search incomplete. The difficulty of these data sets results from a long average record length (more "items" per "transaction" than in retail sales data) and the characteristic that many items appear with high frequency. Data sets used for classification tend to have these qualities. Other examples include questionnaire results (people tend to answer similarly to many questions), retail sales data involving complex system configurations or where purchases across a large time window are compiled into a single transaction, and biological data from the fields of DNA and protein analysis.
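By way of illustration only, the notion of support defined above may be sketched as follows. The function name `support`, the toy `DATABASE`, and the representation of each record as a set of item labels are assumptions made for this example and do not appear in the cited papers.

```python
from typing import FrozenSet, Iterable, List


def support(pattern: Iterable[str], database: List[FrozenSet[str]]) -> int:
    """Count the records that contain every item of the pattern."""
    items = frozenset(pattern)
    return sum(1 for record in database if items <= record)


# Hypothetical toy database: each record is one "transaction".
DATABASE = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
```

With this toy data, `support(["a"], DATABASE)` counts three records and `support(["a", "b", "c"], DATABASE)` counts one. When many items appear in most records, as in the data sets discussed above, long patterns retain high support and far more of them pass a given threshold.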
Almost every recently proposed method for mining frequent patterns is a variant of the Apriori method. When the size of frequent patterns is small, Apriori uses a bottom-up search-space pruning strategy that is very effective. The strategy exploits the fact that a pattern can be frequent only if every one of its sub-patterns is frequent. A pattern is called FREQUENT if the items in the pattern appear together in the database with a defined regularity (or a minimum support specified by the user). By considering patterns only when their sub-patterns have been determined to be frequent, the number of patterns that turn up infrequent, yet are still "checked" against the database, is kept to a minimum. Unfortunately, this approach is fundamentally intractable for mining long patterns simply because the number of sub-patterns of a pattern grows exponentially with pattern length. For example, the number of patterns that must be considered to generate a length l association rule is 2^l.
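The bottom-up strategy described above may be sketched, under simplifying assumptions, as a level-wise search in which a candidate of length k is counted only if all of its sub-patterns of length k-1 were found frequent. This is a minimal in-memory illustration, not the implementation of Agrawal et al.; the function name `apriori` and the set-of-items record representation are assumptions of the sketch.

```python
from itertools import combinations


def apriori(database, min_support):
    """Return {pattern: support} for every frequent pattern, level by level."""

    def count(candidates):
        # One "database pass": count the records containing each candidate.
        return {c: sum(1 for r in database if c <= r) for c in candidates}

    singletons = {frozenset([item]) for record in database for item in record}
    level = {c: s for c, s in count(singletons).items() if s >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent (k-1)-patterns that have k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if every (k-1)-sub-pattern is frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: s for c, s in count(candidates).items() if s >= min_support}
    return frequent


# Hypothetical toy data in which the longest frequent pattern is {a, b, c}.
DATABASE = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
frequent = apriori(DATABASE, min_support=2)
```

On this toy data the search reports seven frequent patterns, that is, 2^3 less the empty pattern: every non-empty sub-pattern of {a, b, c} had to be counted before {a, b, c} itself was reached. The same bookkeeping for a frequent pattern of length 30 would involve more than a billion sub-patterns.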
There are many variants of Apriori that differ primarily in the manner by which patterns are checked against the database. Apriori in its purest form checks patterns of length l during database pass l. Brin et al.'s method is more eager and begins checking a pattern for frequency shortly after all its subsets have been determined frequent, rather than waiting until the database pass completes. In a paper entitled "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. of the 21st Conf. on Very Large Data-Bases, pp. 432-444, Savasere et al. describe a method that identifies all frequent patterns in memory-sized partitions of the database, and then checks these against the entire database during a final pass. Brin et al.'s method considers the same number of "candidate" patterns as Apriori, and Savasere et al.'s method can consider more (but never fewer) candidate patterns than Apriori, potentially exacerbating problems associated with long frequent patterns.
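The partition-based approach attributed above to Savasere et al. may be sketched roughly as follows. The sketch assumes that the minimum support is given as a fraction of the records, and it uses a brute-force local miner as a stand-in for whatever in-memory method is actually applied to each partition; the names `frequent_patterns` and `partition_mine` are hypothetical.

```python
from itertools import combinations
from math import ceil


def frequent_patterns(records, min_count):
    """Brute-force stand-in for a local miner over one in-memory partition."""
    items = sorted({item for record in records for item in record})
    result = set()
    for k in range(1, len(items) + 1):
        found = {
            frozenset(c) for c in combinations(items, k)
            if sum(1 for r in records if set(c) <= r) >= min_count
        }
        if not found:
            break  # no frequent k-patterns, so no longer ones either
        result |= found
    return result


def partition_mine(database, min_fraction, partition_size):
    """Return (candidates, {pattern: support}) for the two-phase scheme."""
    # Phase 1: collect the patterns that are frequent within some partition.
    candidates = set()
    for start in range(0, len(database), partition_size):
        part = database[start:start + partition_size]
        candidates |= frequent_patterns(part, ceil(min_fraction * len(part)))
    # Phase 2: a single pass over the whole database checks every candidate.
    threshold = ceil(min_fraction * len(database))
    counts = {c: sum(1 for r in database if c <= r) for c in candidates}
    return candidates, {c: s for c, s in counts.items() if s >= threshold}
```

A pattern that is frequent in the whole database must be frequent in at least one partition, so the final pass misses nothing. A pattern may, however, be frequent in one partition without being frequent overall, which is why the candidate set can be larger than the set that Apriori would examine.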
Still another variant of Apriori, described by Park et al. in "An Effective Hash Based Algorithm for Mining Association Rules," Proc. of the 1995 SIGMOD Conf. on the Management of Data, pp. 175-186, enhances it with a hashing scheme that can identify (and thereby eliminate from consideration) some candidates that will turn up infrequent if checked against the database. It also uses the hashing scheme to re-write a smaller database after each pass in order to reduce the overhead of subsequent passes. Still, like Apriori, it checks every sub-pattern of a frequent pattern.
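The candidate-filtering idea described above may be illustrated, in simplified form, for patterns of length two. While single items are counted, every pair of items in each record is also hashed into a small table of bucket counters; a candidate pair whose bucket total falls below the minimum support cannot be frequent and is dropped without being counted. The hash function, the bucket count, and the names below are assumptions of this sketch rather than details taken from Park et al., and the database re-writing step is not shown.

```python
from itertools import combinations
from zlib import crc32


def bucket_of(pair, num_buckets):
    """Deterministic bucket index for a pair of string item labels."""
    key = ",".join(sorted(pair)).encode()
    return crc32(key) % num_buckets


def hash_filtered_pairs(database, min_support, num_buckets=8):
    """Return (apriori_candidates, kept_candidates) for length-2 patterns."""
    item_counts = {}
    buckets = [0] * num_buckets
    for record in database:
        for item in record:
            item_counts[item] = item_counts.get(item, 0) + 1
        # Hash every pair in the record during the same pass.
        for pair in combinations(sorted(record), 2):
            buckets[bucket_of(pair, num_buckets)] += 1
    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_support)
    apriori_candidates = [frozenset(p) for p in combinations(frequent_items, 2)]
    # A bucket total is an upper bound on the support of each pair hashed to it.
    kept = [
        c for c in apriori_candidates
        if buckets[bucket_of(c, num_buckets)] >= min_support
    ]
    return apriori_candidates, kept
```

Because several pairs can share a bucket, the filter may retain some infrequent candidates, but it never discards a frequent one. It reduces the number of candidates counted at a level without changing the fact that every sub-pattern of a frequent pattern is still visited.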
In a paper entitled "Discovering All Most Specific Sentences by Randomized Algorithms," Proc. of the 6th Int'l Conf. on Database Theory, pp. 215-229, 1997, Gunopulos et al. present a randomized method for identifying maximal frequent patterns in memory-resident databases. The method is purely greedy, iteratively attempting to extend the length of a working pattern without examining all its sub-patterns. A version of the method that is incomplete, in the sense that it provides no guarantee that all maximal frequent patterns will be found, is evaluated and found to efficiently extract long maximal frequent patterns. Unfortunately, it is not clear how this method would be scaled to disk-resident data sets, nor is it clear how the proposed complete version would perform.
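A greedy extension of the kind described above may be sketched as follows. This is a simplified illustration that assumes a memory-resident database and a minimum support no larger than the number of records; it is not the method of Gunopulos et al. as published, and the names `support` and `greedy_maximal` are hypothetical.

```python
import random


def support(pattern, database):
    """Count the records that contain every item of the pattern."""
    return sum(1 for record in database if pattern <= record)


def greedy_maximal(database, min_support, rng):
    """Grow one maximal frequent pattern by trying items in random order."""
    items = sorted({item for record in database for item in record})
    rng.shuffle(items)
    pattern = frozenset()
    for item in items:
        extended = pattern | {item}
        # Keep the item only if the longer pattern is still frequent.
        if support(extended, database) >= min_support:
            pattern = extended
    return pattern


# Hypothetical toy data with two maximal frequent patterns at support 2.
DATABASE = [{"a", "b", "c"}, {"a", "b", "c"}, {"d", "e"}, {"d", "e"}]
found = {greedy_maximal(DATABASE, 2, random.Random(seed)) for seed in range(20)}
```

Each call performs one support check per item and never enumerates the sub-patterns of the pattern it returns, and an item rejected once cannot become admissible later, so the result is always maximal. Which maximal pattern is returned depends on the random order, and repeated runs offer no guarantee of finding all of them, which is the incompleteness noted above.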
Zaki et al. present two methods for identifying maximal frequent patterns, namely, MaxEclat and MaxClique, in the paper entitled "New Algorithms for Fast Discovery of Association Rules," Proc. of the Third Int'l Conf. on Knowledge Discovery in Databases and Data Mining, pp. 283-286. These methods attempt to look ahead and identify long patterns early on to help prune the space of patterns considered. Though MaxEclat and MaxClique are demonstrated to be advantageous on random data with short maximal patterns, they are still prone to performance problems on data sets with long patterns. Both MaxEclat and MaxClique identify coarse clusters of potentially frequent patterns early on in the search. Due to cluster inaccuracies, they identify only a single maximal pattern per cluster, and afterwards employ a purely bottom-up approach. Though the set of candidates considered can be reduced, both methods still scale exponentially with pattern length because of the Apriori-like bottom-up phase. The cluster identification phase of MaxEclat also scales exponentially with pattern length, since it uses a dynamic programming algorithm for finding maximal cliques in a graph whose largest clique is at least as large as the length of the longest pattern.
Therefore, there is still a need for a method for efficiently mining long patterns from databases that is orders of magnitude faster at mining long maximal patterns than Apriori-like algorithms, that scales roughly linearly in the number of maximal patterns rather than exponentially in the length of the longest patterns, and in which the number of database passes remains bounded by the length of the longest pattern.