This invention relates to data processing and, in particular, to a method and system for mining frequent patterns from databases. The invention has application in many fields including business decision making, marketing, customer relationship management, medical and biological research, and the like. The field in which this invention falls is variously described as “frequent pattern mining”, “association mining”, and “frequent sets mining”.
Finding frequent patterns in databases is gaining importance as a way to obtain valuable business information. For example, a merchant who maintains a database containing records of transactions might be interested in determining any patterns in the purchasing habits of the merchant's customers. The merchant may wish to answer questions such as “what pairs of items are typically purchased by consumers at the same time?”. A scientist studying a genome may have a database containing records of gene sequences and may wish to know whether certain sequences tend to occur together in the same stretch of DNA. A researcher studying responses given in a census, survey or consumer questionnaire may wish to identify patterns in the responses. Data mining methods can be applied to these problems and to detecting other types of correlation. Data mining is the field of deriving information about patterns expressed in large collections of information.
Various data mining methods are known. For example, Agrawal et al., U.S. Pat. No. 5,794,209 describes a method for discovering consumer purchasing tendencies. The method is implemented in a computer program which identifies consumer transaction itemsets that are stored in a database and which appear in the database a user-defined minimum number of times.
Agrawal's method belongs to a class of data mining methods called apriori methods. Apriori methods for identifying frequent patterns begin by scanning a database of itemsets, each comprising a number of items, and identifying frequent items. The methods then generate candidates for frequent patterns by taking the frequent items together in all possible pairs. The database is then scanned to determine the frequency of each candidate pair. Once frequent pairs have been identified, candidates for frequent itemsets of three items each can be generated by taking each frequent pair together with another frequent item. The methods then scan the database to determine which of the candidates for frequent triplets actually occur frequently. The methods can proceed iteratively to identify frequent patterns of any length.
Apriori methods take advantage of the idea that any subset of a frequent itemset must itself be frequent. No frequent itemset can include any items which are not themselves frequent in the database. Apriori-like methods use this idea to prune candidate sets. This dramatically reduces the number of candidate sets that must be checked for frequency. In essence, apriori methods use a known collection of itemsets which are frequent and have (k−1) items to generate candidates for frequent itemsets having k items. Database scanning and pattern matching are used to collect counts for the candidate itemsets.
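The candidate-generation and pruning steps described above can be illustrated with a minimal sketch. The function name `apriori`, the representation of records as Python sets, and the `min_support` threshold (a minimum number of records) are assumptions made for illustration only; this is not the implementation of the cited patent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support records."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Scan the database once more to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

example = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = apriori(example, min_support=3)
```

In this small example every single item and every pair is frequent, while the triple {a, b, c} occurs in only two records and is rejected after the third scan. Each pass of the loop requires a full scan of the records.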
Apriori-like methods all have the significant disadvantage that they are much slower to execute than is desirable. The time expended by such methods is largely occupied by scanning the database. The candidate sets can be extremely large. For example, in a case where a database contains 10⁴ frequent items, an apriori-like method will generate roughly 10⁷ candidate itemsets of length 2. The number of candidate itemsets becomes unmanageable in cases where long patterns are being searched for. For example, to discover a frequent pattern of size 100, one needs to generate 2¹⁰⁰ ≈ 10³⁰ candidates. A large database may contain many gigabytes of data. Scanning a large set of candidates against a large database takes a significant amount of time, even with modern computer hardware running optimized software. Finding long frequent patterns in large databases with apriori-like methods is impractical.
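The orders of magnitude quoted above can be checked with a few lines of arithmetic. The variable names are illustrative only. The number of unordered pairs of 10⁴ items is about 5×10⁷, which is of the order of the figure given above.

```python
from math import comb

n_frequent_items = 10**4

# Candidate 2-itemsets: all unordered pairs of frequent items.
candidate_pairs = comb(n_frequent_items, 2)   # 49,995,000

# Every sub-pattern of a frequent pattern of size 100 is itself frequent,
# so on the order of 2**100 candidates must be generated.
subpatterns_of_100 = 2**100                   # about 1.27e30
```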
Various techniques have been used to prune candidate sets to make it practical to search for long frequent patterns in databases. However, such techniques are still slow, especially in cases with large databases in which there is a reasonably large number of both long and short frequent patterns.
There is a need for methods and systems for quickly identifying frequent patterns in large databases.
This invention provides methods and apparatus for mining frequent patterns from databases. The invention has particular application when applied to large databases.
One aspect of the invention provides a method for identifying patterns from a database of records. Each record has a plurality of items. The method comprises constructing an FP-tree for the database; and, mining the FP-tree to obtain frequent patterns. In preferred embodiments of the invention, constructing the FP-tree comprises: scanning the database to obtain an ordered list of frequent items in the database; and, then, for each record in the database: creating a list of any frequent items occurring in that record in the same order as the frequent items occur in the ordered list; setting a root node of the FP-tree as a current node; and, for each item in the list of any frequent items, determining whether there is a node directly linked to the current node which corresponds to the item. If there is a node directly linked to the current node which corresponds to the item, the method increments a counter for the node and sets the node as the current node. Otherwise the method creates a node corresponding to the item and linked to the current node and sets the created node as the current node. Preferably the frequent items in the ordered list are ordered in order of their frequency in the database.
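The construction steps just described can be sketched as follows. The names `FPNode` and `build_fp_tree`, the `min_support` threshold, and the alphabetical tie-break between equally frequent items are assumptions introduced for illustration. The sketch starts each new node with a count of zero and increments it immediately, so that a newly created node has a count of one.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode directly linked below this node

def build_fp_tree(records, min_support):
    # First scan: count each item once per record.
    counts = Counter(item for record in records for item in set(record))
    # Ordered list of frequent items, most frequent first (ties broken by item).
    ordered = sorted((i for i, c in counts.items() if c >= min_support),
                     key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    root = FPNode(None)
    # Second scan: insert each record's frequent items in list order.
    for record in records:
        items = sorted((i for i in set(record) if i in rank), key=rank.__getitem__)
        current = root
        for item in items:
            child = current.children.get(item)
            if child is None:                 # no matching node: create one
                child = FPNode(item, current)
                current.children[item] = child
            child.count += 1                  # count this record on the path
            current = child
    return root, ordered

records = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
           list("bcksp"), list("afcelpmn")]
root, ordered = build_fp_tree(records, min_support=3)
```

Records that share a prefix of frequent items share the corresponding nodes, which is why the tree is usually much smaller than the database it summarizes. Only two scans of the database are needed.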
The FP-tree preferably comprises a header data structure which includes a record for each of the frequent items in the database.
Mining the FP-tree to obtain frequent patterns preferably comprises: for each frequent item constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base; recursively constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base on each newly created conditional FP-tree until the resulting FP-tree is empty; and, after creating each FP-tree, collecting frequent itemsets from the FP-tree. Preferably the method includes determining whether a conditional FP-tree contains only one path and, if so, generating all combinations of sub-paths of the FP-tree and recording each such combination as a frequent pattern.
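A compact sketch of this recursive mining procedure follows. It is one possible reading of the steps above, not a definitive implementation. The names `Node`, `build_tree`, `single_path` and `fp_growth` are hypothetical, records are passed as (items, count) pairs so that a conditional pattern-base can be fed back into the same tree builder, and the header is simplified to a dictionary mapping each item to a list of its nodes.

```python
from collections import Counter, defaultdict
from itertools import combinations

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(weighted_records, min_support):
    """Build a tree from (items, count) pairs; header maps item -> its nodes."""
    counts = Counter()
    for items, n in weighted_records:
        for item in set(items):
            counts[item] += n
    frequent = sorted((i for i in counts if counts[i] >= min_support),
                      key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(frequent)}
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted_records:
        current = root
        for item in sorted((i for i in set(items) if i in rank), key=rank.__getitem__):
            child = current.children.get(item)
            if child is None:
                child = current.children[item] = Node(item, current)
                header[item].append(child)
            child.count += n
            current = child
    return root, header

def single_path(root):
    """Return the nodes of the tree if it is one path (possibly empty), else None."""
    path, node = [], root
    while node.children:
        if len(node.children) > 1:
            return None
        node = next(iter(node.children.values()))
        path.append(node)
    return path

def fp_growth(weighted_records, min_support, suffix=frozenset()):
    """Return {frozenset(pattern): support} for all frequent patterns."""
    root, header = build_tree(weighted_records, min_support)
    patterns = {}
    path = single_path(root)
    if path is not None:
        # One path: every combination of its nodes is a frequent pattern whose
        # support is the smallest count among the chosen nodes.
        for k in range(1, len(path) + 1):
            for combo in combinations(path, k):
                patterns[suffix | {n.item for n in combo}] = min(n.count for n in combo)
        return patterns
    for item, nodes in header.items():
        new_suffix = suffix | {item}
        patterns[new_suffix] = sum(n.count for n in nodes)
        # Conditional pattern-base: the prefix path above each node for this
        # item, weighted by that node's count.
        base = []
        for n in nodes:
            prefix, p = [], n.parent
            while p.item is not None:
                prefix.append(p.item)
                p = p.parent
            if prefix:
                base.append((prefix, n.count))
        # Recurse on the conditional FP-tree; an empty tree ends the recursion.
        patterns.update(fp_growth(base, min_support, new_suffix))
    return patterns

records = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
           list("bcksp"), list("afcelpmn")]
patterns = fp_growth([(r, 1) for r in records], min_support=3)
```

No candidate itemsets are generated and tested against the database in this sketch. Each recursive call works only on the prefix paths of one item, so the data examined shrinks as the patterns grow longer.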
The invention also provides a method for constructing an FP-tree corresponding to a database and containing information useful for identifying frequent patterns in the database. The method comprises scanning the database to obtain an ordered list of frequent items in the database. Then, for each record in the database, the method: creates a list of any frequent items occurring in that record in the same order as the frequent items occur in the ordered list; sets a root node of the FP-tree as a current node; and, for each item in the list of any frequent items, determines whether there is a node directly linked to the current node which corresponds to the item. If so, the method increments a counter for the node and sets the node as the current node. If not, the method creates a node corresponding to the item and linked to the current node and sets the created node as the current node.
Another aspect of the invention provides a method for identifying patterns from a database of records. Each record has a plurality of items. The method comprises providing an FP-tree corresponding to the database and mining the FP-tree to obtain frequent patterns. Mining the FP-tree comprises, for each frequent item constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base; recursively constructing a conditional pattern-base, and constructing a conditional FP-tree from the conditional pattern-base on each newly created conditional FP-tree until the resulting FP-tree is empty; and, after creating each FP-tree, collecting frequent itemsets from the FP-tree.
Another aspect of the invention provides an FP-tree data structure for use in mining frequent patterns from a database. The database contains a plurality of records. The FP-tree data structure comprises a root, a plurality of nodes linked to the root, each node associated with a frequent item from the database, the nodes linked to form a plurality of paths, the paths each corresponding to an itemset in a record of the database. Preferably the FP-tree data structure comprises a header structure. The header structure comprises an ordered list of frequent items in the database and a pointer to a node in the data structure associated with each of the frequent items.
A further aspect of the invention comprises an FP-tree data structure for use in mining frequent patterns from a database containing a plurality of records. The FP-tree data structure is resident in a storage device accessible to a computer and comprises a plurality of linked nodes and a header structure. The header structure comprises an ordered list of frequent items from a database and a pointer to at least one node associated with each of the frequent items. Each of the linked nodes is associated with one of the frequent items of the header structure. The nodes are linked to form a plurality of paths. The nodes on each of the paths correspond to frequent items present in a record of the database. Nodes associated with a selected one of the frequent items of the header structure are accessible by traversing the nodes beginning at the pointer in the header structure corresponding to the selected one of the frequent items. Preferably each of the nodes comprises a pointer capable of identifying another one of the nodes associated with the same one of the frequent items and traversing the nodes comprises sequentially following the pointers in the nodes discovered by beginning at the pointer in the header structure corresponding to the selected one of the frequent items.
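The header structure and the per-node pointers described above might be sketched as follows. The names `Node`, `node_link`, `insert` and `nodes_for`, and the auxiliary `tails` dictionary used to append to each chain, are assumptions for illustration. The records in the usage lines are assumed to have already been reduced to their frequent items and sorted in header order.

```python
class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.node_link = None     # next node associated with the same item

def insert(root, header, tails, ordered_items):
    """Insert one record and thread any new nodes onto their item's chain."""
    current = root
    for item in ordered_items:
        child = current.children.get(item)
        if child is None:
            child = Node(item, current)
            current.children[item] = child
            if header[item] is None:
                header[item] = child              # header points at the first node
            else:
                tails[item].node_link = child     # extend the chain
            tails[item] = child
        child.count += 1
        current = child

def nodes_for(header, item):
    """Yield every node for item by following pointers from the header."""
    node = header[item]
    while node is not None:
        yield node
        node = node.node_link

ordered = ["c", "f", "a", "b", "m", "p"]          # frequent items, most frequent first
header = {item: None for item in ordered}         # dicts keep insertion order
tails, root = {}, Node(None)
for rec in (["c", "f", "a", "m", "p"], ["c", "f", "a", "b", "m"],
            ["f", "b"], ["c", "b", "p"], ["c", "f", "a", "m", "p"]):
    insert(root, header, tails, rec)

support_of_b = sum(n.count for n in nodes_for(header, "b"))
```

Following the chain for an item visits every path in which that item occurs without walking the whole tree. Summing the counts along the chain gives the number of records containing the item.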
A further aspect of the invention provides apparatus for mining frequent patterns from information in a database comprising a plurality of records. The apparatus comprises: a computer processor; a database having records accessible to the computer processor; a program store accessible to the computer processor; a data store accessible to the computer processor; and software instructions recorded in the program store. The software instructions are executable by the computer processor. The software instructions, when executed, cause the computer processor to: scan the database to obtain a list of frequent items in the database; create an ordered list of the frequent items ordered in order of frequency in the database; and, based on the ordered list of the frequent items and information in the database, create an FP-tree data structure corresponding to the database in the data store.
Still further aspects of the invention provide a program product comprising a medium carrying a set of computer-readable signals containing computer-executable instructions which, when run by a computer, cause the computer to execute a method of the invention.
Further features and advantages of the invention are described below.