1. Field of the Invention
The present invention relates in general to the field of database analysis. In one aspect, the present invention relates to a system and method for data mining operations for identifying association rules contained in database records.
2. Description of the Related Art
The ability of modern computers to assemble, record and analyze enormous amounts of data has created a field of database analysis referred to as data mining. Data mining is used to discover association relationships in a database by identifying frequently occurring patterns in the database. These association relationships or rules may be applied to extract useful information from large databases in a variety of fields, including selective marketing, market analysis and management applications (such as target marketing, customer relation management, market basket analysis, cross selling, market segmentation), risk analysis and management applications (such as forecasting, customer retention, improved underwriting, quality control, competitive analysis), fraud detection and management applications and other applications (such as text mining (news group, email, documents), stream data mining, web mining, DNA data analysis, etc.). Association rules have been applied to model and emulate consumer purchasing activities by describing how often items are purchased together. Typically, a rule consists of two conditions (e.g., antecedent and consequent) and is denoted as A→C where A is the antecedent and C is the consequent. For example, an association rule, “laptop→speaker (80%),” states that four out of five customers that bought a laptop computer also bought speakers.
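As an illustration only, the 80% figure in the laptop and speaker example can be read as the confidence of the rule, i.e., the fraction of transactions containing the antecedent that also contain the consequent. The following minimal sketch assumes transactions are represented as sets of item names; the function names and the sample data are hypothetical and are not taken from any particular system.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a if n_a else 0.0


# Hypothetical data: five laptop purchases, four of which include speakers,
# plus one purchase of speakers alone.
transactions = [
    frozenset({"laptop", "speaker"}),
    frozenset({"laptop", "speaker"}),
    frozenset({"laptop", "speaker"}),
    frozenset({"laptop", "speaker"}),
    frozenset({"laptop"}),
    frozenset({"speaker"}),
]

rule_confidence = confidence({"laptop"}, {"speaker"}, transactions)
rule_support = support({"laptop", "speaker"}, transactions)
```

With this sample data, four of the five laptop transactions also contain speakers, so the computed confidence corresponds to the 80% in the example rule.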
The first step in generating association rules is to review a database of transactions to identify meaningful patterns (referred to as frequent patterns, frequent sets or frequent itemsets) in a transaction database, such as significant purchase patterns that appear as common patterns recurring among a plurality of customers. Typically, this is done by using constraint thresholds such as support and confidence parameters, or other guides to the data mining process. These guides are used to discover frequent patterns, i.e., all itemsets that have transaction support above a pre-determined minimum support S and confidence C threshold. Various techniques have been proposed to assist with identifying frequent patterns in transaction databases, including using “Apriori” algorithms to generate and test candidate sets, such as described by R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD Int'l Conf. on Management of Data, pp. 207-216 (1993). However, candidate set generation is costly in terms of computational resources consumed, especially when there are prolific patterns or long patterns in the database and when multiple passes through potentially large candidate sets are required. Other techniques (such as described by J. Han et al., “Mining Frequent Patterns Without Candidate Generation,” Proceedings of ACM SIGMOD Int'l Conf. on Management of Data, pp. 1-12 (2000)) attempt to overcome these limitations by using a frequent pattern tree (FPTree) data structure to mine frequent patterns without candidate set generation (a process referred to as FPGrowth). With the FPGrowth approach, frequent pattern information is stored in a compact memory structure.
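The generate-and-test idea behind Apriori-style approaches can be sketched as a level-wise search, shown below. This is a simplified illustration and not the algorithm as published: the function name, the use of an absolute count threshold (`min_count`), and the choice to treat a count equal to the threshold as frequent are all assumptions made for the example. Note that each level requires another pass over the transactions, which is the cost discussed above.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def frequent_itemsets(transactions: Iterable[Iterable[str]],
                      min_count: int) -> Dict[FrozenSet[str], int]:
    """Level-wise candidate generation and testing.

    Returns a mapping from each frequent itemset to its support count.
    `min_count` is a hypothetical absolute threshold (count >= min_count).
    """
    db: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item seen in the database.
    current = {frozenset([item]) for t in db for item in t}
    frequent: Dict[FrozenSet[str], int] = {}
    k = 1
    while current:
        # Test step: one pass over the database to count each candidate.
        counts = {c: sum(1 for t in db if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Generate step: join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that have any infrequent (k-1)-item subset.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transaction data for illustration.
sample = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
result = frequent_itemsets(sample, min_count=3)
```

In this sample, each single item and each pair meets the threshold, while the candidate {A, B, C} is generated and counted but then rejected because it appears in only two transactions.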
Once the frequent sets are identified, the association rules are generated by constructing the power set (set of all subsets) of the identified frequent sets, and then generating rules from each of the elements of the power set. For each rule, its meaningfulness (i.e., support, confidence, lift, etc.) is calculated and examined to see if it meets the required thresholds. For example, if a frequent pattern {A, B, C} is extracted (meaning that this set occurs more frequently than the minimum support S threshold in the set of transactions), then several rules can be generated from this set:
{A}→{B, C}
{B}→{A, C}
{C}→{A, B}
{A, B}→{C}
etc.
where a rule A→B indicates that “Product A is often purchased together with Product B,” meaning that there is an association between the sales of Products A and B. Such rules can be useful for decisions concerning product pricing, product placement, promotions, store layout and many other decisions.
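The enumeration of candidate rules from a single frequent set can be sketched as follows. This is a minimal illustration under the assumption that each rule splits the frequent set into a non-empty antecedent and a non-empty consequent; the function name is hypothetical, and threshold checks (support, confidence, lift) are omitted.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator, Tuple


def rules_from_itemset(
    itemset: Iterable[str],
) -> Iterator[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Yield every (antecedent, consequent) split of `itemset` into two
    non-empty, disjoint parts."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = frozenset(items) - frozenset(antecedent)
            yield frozenset(antecedent), consequent


rules = list(rules_from_itemset({"A", "B", "C"}))
for antecedent, consequent in rules:
    print(sorted(antecedent), "->", sorted(consequent))
```

A frequent set of n items yields 2^n - 2 such splits (6 for three items, 1,022 for ten items), which suggests why rule generation grows quickly as frequent sets become larger.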
Conventional data mining approaches use generic item descriptions, such as the SKU (stock keeping unit), when identifying items or products in a transaction database. When these generic descriptions are used to identify frequent sets, the frequent sets are not large and power-set/rule generation is tractable. However, conventional data mining techniques using item data at the SKU level do not provide sufficient information to develop meaningful association rules for complex products. For example, if there are three transactions involving the purchase of a computer identified as “Desktop-SKU” with one of the transactions also involving the purchase of DVD disks, the product level of description used to identify the computer does not reveal that two of the computers did not include DVD drives, while the third computer (which was purchased with the DVD disks) did include a DVD drive. As this example demonstrates, this lack of granularity in the item description diminishes the quality of association rules that can be generated, resulting in limited pattern correlation.
During the generation of association rules from frequent sets (for example, with algorithms such as FPGrowth), the number of generated rules (and the processing time required to generate the rules) can become intractable as the number of frequent sets increases, often resulting in redundant rules being generated. One example of rule redundancy is rule subsumption: a first rule R1 subsumes a second rule R2 whenever the consequents of R1 are a superset of the consequents of R2 (anything concluded by R2 is also concluded by R1), and the antecedents of R1 are satisfied in any context in which the antecedents of R2 are satisfied (the antecedents of R1 are more general than the antecedents of R2). For example, with rules R1 and R2 (where R1: A→C,D, and R2: A,B→C,D), R1 subsumes R2. Other examples of rule redundancy include rules that provide trivial associations and rules with redundant antecedents. Conventional approaches for removing redundancy have not been effective. For example, when R1 subsumes R2, conventional association rule generation approaches (such as FPGrowth) would discard R2 if and only if the confidence of R1 is greater than or equal to the confidence of R2. This confidence condition is rarely met, as more general rules tend to have lower confidence. An article by Bayardo et al., entitled “Constraint-Based Rule Mining in Large, Dense Databases,” Proc. of the 15th Int'l Conf. on Data Engineering (1999), discusses a simple technique for applying rule subsumption when the subsumed rule has higher confidence but that higher confidence does not meet an absolute minimum improvement threshold; however, the technique is inflexibly applied.
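The subsumption test and the conventional confidence condition described above can be sketched as follows. This is an illustrative reading of the definitions in this section, assuming that antecedents and consequents are represented as sets of items; the `Rule` type, the function names and the confidence values are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float


def subsumes(r1: Rule, r2: Rule) -> bool:
    """R1 subsumes R2 when R1 concludes at least everything R2 concludes
    and R1's antecedent is no more specific than R2's antecedent."""
    return r1.consequent >= r2.consequent and r1.antecedent <= r2.antecedent


def conventionally_discarded(r1: Rule, r2: Rule) -> bool:
    """Conventional condition: discard R2 only when R1 subsumes it and
    R1's confidence is at least as high as R2's."""
    return subsumes(r1, r2) and r1.confidence >= r2.confidence


# R1: A -> C,D and R2: A,B -> C,D, with made-up confidence values in which
# the more general rule R1 has the lower confidence.
r1 = Rule(frozenset({"A"}), frozenset({"C", "D"}), confidence=0.60)
r2 = Rule(frozenset({"A", "B"}), frozenset({"C", "D"}), confidence=0.90)

r1_subsumes_r2 = subsumes(r1, r2)
r2_discarded = conventionally_discarded(r1, r2)
```

With these hypothetical values, R1 subsumes R2 but R2 is retained, because the more general rule has the lower confidence; this is the situation in which the conventional condition fails to remove the redundant rule.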
As seen from the conventional approaches, a need exists for methods and/or apparatuses for improving the extraction of frequent patterns for use in data mining. There is also a need for finer granularity in the generation of frequent sets to better discover meaningful patterns without imposing the cost of a combinatorial explosion of the data that must be examined. In addition, there is a need for methods and/or apparatuses for efficiently generating association rules without requiring unwieldy candidate set generation, without requiring multiple database passes and without requiring additional time to generate association rules as the frequent set grows. Moreover, there is a need for an improved method and system for removing redundant association rules that allow beneficial general rules to be retained without unduly increasing the size of the generated rule set. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.