1. Field of the Invention
This invention relates generally to data processing, and more particularly to "computer database mining" in which generalized association rules between significant transactions that are recorded in a database are discovered. In particular, the invention concerns the identification (i.e., mining) and classification of association rules from a large database.
2. Description of the Related Art
Customer purchasing habits can provide invaluable marketing information for a wide variety of applications. For example, retailers can create more effective store displays and control inventory more effectively than would otherwise be possible if they know that, given a consumer's purchase of a first set of items (a first itemset), the same consumer can be expected, with some degree of probability, to purchase a particular second set of items (a second itemset) along with the first set of items. In other words, it is helpful from a marketing standpoint to know that an association exists between the first itemset and the second itemset (the association rule) in a transaction. For example, it would be desirable for a retailer of automotive parts and supplies to be aware of an association rule expressing the fact that 90% of the consumers who purchase automobile batteries and battery cables (the first itemset) also purchase battery post brushes and battery post cleansers (the second itemset, referred to as the "consequent" in the terminology of the invention).
Advertisers too may benefit from a thorough knowledge of such consumer purchasing tendencies, since they may change their advertising based upon the information mined from the database. In addition, catalog companies may be able to conduct more effective mass mailings if they know the tendencies of consumers to purchase particular sets of items with other sets of items. It is understood, however, that although this discussion focuses on the marketing applications of the invention, database mining and, hence, the principles of the invention, are useful in many other areas such as business or science, for example.
Until recently, building large detailed databases that could chronicle thousands or even millions of transactions was impractical. In addition, the derivation of useful information from these large databases (i.e., mining the databases) was highly impractical because the large amount of data in the database required an enormous amount of computer processing time to analyze. Consequently, in the past, marketing and advertising strategies have been based upon anecdotal evidence of purchasing habits, if any at all, and thus have been susceptible to inefficiencies in consumer targeting that have been difficult if not impossible to overcome.
Modern technology, such as larger, faster storage systems and faster microprocessors, has permitted the building of large databases of consumer transactions. In addition, the bar-code reader may almost instantaneously read so-called basket data (i.e., when a particular item from a particular lot was purchased by a consumer, how many items the consumer purchased, and so on) so that the basket data may be stored. Furthermore, when the purchase is made with, for example, a credit card, the identity of the purchaser is also known and may be recorded along with the basket data.
As described above, however, building a transactions database is only part of the marketing challenge. Another important part of the marketing challenge is mining the database for useful information, such as the association rules. Database mining, however, becomes problematic as the size of the database expands into the gigabyte or terabyte range.
Not surprisingly, many methods have been developed for mining these large databases. The problem of mining association rules from large databases was first introduced in 1993 at the ACM SIGMOD Conference on Management of Data in a paper entitled "Mining Association Rules Between Sets of Items in Large Databases" by Rakesh Agrawal, Tomasz Imielinski and Arun Swami. In general, the input from which association rules are mined consists of a set of transactions, where each transaction contains a set of literals (i.e., items). An example of an association rule is that 30% of the transactions in a particular database that contain beer and potato chips also contain diapers, and that 2% of all transactions contain all of these items. In this example, 30% is the confidence of the association rule and 2% is the support of the rule. The problem is to find all of the association rules that satisfy user-specified minimum support and confidence constraints. As described above, this mining of association rules may be useful, for example, in such applications as market basket analysis, cross-marketing, catalog design, loss-leader analysis, fraud detection, health insurance, medical research and telecommunications diagnosis.
To better understand the context of the invention, a brief overview of typical association rules and their derivation is now provided. First, let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals called items. Let $D$ be a set of transactions, where each transaction, $T$, is a set of items such that $T \subseteq I$. A transaction, $T$, therefore contains a set $A$ of some items in $I$ if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subseteq I$, $B \subseteq I$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds true in the transaction set $D$ with a confidence, $c$, if $c\%$ of the transactions in $D$ that contain $A$ also contain $B$ (i.e., the confidence is the conditional probability $p(B \mid A)$). The rule $A \Rightarrow B$ has support, $s$, in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $A \cup B$ (i.e., the support is the probability of the intersection of the events). Given a set of transactions, $D$, the computational task of mining association rules is to generate all association rules that have a support value and a confidence value greater than a user-specified minimum support value and minimum confidence value.
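The definitions of support and confidence above can be made concrete with a minimal Python sketch. The function names `support` and `confidence` and the toy basket data are assumptions introduced purely for illustration; they do not come from the cited paper or from the invention.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate p(B | A) as support(A union B) / support(A)."""
    a, b = frozenset(antecedent), frozenset(consequent)
    if a & b:
        raise ValueError("antecedent and consequent must be disjoint")
    support_a = support(transactions, a)
    if support_a == 0.0:
        return 0.0  # the rule is never applicable in this data
    return support(transactions, a | b) / support_a


# Hypothetical toy database D of five transactions.
transactions = [
    frozenset({"beer", "chips", "diapers"}),
    frozenset({"beer", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"milk"}),
    frozenset({"beer", "chips", "diapers", "milk"}),
]

rule_support = support(transactions, {"beer", "chips", "diapers"})
rule_confidence = confidence(transactions, {"beer", "chips"}, {"diapers"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, two of the five transactions contain all three items and three contain both beer and chips, so the rule {beer, chips} => {diapers} has a support of 40% and a confidence of about 67%.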
The task of mining association rules may be decomposed into two steps. First, all of the combinations of items that have a transaction support above the user-defined minimum support are found; these combinations of items are called frequent itemsets. Next, the frequent itemsets are used to generate the desired association rules. In particular, if $ABCD$ and $AB$ are frequent itemsets, then it is possible to determine whether the association rule $AB \Rightarrow CD$ holds by computing the ratio $r = \mathrm{support}(ABCD) / \mathrm{support}(AB)$. The association rule holds only if $r \geq$ the minimum confidence value. The first step of this association rule determination process requires the most computational time and therefore has been the focus of a great number of efforts to develop fast algorithms to discover frequent itemsets.
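The two-step decomposition can be sketched as follows. This is a deliberately naive, brute-force illustration in Python and is not one of the fast algorithms referred to above; the function names, the thresholds and the toy data are all hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def frequent_itemsets(transactions: List[Itemset],
                      min_support: float) -> Dict[Itemset, int]:
    """Step 1: map every frequent itemset to its transaction count."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent: Dict[Itemset, int] = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count / n >= min_support:
                frequent[itemset] = count
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none can be larger
    return frequent


def rules_from(frequent: Dict[Itemset, int], num_transactions: int,
               min_confidence: float
               ) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Step 2: emit (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, count in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[a] is always present.
                r = count / frequent[a]
                if r >= min_confidence:
                    rules.append((a, itemset - a, count / num_transactions, r))
    return rules


# Hypothetical toy database.
transactions = [
    frozenset({"beer", "chips", "diapers"}),
    frozenset({"beer", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"milk"}),
    frozenset({"beer", "chips", "diapers", "milk"}),
]

frequent = frequent_itemsets(transactions, min_support=0.4)
rules = rules_from(frequent, len(transactions), min_confidence=0.7)
for a, b, s, c in rules:
    print(f"{sorted(a)} => {sorted(b)}  support={s:.2f}  confidence={c:.2f}")
```

Even on this tiny example the first step dominates the work, because it scans the transactions once per candidate itemset, whereas the second step only looks up counts that have already been computed.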
The second part of the association rule generation process has received much less attention. In particular, the process of analyzing the generated association rules for statistical significance has received scant attention. Conventional association rule algorithms, as described above, may produce a very large number of output association rules. It is difficult, however, to extract useful information from such a large number of discovered association rules, since they are no easier to review than the original data from which they were derived. The large number of discovered association rules has also raised the question of whether the set of discovered association rules "overfits" the data because all of the possible patterns that satisfy some constraints are generated, which is known as the Bonferroni effect. In other words, the question is whether some of the discovered rules are "false discoveries" that are not statistically significant.
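The concern about false discoveries can be illustrated with simple arithmetic. The sketch below is a generic textbook illustration of the multiple-testing problem and of the standard Bonferroni correction, not the method of the invention; the function names and the figures of 10,000 rules and a 5% significance level are hypothetical.

```python
def expected_false_discoveries(num_hypotheses: int, alpha: float) -> float:
    """Expected number of rules that pass a test at level `alpha` by chance
    alone, assuming none of the tested rules reflects a real association."""
    return num_hypotheses * alpha


def bonferroni_threshold(num_hypotheses: int, alpha: float) -> float:
    """Per-rule significance level that keeps the chance of any false
    discovery at or below `alpha` (standard Bonferroni correction)."""
    if num_hypotheses <= 0:
        raise ValueError("num_hypotheses must be positive")
    return alpha / num_hypotheses


# Hypothetical figures: 10,000 candidate rules, each tested at the 5% level.
num_rules = 10_000
alpha = 0.05
print(expected_false_discoveries(num_rules, alpha))
print(bonferroni_threshold(num_rules, alpha))
```

Under these assumed figures, roughly 500 rules would be expected to look significant purely by chance, which is why a test that ignores the number of hypotheses can overstate the number of meaningful rules.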
One conventional method for estimating significant association rules used a chi-squared test to look for correlated association rules, but did not take into account the number of hypotheses that were being tested. Another conventional method followed a similar idea in arguing that a rule $X \Rightarrow Y$ is not interesting if $\mathrm{support}(X \Rightarrow Y) \approx \mathrm{support}(X) \times \mathrm{support}(Y)$, but once again did not consider the number of hypotheses. It is desirable, however, to provide a system and method for discovering predictive association rules which takes into account the number of hypotheses and removes statistically insignificant association rules, and it is to this end that the present invention is directed.
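The second conventional criterion described above amounts to an independence check, which can be sketched as follows. The function names, the `tolerance` parameter and the toy data are hypothetical, and the sketch illustrates only the prior-art idea, not the invention; like the methods it illustrates, it ignores the number of hypotheses tested.

```python
from typing import FrozenSet, Iterable, List


def independence_ratio(transactions: List[FrozenSet[str]],
                       x: Iterable[str], y: Iterable[str]) -> float:
    """Return support(X union Y) / (support(X) * support(Y)).

    A value close to 1 means X and Y co-occur about as often as they
    would if they were statistically independent.
    """
    xs, ys = frozenset(x), frozenset(y)
    n = len(transactions)
    n_x = sum(1 for t in transactions if xs <= t)
    n_y = sum(1 for t in transactions if ys <= t)
    n_xy = sum(1 for t in transactions if (xs | ys) <= t)
    if n_x == 0 or n_y == 0:
        raise ValueError("both X and Y must occur in at least one transaction")
    # Algebraically equal to (n_xy / n) / ((n_x / n) * (n_y / n)).
    return (n_xy * n) / (n_x * n_y)


def is_uninteresting(transactions: List[FrozenSet[str]],
                     x: Iterable[str], y: Iterable[str],
                     tolerance: float = 0.1) -> bool:
    """True if the rule X => Y looks like mere independence.

    `tolerance` is an arbitrary, hypothetical cut-off for "approximately".
    """
    return abs(independence_ratio(transactions, x, y) - 1.0) <= tolerance


# Hypothetical data in which X and Y are independent:
# support(X) = 0.5, support(Y) = 0.4, support(X and Y) = 0.2.
independent = ([frozenset({"x", "y"})] * 2 + [frozenset({"x"})] * 3 +
               [frozenset({"y"})] * 2 + [frozenset({"z"})] * 3)

# Hypothetical data in which X and Y always occur together.
correlated = [frozenset({"x", "y"})] * 4 + [frozenset({"z"})] * 6

print(is_uninteresting(independent, {"x"}, {"y"}))
print(is_uninteresting(correlated, {"x"}, {"y"}))
```

In the first data set the ratio is exactly 1, so the rule would be discarded as uninteresting; in the second the ratio is 2.5, so the rule would be kept.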