1. Field of the Invention
This invention relates generally to data processing, and more particularly to "computer database mining," in which association rules that characterize relationships between significant transactions recorded in a database are identified. In particular, the invention concerns the identification (i.e., mining) of rules in a large database of "dense" data transactions using one or more constraints during the mining process.
2. Description of the Related Art
Customer purchasing habits can provide invaluable marketing information for a wide variety of applications. This type of data may be known as market basket data. For example, retailers can create more effective store displays and more effectively control inventory than otherwise would be possible if they know that, given a consumer's purchase of a first set of items (a first itemset), the same consumer can be expected, with some degree of likelihood of occurrence, to purchase a particular second set of items (a second itemset) along with the first set of items. In other words, it is helpful from a marketing standpoint to know the association between the first itemset and the second itemset (the association rule) in a given data-set. For example, it would be desirable for a retailer of automotive parts and supplies to be aware of an association rule expressing the fact that 90% of the consumers who purchase automobile batteries and battery cables (the first itemset) also purchase battery post brushes and battery post cleansers (the second itemset, referred to as the "consequent" in the terminology used in the present description). Market basket data is data in which there are one or more data elements representing purchased items, such as bread, milk, eggs, pants, etc., in a transaction, such as an individual consumer purchase. For market basket data, no data element is restricted to a limited predetermined set of values, such as male or female, whose values would therefore occur frequently. For example, the first data element in any transaction may be any item which may be purchased by the consumer, so that one cannot assume, for example, that the first data element contains a milk item. Thus, since each data element may have a wide variety of values, the market basket data is not "dense" data.
Other types of data, however, such as telecommunications data, census data and data typical of classification and predictive modeling tasks, may be "dense" data. A dataset may be considered to contain "dense" data if a particular data element in each transaction may have a predetermined set of frequent values. For example, each transaction in census data may contain the same first data element containing a data field with information about the gender of the person represented by the transaction. In addition, this gender data element may only have two values (i.e., "male" or "female") which means that these two values must appear very frequently in the dataset. In fact, most "dense" data has multiple data elements which have a predetermined set of frequent values.
Until recently, building large detailed databases that could chronicle thousands or even millions of transactions was impractical. In addition, the derivation of useful information from these large databases (i.e., mining the databases) was highly impractical due to the large amount of data in the database, which required an enormous amount of computer processing time to analyze. Consequently, in the past, marketing and advertising strategies have been based upon anecdotal evidence of purchasing habits, if any at all, and thus have been susceptible to inefficiencies in consumer targeting that have been difficult if not impossible to overcome.
Modern technology, such as larger, faster storage systems and faster microprocessors, has permitted the building of large databases of consumer transactions and other types of data. However, building a transactions database is only part of the challenge. Another important part of the challenge is mining the database for useful information, such as the association rules. The database mining, however, becomes problematic as the size of the database expands into the gigabyte or terabyte range.
Not surprisingly, many methods have been developed for mining these large databases. The problem of mining association rules from large databases was first introduced in 1993 at the ACM SIGMOD Conference on Management of Data in a paper entitled "Mining Association Rules Between Sets of Items in Large Databases" by Rakesh Agrawal, Tomasz Imielinski and Arun Swami. In general, the input from which association rules are mined consists of a set of transactions where each transaction contains a set of literals (i.e., items). Thus, let $I = \{l_1, l_2, \ldots, l_m\}$ be a set of literals called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Therefore, a transaction $T$ contains a set $A$ of some items in $I$ if $A \subseteq T$.
An association rule is an implication of the form $A \Rightarrow B$, where $A \subseteq I$, $B \subseteq I$, $A \cap B = \emptyset$ and $B$ is the consequent of the rule. The rule $A \Rightarrow B$ holds true in the transaction set $D$ with a confidence "c" if c % of transactions in $D$ that contain $A$ also contain $B$ (i.e., the confidence is the conditional probability $p(B \mid A)$). The rule $A \Rightarrow B$ has support "s" in the transaction set $D$ if s transactions in $D$ contain $A \cup B$ (i.e., the support is the probability of the intersection of the events). The support s may also be specified as a percentage of the transactions in the data-set that contain $A \cup B$. An example of an association rule is that 30% of the transactions that contain beer and potato chips also contain diapers and that 2% of all transactions contain all of these items. In this example, 30% is the confidence of the association rule and 2% is the support of the rule. The typical problem is to find all of the association rules that satisfy user-specified constraints. As described above, this mining of association rules may be useful, for example, to such applications as market basket analysis, cross-marketing, catalog design, loss-leader analysis, fraud detection, health insurance, medical research and telecommunications diagnosis.
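The support and confidence definitions above can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the five toy transactions are assumptions made purely for illustration; they do not come from the cited paper or from any particular mining system.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate p(B | A): among transactions containing A, the share that also contain B."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        return 0.0  # the rule never applies, so report zero confidence
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a


# Hypothetical toy data-set (illustrative only, not real purchase data).
transactions = [
    frozenset({"beer", "chips", "diapers"}),
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "milk"}),
    frozenset({"milk", "bread"}),
    frozenset({"bread"}),
]

A = {"beer", "chips"}
B = {"diapers"}

# One of five transactions contains all of A and B.
rule_support = support(transactions, A | B)
# Three transactions contain A; one of them also contains B.
rule_confidence = confidence(transactions, A, B)
```

In this toy data-set the rule from beer and chips to diapers has a support of one transaction in five and a confidence of one in three, computed exactly as the definitions prescribe.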
Most conventional data mining systems and methods, such as a method known as Apriori and its descendants, were developed to find association rules in market basket data, which is not dense data. The problem is that these conventional systems, when faced with dense data such as census data, experience an exponential explosion in the computing resources required. In particular, these conventional systems mine all association rules (also referred to simply as rules) satisfying a minimum support constraint, and then enforce other constraints during a post-processing filtering step. Thus, for the dense census data, any transaction containing male or female may be mined. However, this generates too many rules to be useful and takes too much time. During the post-processing, the total number of rules may be reduced by applying a minimum predictive accuracy constraint, such as minimum confidence, lift, interest or conviction. However, even with these additional post-processing constraints, these conventional systems still generate too many rules for dense data, which 1) take too long to generate, and 2) cannot be easily comprehended by the user of the system.
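The two-phase approach described above (mine everything that meets minimum support, then filter by confidence afterwards) can be sketched in Python as follows. This is a simplified level-wise search in the spirit of Apriori, not a reproduction of any published implementation; the function names and both toy data-sets are hypothetical and chosen only to show how dense attribute-value data yields more frequent itemsets than sparse basket data at the same support threshold.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search returning {itemset: count} for itemsets meeting min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: k for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        # Join frequent itemsets of this level that differ in exactly one item.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Discard candidates that have an infrequent subset.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


def rules_after_filter(frequent, min_confidence):
    """Post-processing step: derive rules, keep those meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                a = frozenset(ante)
                conf = count / frequent[a]  # every subset of a frequent itemset is frequent
                if conf >= min_confidence:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical dense data: every record carries a value for the same attributes.
dense = [
    {"sex=F", "emp=yes", "home=own"},
    {"sex=F", "emp=yes", "home=own"},
    {"sex=M", "emp=yes", "home=own"},
    {"sex=F", "emp=yes", "home=rent"},
]
# Hypothetical sparse market basket data of the same size.
sparse = [
    {"bread", "milk"},
    {"eggs"},
    {"pants", "milk"},
    {"bread", "soap"},
]

dense_frequent = frequent_itemsets(dense, min_support=0.5)
sparse_frequent = frequent_itemsets(sparse, min_support=0.5)
all_dense_rules = rules_after_filter(dense_frequent, min_confidence=0.0)
strong_dense_rules = rules_after_filter(dense_frequent, min_confidence=0.9)

print(len(dense_frequent), len(sparse_frequent))
print(len(all_dense_rules), len(strong_dense_rules))
```

Note that the confidence threshold is applied only after every frequent itemset has been enumerated, so it does nothing to reduce the cost of the search itself. With more attributes per record, the number of frequent itemsets in the dense case grows rapidly, which is the behavior the paragraph above describes.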
There are also other conventional data mining systems for "dense" data, such as heuristic or "greedy" rule miners, which try to find any rules which satisfy a given constraint. An example of a greedy miner is a decision tree induction system. These conventional systems generate any rules satisfying the given constraints or a single rule satisfying the constraints, but do not necessarily generate a complete set of rules which may satisfy the given constraints. These conventional systems also do not attempt to determine a "best" rule (e.g., most predictive) so that, at best, an incomplete set of rules, none of which may be a best rule, may be generated which is not useful to the user of the system.
Other conventional methods have investigated incorporating item constraints on the set of frequent itemsets in an effort to provide faster association rule mining. These constraints, however, only restrict which items or combinations of items are allowed to participate in mined rules. In addition, for these methods to work efficiently on many dense data-sets, the user must specify very strong constraints that bound the length of the frequent itemsets, which is not always possible given a user's potentially limited understanding of the data. There is also some work on ranking association rules using interest measures. However, because these measures are applied only during post-processing, it is unclear how they could be exploited to make mining on dense data-sets feasible. It is desirable to be able to generate a complete set of rules for dense data, which cannot be accomplished by these conventional systems.
Therefore, a system and method for constraint-based mining of dense data-sets which avoids the above-identified and other problems of the conventional systems and methods is needed, and it is to this end that the present invention is directed.