1. Field of the Invention
The present invention generally relates to data mining and, more particularly, to mining association rules with numerical weighted attribute values.
2. Background Description
Data mining is the extraction of implicit knowledge and discovery of interesting characteristics and patterns that are not explicitly presented in the data. These techniques can play an important role in presenting data in a concise manner and accommodating data semantics. During recent years, one active topic in this field is association rule discovery which is introduced in R. Agrawal, T. Imielinski and A. Swami, xe2x80x9cMining association rules between set of items in large databasesxe2x80x9d, Proceedings of the 1993 ACM SIGMOD Conference on Management of Data, 1993. In general, the problem can be modeled as follows: let I be a set of items and T be a set of transactions, each of which consists of a subset of items in I. An association rule is an implication in the form of xe2x80x9cXxe2x86x92Yxe2x80x9d, where X and Y are sets of disjoint item sets. xe2x80x9cXxe2x86x92Yxe2x80x9d holds in T with support s and confidence c if s % transactions contain all items in the set xe2x80x9cX∪Yxe2x80x9d and c % transactions which contain all items in X also contain all items in Y. The goal of association rule mining process is to find all these rules with respect to some minimum s and c. An item (or itemset) is called a xe2x80x9cfrequentxe2x80x9d item (or itemset) if its support is at least s.
These conventional association rules have been widely used in many application domains. In the market basket problem, each piece of merchandise in a supermarket can be viewed as an item, and the set of items that a customer purchases at one time period can be viewed as a transaction. The association rules represent the likelihood of items being purchased together by a customer. Thus, this knowledge can be used to guide sale promotions. For example, if there is a high confidence between soda and snacks, then the supermarket may reduce the price of soda to increase the sale of snacks.
However, the traditional association rules focus on binary attributes. In other words, this approach only considers whether an item is present in a transaction, but does not take into account the weight/intensity of an item within a transaction. For example, a customer may purchase ten bottles of soda and five bags of snacks and another may purchase four bottles of soda and one bag of snacks at a time. However, these two transactions will be the same in the conventional association rule approach. This could lead to loss of some vital information; i.e., intensity or weight of each item in a transaction. For example, if a customer buys more than seven bottles of soda, he or she is likely to purchase three or more bags of snacks; otherwise, the purchase tendency of soda is not strong. The traditional association rule can not express this type of relationship. With this knowledge, the supermarket manager may set a promotion such as xe2x80x9cif a customer buys six bottles of soda, he or she can get two free bags of snacks.xe2x80x9d
A new variation of association rules has been proposed in R. Srikant, and R. Agrawal, xe2x80x9cMining quantitative association rules in large relational tablesxe2x80x9d, Proceedings of the 1996 ACM SIGMOD Conference on Management of Data Montreal, Canada, June, 1996. The domain of each quantitative attribute is divided into a set of intervals (i.e., buckets, ranges) via equi-depth partitioning; i.e., each interval is with a similar number of attribute values. Intervals are combined as long as their support is less than the user specified max-support. Each of original intervals and the combined intervals is treated as a different item. Thus, the attribute value of the newly xe2x80x9cgeneratedxe2x80x9d items is binary: xe2x80x9c1xe2x80x9d stands for that the original attribute value is within the specified interval; xe2x80x9c0xe2x80x9d stands for that the original attribute value is not within the specified interval. The algorithms for mining traditional association rules can then be applied on these newly generated binary items.
During the mapping process, one numerical attribute will be mapped into O(k2) binary attributes where k is the number of intervals for one attribute domain. When the number of numerical attributes is large, the number of mapped binary attributes could be unmanageable.
It is therefore an object of the present invention to provide an efficient method for mining a specific type association rule, the weighted association rule.
According to the invention, The traditional association rule problem is extended by allowing a weight to be associated with each item in a transaction to reflect interest/intensity of each item within the transaction. In turn, this provides an opportunity to associate a weight parameter with each item in a resulting association rule, called a weighted association rule (WAR). For example, xe2x80x9csoda [4, 6]xe2x86x92snack [3, 5]xe2x80x9d is a weighted association rule indicating that if a customer purchases soda in the quantity between four and six bottles, he or she is likely to purchase three to five bags of snacks. Thus, WARs can not only improve the confidence of the rules, but also provide a mechanism to do more effective target marketing by identifying or segmenting customers based on their potential degree of loyalty or volume of purchases.
There can be a very large number of items in any given problem, and every item has a numerical attribute, although only a small fraction of items are present in a transaction. Thus, according to the preferred method of the present invention, the frequent itemsets are first generated (without considering weights). This method differs from the prior art. The weight range of each item in each frequent itemset is divided into a sequence of intervals, referred to as base intervalsxe2x80x9d. Then the weight domain space of each frequent itemset is partitioned into fine grids. A grid is a multi-dimensional cube whose span on each dimension is exactly one base interval. The density of a grid is defined as the ratio of the actual number of transactions that belong to it over the expected number of transaction in it, assuming random uniform distribution. A density threshold is used to separate the transaction concentrating regions from the rest. A grid is called a xe2x80x9cdense gridxe2x80x9d if its density is above the pre-specified threshold. Otherwise, it is called a xe2x80x9csparse gridxe2x80x9d. A xe2x80x9cdense regionxe2x80x9d is the union of a set of adjacent dense grids. WARs can be identified based on these xe2x80x9cdensexe2x80x9d regions.
In reality, the number of valid weighted association rules (WARs) could be very large, and a user may not be interested in all WARs, but rather a small subset of them. For example, if the inventory of soda becomes too large in a supermarket, the manager may be only interested in the WARs involving soda. It would be desirable if the user can interact with the mining system to obtain a set of interesting WARs by varying the parameter values. Inspired by this, the present method supports both batch mode and interactive mode. In the batch mode, all qualified WARs are returned. On the other hand, in the interactive mode, a user not only can choose a set of interesting frequent itemsets to further investigate, but also can vary the support, interest, and density thresholds, so that only those qualified WARs for the interesting itemsets, are returned.
The present invention can be summarized as follows.
1) A new class of association rule problem, WAR, is proposed.
2) Due to the nature of this problem, the mining process is accomplished by a threefold approach:
a) First generating frequent itemsets and then deriving WARs from each frequent itemset.
b) During the frequent itemset generation process, the associated weight is ignored. An itemset is a frequent itemset if this itemset appears in at least a certain percentage of the transactions.
c) During the WAR derivation process, the weight domain space is partitioned in such a manner that the total number of grids is limited by some parameter N so that the available memory/storage space can be fully utilized. The concept of density is employed to separate transaction-concentrated, or dense, regions from the rest. An efficient ordered shrinkage algorithm is proposed to derive WARs from a high density region through shrinkages.