1. Field of the Invention
The present invention relates generally to online searching for data dependencies in large databases and more particularly to an online method of data mining.
2. Discussion of the Prior Art
Data mining, also known as knowledge discovery in databases, has been recognized as an important new area for database research with broad applications. With the recent popularity of the Internet, the problem of mining rules online has become significant because large databases can now be accessed over the Internet. The ability to gain access to such large databases without significant access delay is a primary goal of an on-line data miner.
In general, data mining is a process of nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. The discovered knowledge can be applied to information management, query processing, decision making, process control, and many other applications. Furthermore, several emerging applications in information providing services, such as on-line services and the World Wide Web, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase business opportunities. Since it is difficult to predict what exactly could be discovered from a database, a high-level data mining query should be treated as a probe which may disclose some interesting traces for further exploration. Interactive discovery should be encouraged, which allows a user to interactively refine a data mining request for multiple purposes, including dynamically changing the data focus and flexibly viewing the data and the data mining results at multiple abstraction levels and from different angles.
A data mining system can be classified according to the kinds of databases on which the data mining is performed. In general, a data miner can be classified according to its mining of knowledge from the following different kinds of databases: relational databases, transaction databases, object-oriented databases, deductive databases, spatial databases, temporal databases, multimedia databases, heterogeneous databases, active databases, legacy databases, and the Internet information-base. In addition to the variety of databases available, several typical kinds of knowledge can be discovered by data miners, including association rules, characteristic rules, classification rules, discriminant rules, clustering, evolution, and deviation analysis. Moreover, data miners can also be categorized according to the underlying data mining techniques. For example, a data miner can be categorized according to its driving method as an autonomous knowledge miner, a data-driven miner, a query-driven miner, or an interactive data miner. It can also be categorized according to its underlying data mining approach as generalization based mining, pattern based mining, mining based on statistics or mathematical theories, or an integrated approach.
Given a database of sales transactions, it is desirable to discover the important associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. A mathematical model was proposed in Agrawal R., Imielinski T., and Swami A. Mining association rules between sets of items in very large databases, Proceedings of the ACM SIGMOD Conference on Management of data, pages 207-216, Washington D. C., May 1993, to address the problem of mining association rules.
Let U={i1, i2, . . . , im} be a set of literals called items. Let D be a set of transactions, where each individual transaction T consists of a set of items such that T is a subset of U. Note that the actual quantities of items bought in a transaction are not considered, meaning that each item is a binary (0 or 1) variable representing whether the item was bought. Let X be a set of items. A transaction T is said to contain the set of items X if and only if X is a subset of T.
An association rule is an implication or query of the form X==>Y, where both X and Y are sets of items. The idea of an association rule is to provide a systematic method by which a user can infer the presence of some sets of items, such as Y, given the presence of other items in a transaction, such as X. Such information is useful in making decisions such as customer targeting, shelving, and sales promotions.
The rule X==>Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. For example, a rule has 90% confidence when 90% of the tuples containing X also contain Y. The rule has support s if s% of the transactions in D contain (X union Y). It is often desirable to pay attention to only those rules which have reasonably large support. Rules with both high confidence and high support are referred to as strong association rules. These concepts were first introduced in Agrawal et al., supra. The task of mining association rules is essentially to discover strong association rules in large databases. The notions of confidence and support become very useful in formalizing the problem in a computationally efficient approach called the large itemset method. The large itemset approach can be decomposed into the following two steps:
1) Discover the large item sets, i.e., the item sets that have transaction support above a predetermined user defined minimum support, called minsupport.
2) Use the large item sets to generate the association rules for the database that have confidence above a predetermined user defined minimum confidence, called minconfidence.
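As a rough illustration of the support definition and of step 1, the following Python sketch finds large itemsets by brute-force enumeration. It is not the counting method of Agrawal et al.; the function names, the toy transactions, and the choice to treat "above minsupport" as "greater than or equal to" are assumptions made only for this example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def large_itemsets(transactions, minsupport):
    """Step 1 by brute force: map each large itemset to its support.

    Assumption: an itemset is "large" when support >= minsupport.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= minsupport:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            # Every subset of a large itemset is itself large, so if no
            # itemset of this size is large, no bigger one can be either.
            break
    return result


# Hypothetical toy database D of four transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

large = large_itemsets(transactions, minsupport=0.5)
for itemset, s in large.items():
    print(sorted(itemset), s)
```

With these toy transactions and minsupport=0.5, the three single items and the pairs {bread, milk} and {bread, butter} qualify, while {butter, milk} and the full triple do not.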
Given an itemset S={I1, I2, . . . , Ik}, we can use it to generate at most k rules of the type [S-{Ir}]==>Ir for each r in {1, . . . , k}. Once these rules have been generated, only those rules above a certain user defined threshold called minconfidence are retained.
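The rule generation just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names and the toy transactions are hypothetical, and a rule is retained when its confidence is greater than or equal to minconfidence.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, transactions, minconfidence):
    """Generate the rules [S - {Ir}] ==> Ir and keep the confident ones.

    Returns a list of (antecedent, consequent, confidence) tuples.
    """
    itemset = frozenset(itemset)
    whole_count = support_count(itemset, transactions)
    kept = []
    for consequent in sorted(itemset):
        antecedent = itemset - {consequent}
        if not antecedent:
            continue  # a single-item set yields no rule with a non-empty left side
        antecedent_count = support_count(antecedent, transactions)
        if antecedent_count == 0:
            continue  # confidence is undefined when the antecedent never occurs
        confidence = whole_count / antecedent_count
        if confidence >= minconfidence:
            kept.append((antecedent, consequent, confidence))
    return kept


# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rules = rules_from_itemset({"bread", "butter"}, transactions, minconfidence=0.7)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "==>", consequent, confidence)
```

Here the rule {butter}==>bread has confidence 2/2 and is retained, while {bread}==>butter has confidence 2/3 and falls below the assumed threshold of 0.7.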
The overall computational complexity of mining association rules is determined by the first step. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. Efficient counting of large itemsets was the focus of most prior work. Nevertheless, there are certain inherent difficulties with the use of the support and confidence parameters in order to establish the strength of an association rule.
After the fundamental paper on the itemset method, see Agrawal et al., supra, a considerable amount of additional work was done based upon this approach. For example, faster algorithms for mining association rules were proposed in Agrawal R., and Srikant R. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, September 1994.
A secondary measure called the interest measure was introduced in Srikant R., and Agrawal R. Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM SIGMOD Conference on Management of Data. Montreal, Canada, June 1996. A rule is defined to be R-interesting if its actual support and confidence are R times the expected support and confidence. It is important to note here that, in the algorithms previously proposed for using the interest measure, the support level remained the most critical factor in the discoverability of a rule, irrespective of whether or not an interest measure was used.
One of the difficulties of the itemset method is its inability to deal with dense data sets. Put another way, the success of the itemset approach relies on the sparsity of the data set. For example, if the probability of buying soup were around 2%, such an occurrence would be considered statistically sparse and therefore amenable to an itemset approach. This is because for a k-dimensional database, that is, a database with k purchasable items, there are 2^k possible itemsets. The sparsity of the data set ensures that the bottleneck operation (which is the generation of large itemsets) is not too expensive, because only a few of those 2^k itemsets are really large. However, some data sets may be more dense than others, and in such cases it may be necessary to set the minsupport to an unacceptably high level, in which case a lot of important rules would be lost. The issue of dense data sets becomes even more relevant when we attempt to mine association rules based upon both the presence and the absence of an item. Although the itemset approach can be extended to the situation involving both the presence and the absence of items (by treating the absence of an item as a pseudo-item), the sparsity of item presence in real transaction data may result in considerable bias towards rules corresponding to the absence of items rather than their presence.
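The effect of treating absence as a pseudo-item can be seen in a small Python sketch. The data below are hypothetical (100 transactions, with soup bought in 2 of them and bread in 5), and the helper names are assumptions made for illustration only.

```python
ITEMS = ("soup", "bread")


def num_itemsets(k):
    """Number of possible itemsets (subsets) over k purchasable items."""
    return 2 ** k


def with_absence_items(transaction, items=ITEMS):
    """Add a pseudo-item 'not X' for every item X missing from the transaction."""
    return frozenset(transaction) | {f"not {i}" for i in items if i not in transaction}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Hypothetical sparse data: soup in 2 transactions, bread in 5, neither in 93.
raw = [{"soup"}] * 2 + [{"bread"}] * 5 + [set()] * 93
extended = [with_absence_items(t) for t in raw]

soup = support({"soup"}, extended)
not_soup = support({"not soup"}, extended)
both_absent = support({"not soup", "not bread"}, extended)

print(num_itemsets(20))             # candidate space for only 20 items
print(soup, not_soup, both_absent)  # absence pseudo-items dominate the support
```

In this assumed data set, any minsupport that admits the presence item "soup" also admits nearly every combination of absence pseudo-items, which is the bias towards absence rules described above.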
Another drawback of the itemset approach with respect to dense data sets occurs when trying to mine large itemsets corresponding to [0-1] categorical data mixed with sales transaction data. For example, while trying to find the demographic nature of people buying certain items, the problem of determining an appropriate support level may often arise.
Another potential problem in the itemset approach is the lack of direct applicability of support and confidence to association rule mining. In the itemset approach, the primary factors used in generating the rules are support and confidence. This often leads to misleading associations. An example is a retailer of breakfast cereal which surveys 5000 students on the activities they participate in each morning. The data shows that 3000 students play basketball, 3750 eat cereal, and 2000 students both play basketball and eat cereal. If the user develops a data mining program with minimal support s=40%=2000/5000 and minimal confidence c=60%, the following association rule is generated:
play basketball==>eat cereal
The association rule is misleading because the overall percentage of students eating cereal is 75%, 3750/5000, which is even larger than 60%. Thus, playing basketball and eating cereal are in fact negatively associated: being involved in one decreases the chances of being involved in the other. Consider instead the following association rule:
play basketball==>(not) eat cereal
This rule has both lower support and lower confidence than the rule implying positive association, yet it is far more accurate. Thus, if we set the support and confidence sufficiently low, the two contradictory rules described above would both be generated. On the other hand, if we set the parameters sufficiently high, only the inaccurate rule would be generated. In other words, no combination of user defined support and confidence can generate only the correct association.
The use of an interest measure somewhat alleviates the problem created by the spurious association rules in a transaction database. Past work has primarily concentrated on using the interest measure as a pruning tool in order to remove the uninteresting rules in the output. However, as the basketball-cereal example illustrates, as long as an absolute value of support is still the primary determining factor in the initial itemset generation, the user must either set the initial parameter low enough that no interesting rules are lost in the output, or risk losing some important rules. In the former case, computational efficiency and computer memory may be a problem, while in the latter case rules which may be interesting from the point of view of a user are not retained. In either case, it is almost impossible to ascertain when interesting rules are being lost and when they are not.
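The arithmetic behind the basketball-cereal example can be checked with a short Python sketch. The counts are those given in the survey above; the variable names, the `passes` helper, and the ratio of confidence to the overall cereal rate (a quantity not named in the text, often called lift) are illustrative assumptions.

```python
# Survey counts from the example.
total = 5000
basketball = 3000
cereal = 3750
both = 2000

# Rule: play basketball ==> eat cereal
support_pos = both / total
confidence_pos = both / basketball

# Rule: play basketball ==> (not) eat cereal
basketball_no_cereal = basketball - both
support_neg = basketball_no_cereal / total
confidence_neg = basketball_no_cereal / basketball

# Overall fraction of students who eat cereal, for comparison.
baseline_cereal = cereal / total

# Illustrative ratio: a value below 1 means that playing basketball
# lowers the chance of eating cereal.
ratio = confidence_pos / baseline_cereal


def passes(support, confidence, minsupport, minconfidence):
    """True when a rule meets both user defined thresholds (assumed >=)."""
    return support >= minsupport and confidence >= minconfidence


print(support_pos, round(confidence_pos, 3))
print(support_neg, round(confidence_neg, 3))
print(baseline_cereal, round(ratio, 3))
print(passes(support_pos, confidence_pos, 0.4, 0.6))
print(passes(support_neg, confidence_neg, 0.4, 0.6))
```

Under the thresholds of the example (40% support, 60% confidence), only the positive rule passes, even though its confidence of about 66.7% is below the 75% of all students who eat cereal.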