In recent years, commercial businesses have been increasing the use of information-driven marketing processes, managed by database technology, to develop and implement customized marketing strategies and programs. The progress of information automation has increased the size of commercial computer databases to the point where enormous amounts of commercial numbers, facts and statistics are collected and stored; unfortunately less information of any significance is being extracted from such databases because their size has become less and less manageable. The problem is that conventional computer databases are efficient in the manner in which they store data, but inefficient in the manner of searching through data to extract useful information. Simply stated, the use of computers in business and network applications has generated data at a rate that has far outstripped the ability to process and analyze it effectively.
Data "mining," or knowledge discovery in databases, has been growing in response to this problem because computer systems cannot efficiently and accurately undertake the intuitive and judgmental interpretation of data. Computer systems can, however, undertake the quantitative aspects of data mining because they can quickly and accurately perform certain tasks that demand too much time or concentration from humans. Data mining systems are ideally suited to the time-consuming and tedious task of breaking down vast amounts of data to expose categories and relationships within the data. These relationships can then be intuitively analyzed by human experts.
Data mining systems identify and extract important information from patterns or relationships contained in available databases by sifting through immense collections of data such as marketing, customer sales, production, financial and experimental data to "see" meaningful patterns or regularities and identify what is worth noting and what is not. For example, credit card companies, telephone companies and insurers are mining their enormous collections of data for subtle patterns within thousands of customer transactions to identify risky customers or even fraudulent transactions as they are occurring. Data mining is also being used to analyze the large volume of alarms that occur in telecommunications and networking alarm data. The growing use of bar code technology at retail organizations, such as supermarkets, has resulted in millions of electronic records which, when mined, can show purchasing relationships among the various items shoppers buy. Analysis of large amounts of supermarket basket data (the items purchased by an individual shopper) can show how often items are purchased together, for example, milk, bread and butter. The results can be useful for decisions concerning inventory levels, product promotions, pricing, store layout or other factors that might be adjusted to changing business conditions.
Consider data mining of supermarket basket data. In such a situation, the supermarket contains a set of items (its products), of which each shopper transaction or purchase is a subset. In analyzing the volumes of subsets, it is desirable to find the sets of items that occur together in a significant percentage of transactions. The fraction of transactions in which a particular set of items (also referred to as an "itemset") occurs is known as the support of the itemset. An itemset is called large if its support exceeds a preselected threshold. All other itemsets are known as small itemsets. The fraction of transactions containing one itemset I that also contain another specific itemset J is known as the confidence. For example, in a market basket analysis of shopper transactions, if 60% of the transactions that contain milk also contain bread, and 15% of all transactions contain both of these items, then 15% is the support of the itemset {milk, bread} and 60% is the confidence that a transaction containing milk also contains bread.
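As a minimal sketch of these two definitions, the following Python fragment computes support and confidence over a small list of transactions. The function names (`count`, `support`, `confidence`) and the sample data are illustrative assumptions and are not taken from any described system; the data is chosen so that the figures match the milk and bread example above.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t)


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of all transactions that contain `itemset`."""
    if not transactions:
        return 0.0
    return count(itemset, transactions) / len(transactions)


def confidence(i: Iterable[str], j: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing I that also contain J."""
    i_set = frozenset(i)
    i_count = count(i_set, transactions)
    if i_count == 0:
        return 0.0
    return count(i_set | frozenset(j), transactions) / i_count


# Hypothetical data: 20 transactions, 5 with milk, 3 of those also with bread.
transactions = (
    [frozenset({"milk", "bread"})] * 3
    + [frozenset({"milk"})] * 2
    + [frozenset({"eggs"})] * 15
)

milk_bread_support = support({"milk", "bread"}, transactions)          # 3 / 20
milk_to_bread_confidence = confidence({"milk"}, {"bread"}, transactions)  # 3 / 5
```

With this data, 15% of all transactions contain both items and 60% of the milk transactions also contain bread, which reproduces the figures in the example.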
The objective of data mining systems is to uncover relationships or associations between the presence of various itemsets in transactions based on support and confidence factors (called "association rules"). The end result of a data mining operation is the generation of association rules that satisfy user-specified minimum support and confidence constraints for itemsets. These rules are formulated probability rules that are indicative of the frequency association between different items uncovered in the multitude of records.
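The step from a large itemset to association rules can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not a described implementation: the function name `rules_from_itemset`, the tuple layout of a rule, and the sample data are all hypothetical. It splits one itemset into every antecedent and consequent pair and keeps the pairs whose confidence meets a user-specified minimum.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Tuple

Transaction = FrozenSet[str]
# (antecedent, consequent, support, confidence); layout is an assumption.
Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float]


def count(itemset: FrozenSet[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset: Iterable[str],
                       transactions: List[Transaction],
                       min_confidence: float) -> List[Rule]:
    """Enumerate rules A -> (itemset - A) that meet `min_confidence`."""
    full = frozenset(itemset)
    full_count = count(full, transactions)
    rules: List[Rule] = []
    if not transactions or full_count == 0:
        return rules
    rule_support = full_count / len(transactions)
    # Try every non-empty proper subset of the itemset as the antecedent.
    for size in range(1, len(full)):
        for combo in combinations(sorted(full), size):
            antecedent = frozenset(combo)
            conf = full_count / count(antecedent, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, full - antecedent, rule_support, conf))
    return rules


# Hypothetical data: 20 transactions in total.
transactions = (
    [frozenset({"milk", "bread"})] * 3
    + [frozenset({"milk"})] * 2
    + [frozenset({"bread"})] * 1
    + [frozenset({"eggs"})] * 14
)

rules = rules_from_itemset({"milk", "bread"}, transactions, min_confidence=0.7)
```

Here the rule from bread to milk has confidence 3/4 and is kept, while the rule from milk to bread has confidence 3/5 and is discarded at a 0.7 threshold. The sketch assumes the itemset passed in has already been found to satisfy the minimum support constraint.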
One of the better known methods for finding large itemsets is the Apriori method described in the publication, Fast Algorithms for Mining Association Rules, by R. Agrawal and R. Srikant, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994. To discover large itemsets, the Apriori method makes multiple passes over the transaction records. In the first pass, it counts the support of individual items to determine which of them are large, i.e., have minimum support, and which of them are small. In each subsequent pass, this method starts with a seed set of itemsets found to be large in the previous pass. This seed set is used for generating new potentially large itemsets, called "candidate" itemsets, and the actual support for these candidate itemsets is counted during the pass over the data. At the end of the pass over the transactions, the candidate itemsets that are actually large are identified, and they become the seed for the next pass.
A fundamental premise of the Apriori method is that any subset of a large itemset must also be large. Therefore, candidate itemsets can be generated by joining itemsets already found to be large, and eliminating those candidates that contain a subset which has not been found to be large. This process continues, pass after pass over the data, until no new large itemsets are found. Association rules are then constructed from the large itemsets uncovered, retaining those rules that exceed the confidence threshold.
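The pass structure described above can be sketched in Python as follows. This is a simplified illustration of the level-wise idea, not the published algorithm's exact candidate-generation procedure or data structures; the function name `apriori`, the choice to treat an itemset as large when its support is at least `min_support`, and the sample data are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set

Itemset = FrozenSet[str]


def apriori(raw_transactions: Iterable[Iterable[str]],
            min_support: float) -> Dict[Itemset, float]:
    """Return every large itemset together with its support."""
    transactions: List[Itemset] = [frozenset(t) for t in raw_transactions]
    n = len(transactions)
    if n == 0:
        return {}

    def count_large(candidates: Set[Itemset]) -> Dict[Itemset, float]:
        # One pass over the data: count each candidate, keep the large ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # First pass: individual items.
    large = count_large({frozenset([i]) for t in transactions for i in t})
    result = dict(large)
    size = 2
    while large:
        candidates: Set[Itemset] = set()
        # Join step: merge pairs of large itemsets from the previous pass.
        for a, b in combinations(list(large), 2):
            union = a | b
            if len(union) != size:
                continue
            # Prune step: every (size - 1)-subset must itself be large.
            if all(frozenset(s) in large
                   for s in combinations(union, size - 1)):
                candidates.add(union)
        large = count_large(candidates)  # seed for the next pass
        result.update(large)
        size += 1
    return result


# Hypothetical basket data.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "candy"},
]
large_itemsets = apriori(baskets, min_support=0.6)
```

With these five baskets and a minimum support of 0.6, the large itemsets are the three single items milk, bread and butter and the two pairs {milk, bread} and {bread, butter}. The candidate {milk, bread, butter} is pruned without being counted, because its subset {milk, butter} was not found to be large.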
One shortcoming of the Apriori method is that as the size of the database increases, the number of items searched increases, as does the number of association rules that are generated. In very large databases, the user is left with a large amount of quantitative association information. However, in practice users are often interested in only a subset of associations, for instance, those containing items from a subset of items that have very different levels of importance. In the market basket example, some items like caviar or lobster are of much higher value than items such as candy. Association rules involving {lobster, caviar} will have less support than those involving candy, but are much more significant in terms of profits earned by the store. Under the Apriori method, the itemset {lobster, caviar} has low support and will therefore not be included in the association rules that are uncovered.
A more recent data mining technique that attempts to avoid some of the limitations of the Apriori method is that disclosed by H. Toivonen in the paper, Sampling Large Databases for Association Rules, Proceedings of the 22nd VLDB Conference, Bombay, India, 1996. Toivonen presents a database mining method which picks a random sample of records from the database, uses it to determine the relationship or pattern on the assumption that it probably holds for the entire database, and then verifies the results with the rest of the database.
The method uses the random sample and makes a series of passes over the data to determine which items are frequently found. Each pass builds on the previous collection of frequently found items until the method finds a superset from the collection of frequently found subsets. This approach attempts only one full pass over the database, and two passes in the worst case. In order to increase accuracy, the method is fairly conservative in its estimation, so it must count many more itemsets than are actually required in one pass.
This method uses a random sample of the relation to find approximate associations, and applies those results to the entire database. The significant shortcoming of the Toivonen method, however, is that it also results in a large volume of association rules that militates against accurate interpretation, and it lacks the ability to apply user-defined value attributes to the itemsets.
In most problem domains, it does not make sense to assign equal importance to all of the items involved in the data mining analysis. Consequently, existing methods for generating association rules in practical data mining applications suffer from two basic drawbacks: (i) the volume of results is typically very large and it is hard for the user to draw conclusions from the numerous association rules which are produced, and (ii) certain results, produced from itemsets in which the individual items or transactions have very different levels of importance, are not included.
Because of the shortcomings of the current data mining techniques, what is needed is a method and apparatus for accurately finding large itemsets while providing the user the ability to assign distinct values or attributes to different items or transactions in the database, and thereby provide more qualitative association rules.
Accordingly, it is an object of the present invention to provide a data mining method and apparatus that provides preselected value weights to items and/or transactions to generate association rules that meet user-defined thresholds of importance.
It is still another object of the present invention to accomplish the above-stated object by utilizing a data mining method and apparatus which is simple in design and use, and efficient to perform with regard to database activity.
The foregoing objects and advantages of the invention are illustrative of those that can be achieved by the present invention and are not intended to be exhaustive or limiting of the possible advantages which can be realized. Thus, these and other objects and advantages of the invention will be apparent from the description herein or can be learned from practicing the invention, both as embodied herein or as modified in view of any variation which may be apparent to those skilled in the art. Accordingly, the present invention resides in the novel methods, arrangements, combinations and improvements herein shown and described.