1. Field of the Invention
This invention relates in general to computer implemented data mining, and in particular to using object relational extensions for mining association rules.
2. Description of Related Art
There has been a rapid growth in the automation of data collection procedures in the last decade. This has led to a vast growth in the amount of usable data. Translating this usable data to useful information requires the use of a variety of data mining and knowledge extraction techniques. Accompanying these developments has been the growth of reliable, highly optimized relational database systems. As more and more data stores begin to rely on these database systems, the integration of the mining techniques with the database systems becomes desirable. However, efficient utilization of database systems as mining engines requires some modifications to the relational database system and to data organization.
Data mining is the process of finding interesting patterns in data. Data mining retrieves interesting data from a very large database, such as a database describing existing, past, or potential clients that may have thousands of attributes. A database is a set of records that are described by a set of attributes which have values.
Conventional data mining techniques do not work well on a database with a large number of attributes. In particular, most conventional data mining techniques only work on data in memory. Therefore, if the data is so large that it must be stored other than in memory, the data mining techniques will move data into memory to operate on the data, which is inefficient both in terms of memory usage and time.
The successful automation of data collection and the growth in the importance of information repositories have given rise to numerous data stores, ranging from those of large scientific organizations, banks and insurance companies, to those of small stores and businesses. The abundance of data has required the use of innovative and intricate data warehousing and data mining techniques to summarize and make use of the data.
There has been significant activity in developing new techniques for knowledge extraction, which is described further in G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases, AAAI/MIT Press, 1991, which is incorporated by reference herein. Some of the new techniques are for classification of data, which is further described in S. M. Weiss and C. A. Kulikowski, Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems, Morgan-Kaufmann, 1991; R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami, An Interval Classifier For Database Mining Applications, Proceedings of the 18th International Conference on Very Large Databases, pages 560-573, August 1992; which are incorporated by reference herein.
Some of the new techniques for knowledge extraction are for clustering of data. T. Zhang, R. Ramakrishnan, and M. Livny, Birch, An Efficient Data Clustering Method For Very Large Databases, Proceedings of the 1996 ACM SIGMOD International Conference of Management of Data, 1996; R. T. Ng and J. Han, Efficient And Effective Clustering Methods For Spatial Data Mining, Proceedings of the 20th International Conference on Very Large Databases, 1994; A. K. Jain and R. C. Dubes, Techniques For Clustering Data, Prentice-Hall, 1988; L. Kaufman and P. J. Rousseeuw, Finding Groups In Data--An Introduction To Cluster Analysis, Wiley, 1990, which are incorporated by reference herein.
Some of the techniques for knowledge extraction are for discovery of association rules, and association rules are derived from and used to represent frequently occurring patterns within the database. R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets Of Items In Large Databases, Proceedings of SIGMOD '93, pages 207-216, May 1993; R. Agrawal and R. Srikant, Fast Techniques For Mining Association Rules, Proceedings of the 20th International Conference on Very Large Databases, September 1994, [hereinafter "Fast Techniques For Mining Association Rules"]; M. Houtsma and A. Swami, Set-Oriented Mining Of Association Rules, Technical Report RJ 9567, IBM Almaden Research Center, October 1993, [hereinafter "Set-Oriented Mining of Association Rules"]; J. S. Park, M. S. Shen, and P. S. Yu, An Effective Hash Based Technique For Mining Association Rules, Proceedings of SIGMOD '95, May 1995; R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, Fast Discovery Of Association Rules, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, edited by U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1995; H. Toivonen, Sampling Large Databases For Association Rules, Proceedings of the 22nd International Conference on Very Large Databases; A. Savasere, E. Omiecinski, and S. Navathe, An Efficient Technique For Mining Association Rules In Large Databases, Proceedings of the 21st International Conference on Very Large Databases, September 1995; H. Mannila, H. Toivonen, and A. I. Verkamo, Efficient Techniques For Discovering Association Rules, Technical Report WS-94-03, American Association for Artificial Intelligence, 1994; R. Srikant and R. Agrawal, Mining Generalized Association Rules, Proceedings of the 21st International Conference on Very Large Databases, September 1995; J. Han and Fu, Discovery Of Multiple-Level Association Rules From Large Databases, Proceedings of the 21st International Conference on Very Large Databases, September 1995; J. Han, Y. Cai, and N. Cercone, Data-Driven Discovery Of Quantitative Rules In Relational Databases, IEEE Transactions on Knowledge and Data Engineering, Vol. 5(1), pages 29-40, 1993; R. Srikant and R. Agrawal, Mining Quantitative Association Rules In Large Relational Tables, Proceedings of the 1996 ACM SIGMOD International Conference of Management of Data, 1996; T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, Mining Optimized Association Rules For Numeric Attributes, Proceedings of the 1996 ACM Symposium on Principles of Database Systems, 1996; R. Miller and Y. Yang, Association Rules Over Interval Data, Proceedings of SIGMOD '97, 1997; which are incorporated by reference herein.
Some of the techniques for knowledge extraction are for sequential patterns. R. Agrawal and R. Srikant, Mining Sequential Patterns, Proceedings of the 11th International Conference on Data Engineering, March 1995, which is incorporated by reference herein. Some of the techniques for knowledge extraction are for similarities in ordered data. R. Agrawal, C. Faloutsos, and A. Swami, Efficient Similarity Search In Sequence Databases, 4th International Conference on Foundations of Data Organization and Techniques, October 1993; C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, Fast Subsequence Matching In Time-Series Databases, Proceedings of SIGMOD '94, May 1994; R. Agrawal, K. I. Lin, H. S. Sawhney, and K. Shim, Fast Similarity Search In The Presence Of Noise, Scaling And Translation In Time-Series Databases, Proceedings of the 21st International Conference on Very Large Databases, September 1995; R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait, Querying Shapes Of Histories, Proceedings of the 21st International Conference on Very Large Databases, September 1995; which are incorporated by reference herein.
Houtsma and Swami, in "Set-Oriented Mining of Association Rules", proposed SETM, an SQL based technique for association. Their technique uses simple database operations (e.g., sorting and merge-scan joins) for performing association. However, their joins are more expensive because they are performed against the input data table, and their technique lacks efficient candidate set pruning such as that of the Apriori technique.
The size and growth of the data stores, matched by the growing reliability and large-volume handling capability of relational database systems, have caused much of the data to be managed by these database systems. The enhancement of database systems for query optimization and parallelization, and their widening portability across a multitude of system architectures, have made the integration of data mining techniques with the database system an attractive proposition. The integration of data mining applications and database systems, however, requires appropriate data organization, some modifications and/or enhancements in the database systems, and either changes in existing data mining techniques or entirely new ones.
From a database performance perspective, a very important data mining application is "association". An association rule is a grouping of attribute value pairs. The problem of mining for association rules was introduced initially for market-basket analysis. In market-basket analysis, the association rules provide associations between the sets of items purchased together in a transaction. In general, an association rule has the form A ⇒ B, where A and B are two disjoint sets of items. The association rule conveys that the occurrence of set A in a transaction implies that the set B also occurs in the same transaction.
The term support is used to refer to the frequency of observation of such a rule in the data. Support of a rule is defined as the ratio of the number of transactions supporting the rule to the total number of transactions in the database, where a transaction is a collection of attribute-value pairs. For example, if the attribute value pairs attribute2-value2 and attribute5-value5 occurred together in five percent of the transactions in the database, then the support of the rule is said to be five percent.
The term confidence is used to refer to the fraction of the transactions containing A that also contain B. Thus, support is the joint probability for A and B to occur together in a transaction, and confidence is the conditional probability for B to be found in a transaction given that A is found in it. For the generation of such rules from data mining, the user provides the minimum required support and confidence values. Then, all rules that have at least the minimum required support and confidence are generated.
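The definitions of support and confidence given above can be illustrated with a minimal Python sketch. The transaction data, the item names, and the function names (support, confidence, mine_rules) are hypothetical and chosen only for illustration. The rule enumeration is a brute-force pass suitable for tiny inputs; it is not the Apriori technique or any technique from the cited references.

```python
from itertools import combinations

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """Fraction of the transactions containing A that also contain B."""
    support_a = support(a, transactions)
    if support_a == 0:
        return 0.0
    return support(frozenset(a) | frozenset(b), transactions) / support_a

def mine_rules(transactions, min_support, min_confidence):
    """Enumerate every rule A => B meeting both user-supplied thresholds.

    Brute force over all itemsets (an assumption made for brevity).
    """
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s = support(itemset, transactions)
            if s < min_support:
                continue
            # Split the itemset into disjoint, non-empty sets A and B.
            for k in range(1, size):
                for a in combinations(itemset, k):
                    b = tuple(i for i in itemset if i not in a)
                    c = confidence(a, b, transactions)
                    if c >= min_confidence:
                        rules.append((a, b, s, c))
    return rules

rules = mine_rules(transactions, min_support=0.5, min_confidence=0.6)
```

In this hypothetical data, bread and milk occur together in two of the four transactions, so the rule bread ⇒ milk has a support of one half, and since bread occurs in three transactions, its confidence is two thirds.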
A study of association shows that the physical data model used for the data input to the technique that generates association rules has a significant impact on the performance of the technique. A common physical data model used in market-basket analysis, referred to as the single-column or SC data model, is inefficient when data resides in the database. For example, if a transaction involved the purchase of three items, then the SC data model would represent the transaction in a table with three rows, each row identifying the transaction and one of the three items. There is a significant performance degradation with the use of the SC data model to generate association rules because the items purchased in a transaction are stored in a single item column, one item per row, rather than together in one row.
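The three-row representation described above can be sketched in Python as follows. The transaction identifiers, the item names, and the helper names (to_sc_rows, group_sc_rows) are hypothetical. The sketch only shows the shape of the SC data model and the regrouping step needed to recover a whole transaction from it; it makes no claim about how any particular database system stores or joins such rows.

```python
from collections import defaultdict

def to_sc_rows(baskets):
    """Flatten {transaction_id: items} into SC rows of (transaction_id, item)."""
    return [(tid, item)
            for tid, items in sorted(baskets.items())
            for item in sorted(items)]

def group_sc_rows(rows):
    """Reassemble whole transactions from SC rows.

    Every item of a transaction sits in a different row, so the rows
    must be grouped by transaction identifier before itemsets can be
    examined.
    """
    baskets = defaultdict(set)
    for tid, item in rows:
        baskets[tid].add(item)
    return dict(baskets)

# Hypothetical data: transaction 100 involved the purchase of three items.
baskets = {100: {"bread", "milk", "butter"}, 101: {"milk"}}

sc_rows = to_sc_rows(baskets)       # transaction 100 occupies three rows
restored = group_sc_rows(sc_rows)   # grouping recovers the original sets
```

Here transaction 100 occupies three rows of the table, one for each purchased item, and a grouping pass over all rows is required before the items of a single transaction can be considered together.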
There is a need in the art for an improved technique for mining association rules with an improved data model.