1. Field of the Invention
The present invention relates generally to the field of data mining, and more particularly, a novel data mining system and depth first search methodology for generating associations among items in a large database.
2. Discussion of the Prior Art
The problem of finding association rules was introduced in a reference entitled xe2x80x9cMining Association Rules Between Sets of Items in Very Large Databases,xe2x80x9d Proceedings of the ACM SIGMOD Conference on Management of Data, pages 207-216, 1993 authored by Agrawal R., Imielinski T., and Swami A. The problem identified in the reference was directed to finding the relationships between different items in a large database, e.g., a database containing customer transactions. Such information can be used for many sales purposes such as target marketing, because the buying patterns of consumers can be inferred from one another.
As described in the above-mentioned reference, there is first identified a set {I} comprising all items in the database of transactions. A transaction {T} which is a subset of {I} is defined to be a set of items which are bought together in one operation. An association rule between a set of items {X} which is a subset of {I} and another set {Y} which is also a subset of {I} is expressed as {X}xe2x86x92{Y}, and indicates that the presence of the items X in the transaction also indicates a strong possibility of the presence of the set of items Y. The measures used to indicate the strength of an association rule are support and confidence. The support of the rule Xxe2x86x92Y is the fraction of the transactions containing both X and Y. The confidence of the rule Xxe2x86x92Y is the fraction of the transactions containing X which also contain Y. In the association rule problem, it is desired to find all rules above a minimum level of support and confidence. The primary concept behind most association rule algorithms is a two phase procedure: In the first phase, all frequent itemsets (or large itemsets) are found. An itemset is xe2x80x9cfrequentxe2x80x9d or large if it satisfies a user-defined minimum support requirement. The second phase uses these frequent itemsets in order to generate all the rules which satisfy the user specified minimum confidence.
Since its initial formulation, considerable research effort has been devoted to the association rule problem. A number of algorithms for large itemset generation have been proposed, such as those found in Agrawal R., Mannila H., Srikant R., Toivonen H., and Verkamo A. I. xe2x80x9cFast Discovery of Association Rules.xe2x80x9d Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Chapter 12, pages 307-328. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, 1994. and Brin S., Motwani R. Ullman J. D. and Tsur S., xe2x80x9cDynamic Itemset Counting and Implication Rules for Market Basket Data.xe2x80x9d Proceedings of the ACM SIGMOD, 1997. pages 255-264. Variations of association rules such as generalized association rules, quantitative association rules and multilevel association rules have been studied in Srikant R., and Agrawal R., xe2x80x9cMining Generalized Association Rules.xe2x80x9d Proceedings of the 21st International Conference on Very Large Data Bases, 1995, pages 407-419, and, Srikant R., and Agrawal R. xe2x80x9cMining Quantitative Association Rules in Large Relational Tables,xe2x80x9d Proceedings of the ACM SIGMOD Conference on Management of Data, 1996, pages 1-12.
Although there are many previously proposed methods and systems, there is no efficient method which can generate large itemsets for very large scale problems. For these problems, current techniques require too much time to be of any practical use. The importance of solving such large scale problems is quite great, given the fact that most databases containing customer transaction data are quite large.
The present invention is directed to a computer-implemented system and method for mining long patterns in databases, which relies on a depth-first search technique in order to construct a lexicographic tree of itemsets. The depth first traversal of a tree is defined to be the process of traversing the nodes in the tree in such a way that after examining a node, all of that nodes descendants are examined before examining any other node. In depth-first strategy, the nodes of the lexicographic tree are examined in depth-first order in order to count the supports of their candidate extensions. In other words, all descendant itemsets of a node are examined before examining other nodes of the lexicographic tree. By projecting transactions upon the lexicographic tree structure in a depth-first manner, the CPU time for counting large itemsets is substantially reduced.
Novel and innovative counting methods are implemented including byte-wise counting and postprocessing for higher level nodes in the lexicographic tree structure; and, a two phase bucketing procedure, for counting support at lower level nodes. Use of these counting methods efficiently optimizes processing time and database storage utilization.