The present invention relates to mining of generalized disjunctive association rules. It relates generally to data processing, and more particularly to “computer database mining” in which association rules are discovered. In particular, this invention introduces the concept of a disjunctive association rule, a generalized disjunctive association rule and provides an efficient way to compute them.
Let I = {i1, i2, . . . , im} be a set of literals, called items. Let D be a set of transactions where each transaction t is a subset of the set of items I. We say that a transaction t contains X (X ⊆ I) if X ⊆ t. We use T(X) to denote the set of all transactions that contain X. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set if s% of transactions in D contain X ∪ Y. Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (minsupp) and minimum confidence (minconf) respectively [1,2,3]. In what follows, we use ‘item’ and ‘attribute’ interchangeably.
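The support and confidence definitions above can be made concrete with a minimal Python sketch. The function names (covering, support, confidence) and the toy transaction set are illustrative assumptions and do not come from the cited works; returning a confidence of 0 when no transaction contains X is likewise an assumption.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]

def covering(transactions: List[Itemset], x: Itemset) -> List[Itemset]:
    """T(X): the transactions that contain every item of X."""
    return [t for t in transactions if x <= t]

def support(transactions: List[Itemset], x: Itemset, y: Itemset) -> float:
    """Percentage of all transactions that contain X united with Y."""
    return 100.0 * len(covering(transactions, x | y)) / len(transactions)

def confidence(transactions: List[Itemset], x: Itemset, y: Itemset) -> float:
    """Percentage of the transactions containing X that also contain Y."""
    with_x = covering(transactions, x)
    if not with_x:
        return 0.0  # assumption: an antecedent that never occurs gets confidence 0
    return 100.0 * len(covering(with_x, y)) / len(with_x)

# Toy transaction set D (made-up data, for illustration only).
D = [frozenset(t) for t in (
    {"jacket", "shoes"},
    {"jacket", "shoes", "hat"},
    {"jacket"},
    {"hat"},
)]
X, Y = frozenset({"jacket"}), frozenset({"shoes"})

print(support(D, X, Y))               # 50.0
print(round(confidence(D, X, Y), 1))  # 66.7
```

In this toy set, two of the four transactions contain both jacket and shoes (support 50%), and two of the three transactions that contain jacket also contain shoes (confidence about 66.7%).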
Mining algorithms have received considerable research attention. In one approach [2] the authors take into account the taxonomy (is-a hierarchy) on the items, and find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule that “people who buy outerwear tend to buy shoes”. This rule may hold even if rules that “people who buy jackets tend to buy shoes”, and “people who buy clothes tend to buy shoes” do not hold. Users are often interested only in a subset of association rules. For example, they may only want rules that contain a specific item or rules that contain children of a specific item in a hierarchy. In [3], the authors consider the problem of integrating constraints that are boolean expressions over the presence or absence of items into the association discovery algorithm.
Instead of applying these constraints as a post-processing step, the authors integrate them into the mining algorithm, which reduces the execution time.
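The taxonomy example above (a rule holding at the outerwear level but not at the jackets or clothes level) can be reproduced with a small Python sketch. It assumes one common way of handling taxonomies, namely extending each transaction with the ancestors of its items; the PARENT table, the toy data and the thresholds are all made up for illustration and are not taken from [2].

```python
# Hypothetical is-a hierarchy: child -> parent.
PARENT = {
    "jacket": "outerwear",
    "ski_pants": "outerwear",
    "outerwear": "clothes",
    "shirt": "clothes",
}

def with_ancestors(transaction):
    """Return the transaction extended with every ancestor of its items."""
    extended = set(transaction)
    for item in transaction:
        while item in PARENT:
            item = PARENT[item]
            extended.add(item)
    return frozenset(extended)

def rule_stats(transactions, x, y):
    """(support %, confidence %) of the single-item rule x => y."""
    with_x = [t for t in transactions if x in t]
    with_both = [t for t in with_x if y in t]
    support = 100.0 * len(with_both) / len(transactions)
    confidence = 100.0 * len(with_both) / len(with_x) if with_x else 0.0
    return support, confidence

RAW = [
    {"jacket", "shoes"}, {"ski_pants", "shoes"}, {"jacket"},
    {"shirt"}, {"shirt"}, {"shirt"},
]
D = [with_ancestors(t) for t in RAW]
MINSUPP, MINCONF = 30.0, 60.0  # illustrative thresholds

def holds(x, y):
    s, c = rule_stats(D, x, y)
    return s > MINSUPP and c > MINCONF

print(holds("outerwear", "shoes"))  # True
print(holds("jacket", "shoes"))     # False
print(holds("clothes", "shoes"))    # False
```

With these numbers the outerwear rule passes both thresholds, the jacket rule fails on support, and the clothes rule fails on confidence.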
So far, knowledge discovery in data mining has focussed on association rules with conjuncts (A∧B→X∧Y) only. Specifically, traditional association rules cannot capture contextual inter-relationships among attributes.
U.S. Pat. Nos. 5,794,209 and 5,615,341 describe a system and method for discovering association rules by comparing the ratio of the number of times each itemset appears in a dataset to the number of times particular subsets of the itemset appear in the database, in relation to a predetermined minimum confidence value. The specified system and method, however, are limited in the use of operators for defining the association rules. Logical completeness of association rule discovery requires a functionally complete set of operators, such as [∧ (and), ∨ (or), ¬ (not)] or a set that includes ⊕ (xor).
Furthermore, the method does not utilize contextual information to define the association rules and is therefore limited in the effectiveness of the result. U.S. Pat. No. 5,615,341 is further limited by the use of hierarchical taxonomies in the determination of the association rules.
The object of this invention is to provide a system and method for mining a new kind of rules called disjunctive association rules for analyzing data and discovering new kind of relationships between data items.
Another object of the present invention is to incorporate the ∧ (and), ∨ (or) and ¬ (not) operators, as well as the ⊕ (xor) operator, in the discovery of the disjunctive association rules.
To achieve the said objective, this invention provides a method for mining data, characterized in that it generates generalized disjunctive association rules to capture the relationships between data items with reference to a given context to provide improved data analysis independently of taxonomies, comprising the steps of:
generating a list of all possible data items that can influence said context,
discovering association rules for data items in said list that co-occur based on a defined overlap threshold within said context,
clustering said data items to form a set of generalized disjunctive rules based on a defined confidence (and/or support) threshold, and
iterating the above steps until all items in said list are covered.
The said list is generated by selecting those data items that have a significant overlap with said context.
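One possible reading of this selection step is sketched below in Python. The overlap measure used here (the fraction of context transactions that also contain the item) is an assumption, since the text does not fix a formula; the function name candidate_items, the threshold value and the toy data are likewise hypothetical.

```python
def candidate_items(transactions, context, min_overlap):
    """Items whose share of the context transactions is at least min_overlap."""
    in_context = [t for t in transactions if context <= t]
    if not in_context:
        return []
    counts = {}
    for t in in_context:
        for item in t - context:  # the context items themselves are not candidates
            counts[item] = counts.get(item, 0) + 1
    n = len(in_context)
    return sorted(item for item, c in counts.items() if c / n >= min_overlap)

# Made-up transactions; the context is "the customer bought a camera".
D = [frozenset(t) for t in (
    {"camera", "tripod"},
    {"camera", "memory_card"},
    {"camera", "tripod", "bag"},
    {"camera", "memory_card"},
    {"camera", "umbrella"},
    {"tripod", "umbrella"},
)]
context = frozenset({"camera"})

print(candidate_items(D, context, min_overlap=0.3))  # ['memory_card', 'tripod']
```

Five transactions fall within the context; tripod and memory_card each appear in two of them (0.4), while bag and umbrella appear in one (0.2) and are dropped.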
The said association rules are discovered by merging data items that overlap above said defined threshold within said context and confirmation that the strength of the relation is beyond a defined minimum support value.
The said clustering is agglomerative.
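As an illustration of agglomerative clustering in this setting, the Python sketch below starts with one cluster per candidate item and repeatedly merges the two clusters whose context transactions overlap most, stopping when the best overlap falls below a threshold. The Jaccard measure, the stopping rule, and reading each resulting cluster as the disjunction of its items are all assumptions made for this sketch and are not a statement of the claimed procedure.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two sets of transaction ids (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(item_tids, min_overlap):
    """item_tids maps an item to the ids of context transactions containing it.

    Returns a dict: cluster (frozenset of items) -> ids of context
    transactions that contain at least one item of the cluster.
    """
    clusters = {frozenset({i}): set(tids) for i, tids in item_tids.items()}
    while len(clusters) > 1:
        # Score every pair of current clusters and pick the best one.
        (a, b), best = max(
            ((pair, jaccard(clusters[pair[0]], clusters[pair[1]]))
             for pair in combinations(sorted(clusters, key=sorted), 2)),
            key=lambda scored: scored[1],
        )
        if best < min_overlap:
            break
        merged = clusters.pop(a) | clusters.pop(b)
        clusters[a | b] = merged
    return clusters

# Made-up data: six context transactions with ids 0..5.
N_CONTEXT = 6
ITEM_TIDS = {
    "tripod": {0, 1, 2},
    "lens": {0, 1, 2, 3},
    "bag": {4},
    "strap": {4, 5},
}

clusters = agglomerate(ITEM_TIDS, min_overlap=0.5)
# Confidence of "context => (item1 OR item2 OR ...)" for each cluster.
rules = {c: len(tids) / N_CONTEXT for c, tids in clusters.items()}
for cluster in sorted(rules, key=sorted):
    print(sorted(cluster), round(rules[cluster], 2))
```

On this data the sketch ends with two disjuncts, (lens ∨ tripod) covering four of the six context transactions and (bag ∨ strap) covering the remaining two.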
The discovery of said association rules uses a functionally complete set of operators including “AND”, “OR”, “NOT” and “EXCLUSIVE-OR”.
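To illustrate how these operators relate, the short Python sketch below checks by truth table that EXCLUSIVE-OR can be expressed with AND, OR and NOT, and then evaluates a few operator expressions over one transaction. The helper names and the sample transaction are hypothetical.

```python
from itertools import product

def xor_via_and_or_not(a: bool, b: bool) -> bool:
    """a XOR b written only with AND, OR and NOT."""
    return (a or b) and not (a and b)

# Exhaustive truth-table check against Python's inequality on booleans.
truth_table_ok = all(
    xor_via_and_or_not(a, b) == (a != b)
    for a, b in product((False, True), repeat=2)
)
print(truth_table_ok)  # True

def has(transaction, item):
    """True if the transaction contains the item."""
    return item in transaction

t = {"tea", "sugar"}  # made-up transaction
either_drink = has(t, "tea") or has(t, "coffee")                      # OR
exactly_one_drink = xor_via_and_or_not(has(t, "tea"), has(t, "coffee"))  # XOR
tea_without_milk = has(t, "tea") and not has(t, "milk")               # AND, NOT
print(either_drink, exactly_one_drink, tea_without_milk)  # True True True
```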
The above method is applied to clustering of query results in a search engine where the query is the context, a word is mapped to an item, a document to a transaction, the recall is the confidence, and the resulting disjuncts are the labels of the clusters of documents.
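The mapping described above can be sketched in Python as follows: each document becomes a transaction of words, the query acts as the context, and the recall of a disjunctive label is taken to be the fraction of query results containing at least one of its words. The whitespace tokenization, the sample documents and the function names are assumptions made for illustration.

```python
def to_transaction(document: str) -> frozenset:
    """Map a document to a transaction: the set of its lower-cased words."""
    return frozenset(document.lower().split())

def recall(results, label):
    """Fraction of query results that contain at least one word of the label."""
    if not results:
        return 0.0
    return sum(1 for t in results if label & t) / len(results)

# Made-up document collection.
DOCS = [
    "jaguar car dealer",
    "jaguar cat habitat",
    "jaguar car review",
    "jaguar speed cat",
    "python snake habitat",
]
query = frozenset({"jaguar"})  # the context
transactions = [to_transaction(d) for d in DOCS]
results = [t for t in transactions if query <= t]

print(len(results))                               # 4
print(recall(results, frozenset({"car"})))        # 0.5
print(recall(results, frozenset({"car", "cat"})))  # 1.0
```

Here the single word "car" labels half of the results, while the disjunction (car ∨ cat) covers all four documents returned for the query.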
The said method is extended to interactive query refinement.
The above method is applied to customer targeting by determining generalized disjunctive association rules on data such as customer purchase history, customer segments, product information and the like.
The above method is further used for making recommendations to customers where the customer's purchase history is the context and the generalized disjunctive association rules provide the recommendations.
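A minimal Python sketch of this use is given below, assuming the rules have already been mined and are available as pairs of an antecedent itemset and a tuple of disjuncts. The rule representation, the recommend function and the sample rules are hypothetical and serve only to show how a purchase history could act as the context.

```python
def recommend(history, rules):
    """Suggest disjuncts of every rule whose antecedent is in the history.

    Items the customer already bought are skipped; order follows the rules.
    """
    history = frozenset(history)
    suggestions = []
    for antecedent, disjuncts in rules:
        if antecedent <= history:
            for item in disjuncts:
                if item not in history and item not in suggestions:
                    suggestions.append(item)
    return suggestions

# Made-up rules of the form: antecedent => (d1 OR d2 OR ...).
RULES = [
    (frozenset({"camera"}), ("tripod", "memory_card")),
    (frozenset({"camera", "tripod"}), ("camera_bag",)),
    (frozenset({"laptop"}), ("mouse",)),
]

print(recommend({"camera", "tripod"}, RULES))  # ['memory_card', 'camera_bag']
```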
The above method is applied to gene analysis by finding the generalized disjunctive association rules from gene databases.
The instant method is applied to cause-and-effect analysis in applications such as medical analysis, market survey analysis and census analysis, by finding generalized disjunctive association rules from the database of causes and effects.
The method is applied to fraud detection by finding generalized disjunctive association rules from transaction databases.
The present invention further relates to a system for mining data characterized in that it generates generalized disjunctive association rules to capture the relationships between data items with reference to a given context to provide improved data analysis independently of taxonomies, comprising:
means for generating a list of all possible data items that can influence said context,
means for discovering association rules for data items in said list that co-occur based on a defined overlap threshold within said context,
means for clustering said data items to form a set of generalized disjunctive rules based on a defined confidence (and/or support) threshold, and
means for iterating the above steps until all items in said list are covered.
The said list is generated by means for selecting those data items that have a significant overlap with said context.
The said association rules are discovered by means for merging data items that overlap above said defined threshold within said context and confirmation that the strength of the relation is beyond a defined minimum support value.
The said clustering is agglomerative.
The discovery of said association rules uses a functionally complete set of operators including “AND”, “OR”, “NOT” and “EXCLUSIVE-OR”.
The above system is used for clustering of query results in a search engine where the query is the context, a word is mapped to an item, a document to a transaction, the recall is the confidence, and the resulting disjuncts are the labels of the clusters of documents.
The system is extended to interactive query refinement.
The said system is used for customer targeting by means for determining generalized disjunctive association rules on data such as customer purchase history, customer segments, product information and the like.
The said system is used for making recommendations to customers where the customer's purchase history is the context and the generalized disjunctive association rules provide the recommendations.
The above system is further used for gene analysis by means for finding the generalized disjunctive association rules from gene databases.
The above system is also used for cause-and-effect analysis in applications such as medical analysis, market survey analysis and census analysis, by means for finding generalized disjunctive association rules from the database of causes and effects.
The system is used for fraud detection by means for finding generalized disjunctive association rules from transaction databases.
The instant invention further provides a computer program product comprising computer readable program code stored on a computer readable storage medium embodied therein for mining data, characterized in that it generates generalized disjunctive association rules to capture the relationships between data items with reference to a given context to provide improved data analysis independently of taxonomies, comprising:
computer readable program code means configured for generating a list of all possible data items that can influence said context,
computer readable program code means configured for discovering association rules for data items in said list that co-occur based on a defined overlap threshold within said context,
computer readable program code means configured for clustering said data items to form a set of generalized disjunctive rules based on a defined confidence (and/or support) threshold, and
computer readable program code means configured for iterating the above steps until all items in said list are covered.
The said list is generated by computer readable program code means configured for selecting those data items that have a significant overlap with said context.
The said association rules are discovered by computer readable program code means configured for merging data items that overlap above said defined threshold using traditional association rules within said context and confirmation that the strength of the relation is beyond a defined minimum support value.
The said clustering is agglomerative.
The discovery of said association rules uses a functionally complete set of operators including “AND”, “OR”, “NOT” and “EXCLUSIVE-OR”.
The above computer program product is configured for clustering of query results in a search engine where the query is the context, a word is mapped to an item, a document to a transaction, the recall is the confidence, and the resulting disjuncts are the labels of the clusters of documents.
The said computer program product is extended to interactive query refinement.
The above computer program product is configured for customer targeting by computer readable program code means configured for determining generalized disjunctive association rules on data such as customer purchase history, customer segments, product information and the like.
The instant computer program product is configured for making recommendations to customers where the customer's purchase history is the context and the generalized disjunctive association rules provide the recommendations.
The computer program product is configured for gene analysis by computer readable program code means configured for finding the generalized disjunctive association rules from gene databases.
The above computer program product is configured for cause-and-effect analysis in applications such as medical analysis, market survey analysis and census analysis, by computer readable program code means configured for finding generalized disjunctive association rules from the database of causes and effects.
The instant computer program product is configured for fraud detection by computer readable program code means configured for finding generalized disjunctive association rules from transaction databases.