Various applications exist in which a plurality of aspects of a process are measured and a determination is made as to whether the result of the process is desirable. Such applications may include fault diagnosis of manufacturing lines for any product; analysis of credit risk for banks, mortgage companies, and credit card companies; analysis of potential insurance fraud; and analysis of bank accounts for illicit activity such as money laundering, illegal international transfers, etc. In such applications, it is useful to infer conditions on the measurements (or on other parameters of the application) that separate the desirable results from the undesirable ones. Problems of this kind are called diagnostic inference problems. The ability to perform this inference often leads to corrective actions that increase the probability of obtaining a desirable result.
In numerous applications, relational data bases record information about a domain. A relational data base, as known to those skilled in the art, is a data base which stores all its data within tables. All operations on data are conducted on the tables themselves or, alternatively, produce a resulting table. Each such table is a set of rows and columns, as described in J. D. Ullman, Principles of Data Base and Knowledge Base Systems, Computer Science Press, 1989.
In the relational data bases associated with such applications, the data base tuples, which are the rows of the data base, are used to record information about a particular entity, while the columns of the data base represent the attributes, which are specific parameters of the analyzed entity. In order to separate the rows with desirable results from the rows with undesirable results, specific association rules are established and applied to the relational data base of the application domain. The association rules are rules presentable in the form C1→C2, where C1 and C2 are conditions that are used to determine whether the entity is desirable, and where the condition C2 is not necessarily fixed.
The problem of association rule mining was first introduced in R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, In Proc. of ACM SIGMOD, 1993, pp. 207-216. The work was limited to non-numeric data, and it found all association rules that satisfy specified criteria, such as lower bounds for support and confidence.
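By way of illustration only, the support and confidence of a rule C1→C2 over a table of non-numeric tuples may be sketched as follows. The sketch assumes the conventional definitions (support as the fraction of rows satisfying both C1 and C2, confidence as the fraction of rows satisfying C1 that also satisfy C2). The attribute names, the sample rows, and the function names `support_and_confidence` and `passes_bounds` are hypothetical and are not taken from the cited work.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Condition = Callable[[Row], bool]


def support_and_confidence(rows: List[Row], c1: Condition, c2: Condition) -> Tuple[float, float]:
    """Return (support, confidence) of the rule c1 -> c2 over the rows."""
    n_rows = len(rows)
    n_c1 = sum(1 for r in rows if c1(r))
    n_both = sum(1 for r in rows if c1(r) and c2(r))
    support = n_both / n_rows if n_rows else 0.0
    confidence = n_both / n_c1 if n_c1 else 0.0
    return support, confidence


def passes_bounds(rows: List[Row], c1: Condition, c2: Condition,
                  min_support: float, min_confidence: float) -> bool:
    """Keep a rule only if it meets both lower bounds."""
    support, confidence = support_and_confidence(rows, c1, c2)
    return support >= min_support and confidence >= min_confidence


# Hypothetical credit-risk tuples: each row is an entity, each key an attribute.
rows = [
    {"late_payments": "yes", "owns_home": "no", "default": "yes"},
    {"late_payments": "yes", "owns_home": "yes", "default": "yes"},
    {"late_payments": "yes", "owns_home": "no", "default": "no"},
    {"late_payments": "no", "owns_home": "yes", "default": "no"},
]


def c1(row: Row) -> bool:
    """Condition C1: late_payments = yes."""
    return row["late_payments"] == "yes"


def c2(row: Row) -> bool:
    """Condition C2: default = yes."""
    return row["default"] == "yes"


support, confidence = support_and_confidence(rows, c1, c2)
print(support, confidence)
```

In this sample, two of the four rows satisfy both conditions and three rows satisfy C1, so the rule is retained under lower bounds of, for example, 0.4 for support and 0.6 for confidence.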
The body of work on association rules is extensive, and many aspects of the issue have been developed over the years. For example, in R. Srikant and R. Agrawal, “Mining Quantitative Association Rules in Large Relational Tables”, In Proc. of ACM SIGMOD, pp. 1-12, 1996, a framework was introduced which was designed to find association rules in data sets that include numeric attributes. The authors present concepts of k-completeness and interest, which are used to reduce the number of rules that need to be considered explicitly and to eliminate redundant rules.
The primary weakness of the framework, however, is that, as with R. Agrawal, et al. (supra), the framework relies solely on the support and confidence lower bounds to determine which rules to select. Relying solely on support and confidence lower bounds poses two problems: an excessive number of rules may be returned for analysis if the bounds are set too low, and a rule of interest may not be returned if the bounds are set too high.
A further critique of frameworks that rely solely on support and confidence lower bounds has been presented in S. Brin, R. Motwani, and C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, In Proc. of ACM SIGMOD, pp. 265-276, 1997, as well as in R. Srikant and R. Agrawal, “Mining Quantitative Association Rules in Large Relational Tables”, In Proc. of ACM SIGMOD, pp. 1-12, 1996. However, these papers do not address the issue of the simplicity of rules.
The paper of R. J. Bayardo, R. Agrawal, D. Gunopulos, “Constraint-based Rule Mining in Large, Dense Databases”, In Proc. of ICDE, pp. 188-197, 1999, presents a framework that addresses the issue of rule simplicity. The paper proposes a notion of a rule improvement constraint, in which a more complicated rule is not returned if its improvement over a simpler rule is small. However, the framework applies only to non-numeric data, and again relies heavily on support and confidence lower bounds.
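By way of illustration only, one possible reading of such an improvement constraint may be sketched as follows. The sketch assumes that the improvement of a rule is its confidence minus the highest confidence of any simpler rule formed from a proper subset of its antecedent with the same consequent. That measure, the function names `confidence` and `improvement`, and the sample rows are assumptions made for illustration and are not asserted to be the exact formulation of the cited paper.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]
Item = Tuple[str, str]  # (attribute, value) pair


def confidence(rows: List[Row], antecedent: Iterable[Item], consequent: Item) -> float:
    """Fraction of rows matching the antecedent that also match the consequent."""
    items = list(antecedent)
    matching = [r for r in rows if all(r.get(a) == v for a, v in items)]
    if not matching:
        return 0.0
    attr, value = consequent
    return sum(1 for r in matching if r.get(attr) == value) / len(matching)


def improvement(rows: List[Row], antecedent: Iterable[Item], consequent: Item) -> float:
    """Confidence of the rule minus the best confidence among its proper sub-rules."""
    items = sorted(antecedent)
    if not items:
        raise ValueError("antecedent must contain at least one item")
    full = confidence(rows, items, consequent)
    # Proper subsets include the empty antecedent, which matches every row.
    best_simpler = max(
        confidence(rows, subset, consequent)
        for size in range(len(items))
        for subset in combinations(items, size)
    )
    return full - best_simpler


# Hypothetical tuples, for illustration only.
rows = [
    {"late_payments": "yes", "owns_home": "no", "default": "yes"},
    {"late_payments": "yes", "owns_home": "yes", "default": "yes"},
    {"late_payments": "yes", "owns_home": "no", "default": "no"},
    {"late_payments": "no", "owns_home": "yes", "default": "no"},
]

complex_rule = [("late_payments", "yes"), ("owns_home", "no")]
simple_rule = [("late_payments", "yes")]
target = ("default", "yes")

print(improvement(rows, complex_rule, target))  # negative: the extra condition does not help
print(improvement(rows, simple_rule, target))   # positive: better than the empty antecedent
```

Under a minimum improvement threshold of zero, the two-condition rule in this sample would be suppressed in favor of the simpler one-condition rule.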
Two other frameworks of note, Y. Aumann and Y. Lindell, “A Statistical Theory for Quantitative Association Rules”, In Proc. of ACM SIGKDD, pp. 261-270, 1999, and R. J. Miller and Y. Yang, “Association Rules Over Interval Data”, In Proc. of ACM SIGMOD, pp. 452-461, 1997, address finding association rules for numeric data. Neither framework uses the traditional definitions of support and confidence of rules, but both frameworks rely heavily on constraints to determine which rules to return.
Another approach to association rule mining is to find those rules that are optimal or near optimal according to some criteria. Representative papers in this area include S. Brin, S. R. Rastogi, and K. Shim, “Mining Optimized Gain Rules for Numeric Attributes”, In Proc. of ACM SIGKDD, pp. 135-144, 1999; T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization”, In Proc. of ACM SIGMOD, pp. 13-23, 1996; and R. Rastogi, K. Shim, “Mining Optimized Support Rules for Numeric Attributes”, In Proc. of ICDE, pp. 126-135, 1999. These papers study ways to efficiently find optimal association rules according to measures such as gain, support, and confidence in certain restricted settings.
Another paper dealing with optimal association rule mining presents a partial ordering for association rules based on support and confidence. This framework is disadvantageous, however, in that it fails to consider the simplicity of conditions and fails to remove redundant rules. In addition, the framework is limited to non-numeric attributes.
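By way of illustration only, a partial ordering of rules based on support and confidence may be sketched as follows. The sketch assumes a dominance relation in which one rule is ranked above another when it is at least as good on both measures and strictly better on at least one; rules that are not dominated by any other rule form the optimal set. The precise ordering used in the paper referred to above is not reproduced here, and the names `ScoredRule`, `dominates`, and `frontier`, as well as the sample values, are hypothetical.

```python
from typing import List, NamedTuple


class ScoredRule(NamedTuple):
    name: str
    support: float
    confidence: float


def dominates(a: ScoredRule, b: ScoredRule) -> bool:
    """True if a is at least as good as b on both measures and better on one."""
    at_least_as_good = a.support >= b.support and a.confidence >= b.confidence
    strictly_better = a.support > b.support or a.confidence > b.confidence
    return at_least_as_good and strictly_better


def frontier(rules: List[ScoredRule]) -> List[ScoredRule]:
    """Return the rules that no other rule dominates, in input order."""
    return [r for r in rules if not any(dominates(other, r) for other in rules)]


# Hypothetical scored rules, for illustration only.
rules = [
    ScoredRule("A", support=0.5, confidence=0.6),
    ScoredRule("B", support=0.3, confidence=0.9),
    ScoredRule("C", support=0.3, confidence=0.5),
]

print([r.name for r in frontier(rules)])
```

Rules A and B are incomparable under this ordering, since each is better on one measure, whereas rule C is dominated by rule A. Note that such an ordering takes no account of how many conditions a rule contains.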
It would therefore be highly desirable to have a technique for optimal association rule mining which is applicable to both numeric and non-numeric attributes, which considers the simplicity of conditions in addition to support and confidence, and which optimizes efficiency by removing redundant rules. It would also be highly desirable for this technique to mine not just one rule at a time, but a set of k rules for some number k.