Analyzing the correlation among data sets in a database to discover a significant association rule among attributes is called "data mining."
The fact that a customer has purchased a commodity A or has a credit card can be considered as data with a 0-1 attribute which can be indicated by 1 or 0. The values 1 and 0 represent, respectively, whether or not the customer has purchased a commodity A (or whether or not the customer has a credit card in the case of the credit card example). Attempts have been made to determine a rule from the correlation based on the 0-1 attribute. For example, R. Agrawal, T. Imielinski and A. Swami, in "Mining association rules between sets of items in large databases," Proceedings of the ACM SIGMOD Conference on Management of data, May 1993, and R. Agrawal and R. Srikant, in "Fast algorithms for mining association rules," Proceedings of the 20th VLDB Conference, 1994, describe methods for determining an association rule indicating that "a ratio r of the customers who have purchased a commodity A have also purchased a commodity B."
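As a small illustration of the ratio r described above, the following sketch computes the fraction of customers with one 0-1 attribute who also have another. The function name `confidence`, the record layout, and the sample data are assumptions made for illustration only; the sketch does not reproduce the algorithms of the cited papers.

```python
def confidence(records, a, b):
    """Return the fraction of records with attribute a == 1 that also have b == 1.

    records is a list of dicts mapping attribute names to 0 or 1
    (a hypothetical layout chosen for this sketch).
    Returns None when no record has attribute a set.
    """
    with_a = [rec for rec in records if rec[a] == 1]
    if not with_a:
        return None
    return sum(1 for rec in with_a if rec[b] == 1) / len(with_a)


# Hypothetical purchase data: 1 = purchased, 0 = not purchased.
customers = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 0},
    {"A": 1, "B": 1},
    {"A": 0, "B": 1},
]

r = confidence(customers, "A", "B")
print(r)  # two of the three customers who purchased A also purchased B
```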
In a conventional relational database, the query language can be used to supply a numerical attribute A and an interval I and thereby easily determine X in, for example, the statement that "X% of the data whose value of A is included in I has a 0-1 attribute B." In this case, however, the interval I must be supplied as input. Current database systems do not have a function for outputting the interval I. This is because the search space of association rules between intervals of a numerical attribute and a 0-1 attribute is very large.
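The kind of query that is already easy, where the interval I is given and X is computed, might be sketched as follows. The function name `percent_with_attribute`, the (value, flag) row layout, and the sample rows are hypothetical; a real database would express the same computation in its own query language.

```python
def percent_with_attribute(rows, low, high):
    """Return X: the percentage of rows with low <= value <= high whose flag is 1.

    rows is a list of (value, flag) pairs, where value is the numerical
    attribute A and flag is the 0-1 attribute B (an assumed layout).
    Returns None when no row falls inside the interval.
    """
    flags_in_interval = [flag for value, flag in rows if low <= value <= high]
    if not flags_in_interval:
        return None
    return 100.0 * sum(flags_in_interval) / len(flags_in_interval)


# Hypothetical rows: (numerical attribute A, 0-1 attribute B).
rows = [(5, 0), (12, 1), (18, 1), (25, 0), (40, 1)]

# The interval I = [10, 30] must be supplied by the caller.
x = percent_with_attribute(rows, 10, 30)
print(x)
```

Note that the interval is an input here. Finding a suitable interval without being told one is the harder problem this section is concerned with.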
For example, given a database of bank customers, it is very useful to be able to determine an interval I that meets an association rule for a combination of a numerical attribute (e.g., an increase in the amount of a fixed deposit) and a 0-1 attribute (e.g., whether or not a credit card is used). The association rule may then be used for determining, for example, the percentage X of credit card users among those customers whose increase in the balance of a fixed deposit is included in the interval I. There are many intervals I that meet this association rule, depending on the minimum range of X or the interval I. However, if this association rule is modified to the rule that customers whose increase in the balance of a fixed deposit is included in the interval I, which includes T% or more of all the customers, are most likely to have a credit card, then the interval I can be substantially uniquely determined. The determination of the interval I is very useful because the largest class of customers who use a credit card can be determined, so that the number of direct mailings to be sent can be kept to a minimum. Thus, advertising costs are minimized.
The above inquiries are also applicable to databases with a large number of data sets, so it is essential to be able to process such a large number of data sets in a practical amount of time.
It is thus an objective of this invention to enable the determination of the correlation among data sets with a numerical attribute and a 0-1 attribute.
It is another objective of this invention to execute the above processing at a high speed.
It is yet another objective of this invention that if the rate of data sets with their numerical attribute z included in the interval I=[r1, r2] is defined as a support for the interval I, and the rate of data sets with their numerical attribute z included in the interval I which have a 0-1 attribute (a) of 1 is defined as a degree of confidence, then the interval I with both a maximum degree of confidence and a support of T or larger can be determined. This interval I is referred to as an optimized confidence rule.
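To make the definitions of support and degree of confidence concrete, the following is a minimal brute-force sketch that tries every candidate interval whose endpoints are observed values of z. It is not the method of this invention, and it takes quadratic time, so it does not meet the high-speed objective. The function name `optimized_confidence_interval`, the (z, a) pair layout, the treatment of T as a fraction, the tie-breaking order, and the sample data are all assumptions made for illustration.

```python
def optimized_confidence_interval(data, min_support):
    """Brute-force search for the interval [r1, r2] of maximum confidence.

    data: list of (z, a) pairs, z numerical and a in {0, 1} (assumed layout).
    min_support: the threshold T, expressed as a fraction of all data sets.
    Returns (confidence, r1, r2, support), or None if no interval qualifies.
    When several intervals tie, the first one found is kept.
    """
    if not data:
        return None
    pts = sorted(data)
    n = len(pts)

    # prefix[i] = number of pairs with a == 1 among the first i sorted pairs.
    prefix = [0]
    for _, a in pts:
        prefix.append(prefix[-1] + a)

    best = None
    for i in range(n):
        # A closed interval must contain every pair sharing its endpoint
        # value, so start only at the first pair with a given z.
        if i > 0 and pts[i - 1][0] == pts[i][0]:
            continue
        for j in range(i, n):
            # Likewise, end only at the last pair with a given z.
            if j < n - 1 and pts[j + 1][0] == pts[j][0]:
                continue
            count = j - i + 1
            support = count / n
            if support < min_support:
                continue
            conf = (prefix[j + 1] - prefix[i]) / count
            if best is None or conf > best[0]:
                best = (conf, pts[i][0], pts[j][0], support)
    return best


# Hypothetical data: (numerical attribute z, 0-1 attribute a).
sample = [(1, 0), (2, 1), (3, 1), (4, 1), (5, 1), (6, 0), (7, 0), (8, 1)]
result = optimized_confidence_interval(sample, 0.5)
print(result)
```

With this sample and T = 0.5, an interval must contain at least four of the eight data sets, and only the interval [2, 5] has every data set with a = 1.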