Analyzing the correlation among data sets in a database to discover a significant association rule among attributes is called "data mining".
The fact that a customer has purchased a commodity A or has a credit card can be treated as data with a 0-1 attribute. Such an attribute takes the value 1 or 0 according to whether or not the customer has purchased the commodity A or has a credit card. Attempts have been made to determine rules from the correlation among such 0-1 attributes. For example, R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases", Proceedings of the ACM SIGMOD Conference on Management of Data, May 1993, and R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proceedings of the 20th VLDB Conference, 1994, describe methods for determining an association rule stating that "a ratio r of the customers who have purchased a commodity A have also purchased a commodity B."
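As an illustration of what such a rule measures, the following minimal sketch computes the ratio r for one given pair of 0-1 attributes. It only evaluates a single rule and is not the mining algorithm of the cited papers; the function name `rule_ratio` and the sample records are hypothetical.

```python
def rule_ratio(customers, a, b):
    """Return the ratio r of customers with attribute a == 1 that also have b == 1.

    `customers` is a list of dicts mapping attribute names to 0 or 1.
    Returns None when no customer has attribute a, since r is then undefined.
    """
    with_a = [c for c in customers if c[a] == 1]
    if not with_a:
        return None
    return sum(c[b] for c in with_a) / len(with_a)


# Hypothetical sample data: four customers, two 0-1 attributes "A" and "B".
customers = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 0},
    {"A": 1, "B": 1},
    {"A": 0, "B": 1},
]

r = rule_ratio(customers, "A", "B")  # two of the three buyers of A also bought B
```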
In conventional relational databases, the query language can be used to supply a numerical attribute A and an interval I, and thereby easily answer a question such as "X % of the data whose value of A is included in I has a 0-1 attribute B". In this case, however, the interval I must be given as input. Current databases do not have a function for outputting the interval I. This is because the search space of association rules between a numerical attribute and the set of intervals over it is very large.
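A query of this kind, with the interval supplied by the user, might look like the following sketch, which uses Python's standard `sqlite3` module. The table name `customers`, the columns `deposit_increase` and `has_card`, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# In-memory database with a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (deposit_increase REAL, has_card INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(100.0, 0), (250.0, 1), (300.0, 1), (450.0, 0), (900.0, 1)],
)


def percent_with_attribute(conn, low, high):
    """Return X, the percentage of rows in [low, high] whose has_card is 1.

    The interval must be passed in by the caller; the query cannot find it.
    Returns None when no row falls inside the interval.
    """
    total, with_card = conn.execute(
        "SELECT COUNT(*), SUM(has_card) FROM customers "
        "WHERE deposit_increase BETWEEN ? AND ?",
        (low, high),
    ).fetchone()
    if total == 0:
        return None
    return 100.0 * with_card / total


x = percent_with_attribute(conn, 200.0, 500.0)  # rows 250, 300 and 450 qualify
```

The query answers the question only for the interval it is given, which is the limitation described above.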
For example, given a database of bank customers, it is very useful to be able to determine an interval I that meets an association rule for a combination of a numerical attribute (an increase in the amount of a fixed deposit) and a 0-1 attribute (whether or not a credit card is used). The association rule may then be used to determine the percentage of customers, among those whose increase in the balance of a fixed deposit is included in the interval I, who also use a credit card. Although there are many such intervals I, the interval I containing the largest number of customers can generally be uniquely determined. The determination of the interval I also allows information useful to other operations to be acquired.
Such questions, however, are posed against databases with a large number of data sets, so it is essential to be able to process them in a practical amount of time.
It is thus an objective of this invention to provide a method for determining the correlation among data sets with a numerical attribute and a 0-1 attribute.
It is another objective of this invention to execute the above processing at a high speed.
It is yet another objective of this invention that, if the rate of data sets whose numerical attribute z is included in the interval I=[r1, r2] is defined as the support for the interval I, and the rate, among those data sets whose numerical attribute is included in the interval I, of data sets whose 0-1 attribute (a) is 1 is defined as the confidence, the interval I having both a confidence of α % or more and the maximum support (which is referred to as a "dual association rule") can be determined.
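The definitions of support and confidence can be made concrete with the following sketch, which searches every candidate interval exhaustively. This is a naive quadratic-time illustration of the problem statement only, not the high-speed method that is the objective of this invention; the function name `max_support_interval` and the sample records are hypothetical.

```python
def max_support_interval(records, alpha):
    """Find the interval [r1, r2] with confidence >= alpha (%) and maximum support.

    `records` is a list of (z, a) pairs, z numerical and a in {0, 1}.
    Returns (r1, r2, support, confidence), with support as a fraction of all
    records and confidence in percent, or None if no interval qualifies.
    """
    data = sorted(records)
    n = len(data)
    # prefix[i] = number of records with a == 1 among the first i sorted records
    prefix = [0]
    for _, a in data:
        prefix.append(prefix[-1] + a)

    best = None  # (count, r1, r2, confidence)
    for i in range(n):
        # A closed interval must contain every record sharing an endpoint value,
        # so start only at the first record of a run of equal z values ...
        if i > 0 and data[i - 1][0] == data[i][0]:
            continue
        for j in range(i, n):
            # ... and end only at the last record of such a run.
            if j + 1 < n and data[j + 1][0] == data[j][0]:
                continue
            count = j - i + 1
            hits = prefix[j + 1] - prefix[i]
            confidence = 100.0 * hits / count
            if confidence >= alpha and (best is None or count > best[0]):
                best = (count, data[i][0], data[j][0], confidence)

    if best is None:
        return None
    count, r1, r2, confidence = best
    return (r1, r2, count / n, confidence)


# Hypothetical records: (increase in fixed deposit, uses a credit card).
records = [(10, 0), (20, 1), (30, 1), (40, 0), (50, 1), (60, 0)]
result = max_support_interval(records, alpha=70)
```

With these sample records and a confidence threshold of 70 %, the widest qualifying interval is [20, 50], which covers four of the six records, three of which have the 0-1 attribute set to 1.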