This application is based on Japanese Patent Application No. 9-341384, filed Dec. 11, 1997, and Japanese Patent Application No. 10-102383, filed Apr. 14, 1998, the contents of which are incorporated herein by reference.
The present invention relates to a distributed shared memory system suitably applied to a multiprocessor system of a shared memory type that executes large-scale data mining, for example, on the TB (terabyte) order, and a method of controlling the distributed shared memory.
With recent advances in bar code technology and the like, retailers such as supermarkets accumulate a large volume of sales data. Advanced retailers analyze this accumulated sales data and reflect the analysis results in store layouts, thereby increasing sales. Such a technique is generally called data mining.
Of the various kinds of information obtained by data mining, the most typical is an association rule. For example, an association rule includes the information "50% of the customers who buy packs of paper diapers also buy cans of beer". This is an example associated with supermarkets in the U.S.A. This association rule indicates that in the U.S.A., young fathers often buy packs of paper diapers, and buy cans of beer at the same time. In accordance with this information, therefore, packs of paper diapers and cans of beer are, for example, placed near each other to increase the sales of cans of beer. A method of obtaining such an association rule is disclosed in R. Agrawal et al., "Mining Association Rules between Sets of Items in Large Databases", Proceedings of ACM SIGMOD, May 1993. This method will be briefly described below.
Let I = {i1, i2, ..., im} be a set of attributes (items), and D = {t1, t2, ..., tn} be a transaction database. In this case, each transaction ti is a set of items. An association rule is defined as X ⇒ Y. In this case, X and Y are subsets of I, and the intersection of X and Y is an empty set. Two evaluation values referred to as support and confidence values will be defined. A support value indicates the ratio of the transactions in D that include an item set X to all the transactions in D, and a confidence value indicates the ratio of the transactions that include both X and Y to the transactions that include X in D. An association rule is extracted by the following procedure.
(1) An item set that satisfies the minimum support value is detected (such an item set is called a frequent item set).
(2) An association rule that satisfies the minimum confidence value is detected from the frequent item sets obtained in (1).
An example of how an association rule is extracted will be described below. Assume that T1={1, 3, 4}, T2={1, 2, 3, 5}, T3={2, 4}, T4={1, 2}, and T5={1, 3, 5} are set as transactions. An association rule that satisfies a minimum support value of 60% and a minimum confidence value of 60% is detected from these transactions. The frequent item sets are {1}, {2}, {3}, and {1, 3}, and 1 ⇒ 3 is obtained as an association rule.
The Apriori algorithm is known as a technique for efficiently extracting such frequent item sets. The Apriori algorithm is described in R. Agrawal et al., "Fast Algorithms for Mining Association Rules", Proceedings of 20th VLDB, 1994. This technique will be briefly described below.
(1) A transaction database is read, and the appearance frequency of each item is counted up, thereby obtaining support values. In this case, to count up the appearance frequency of each item is to count the number of times each item appears in the transaction database. Hereinafter, "count up" is used in this sense.
(2) Items that satisfy the minimum support value are extracted as frequent item sets having length 1.
(3) Combinations of pairs of items are formed from the frequent item sets having length 1. These combinations will be referred to as candidate item sets having length 2.
(4) Support values of the candidate item sets are obtained by searching the transaction database.
(5) Candidate item sets that satisfy the minimum support value are extracted to form the frequent item sets having length 2.
(6) For a length k (k ≥ 2), the following processing is performed.
(a) Candidate item sets having the length k are formed from the frequent item sets having a length k-1.
(b) Support values of the candidate item sets are obtained by searching the transaction database.
(c) Candidate item sets that satisfy the minimum support value are extracted to form the frequent item sets having the length k.
(7) The above processing is repeated until the set of frequent item sets becomes empty.
As described above, in conventional data mining, this Apriori algorithm is basically used to find association rules.
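The level-wise search in steps (1) to (7) of the Apriori procedure, together with the rule extraction from the frequent item sets, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the cited papers: the function names `support`, `apriori`, and `rules` are hypothetical, the whole transaction database is assumed to fit in memory, and exact fractions are used so that a support of exactly 60% compares equal to the 60% threshold. The subset pruning of candidates follows the idea of the cited VLDB paper, although the steps above do not spell it out.

```python
from fractions import Fraction
from itertools import combinations

# The five transactions T1..T5 from the example in the text.
TRANSACTIONS = [{1, 3, 4}, {1, 2, 3, 5}, {2, 4}, {1, 2}, {1, 3, 5}]


def support(itemset, transactions):
    """Ratio of transactions that include every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def apriori(transactions, min_support):
    """Return {frequent item set: support value} (hypothetical helper)."""
    items = {i for t in transactions for i in t}
    # Steps (1)-(2): count up single items, keep the frequent ones.
    level = {}
    for i in items:
        s = support(frozenset([i]), transactions)
        if s >= min_support:
            level[frozenset([i])] = s
    frequent = {}
    k = 2
    while level:  # Step (7): stop when no frequent item set remains.
        frequent.update(level)
        # Steps (3)/(6a): join frequent (k-1)-sets into candidates of length k.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k}
        # Pruning (assumption noted in the lead-in): every (k-1)-subset
        # of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # Steps (4)-(5)/(6b)-(6c): search the database, keep frequent ones.
        level = {}
        for c in candidates:
            s = support(c, transactions)
            if s >= min_support:
                level[c] = s
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (X, Y, confidence) for each rule X => Y that qualifies."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for x in combinations(sorted(itemset), r):
                x = frozenset(x)
                confidence = s / frequent[x]  # support(X and Y) / support(X)
                if confidence >= min_confidence:
                    found.append((x, itemset - x, confidence))
    return found
```

With `min_support` and `min_confidence` both set to `Fraction(60, 100)`, `apriori(TRANSACTIONS, ...)` yields {1}, {2}, {3}, and {1, 3}, and `rules(...)` reports 1 ⇒ 3 with a confidence of 3/4 (75%). Under the same definition the sketch also reports 3 ⇒ 1, whose confidence is 3/3 (100%).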
Although the Apriori algorithm is efficient, the transaction data to be processed in data mining is on the TB order, and such large-volume transaction data cannot readily be processed. Even if such data can be processed, the processing takes an enormous amount of time. For example, 1 TB of transaction data corresponds to 500 2-GB (gigabyte) disk units. Even if an SMP (symmetric multiprocessor) computer is used, it is difficult to connect all 500 disk units to one computer. Even if 500 disk units can be connected, problems arise in terms of I/O performance. For this reason, disk units storing transaction data on the TB order are preferably distributed among a plurality of nodes and processed by a cluster system. However, since the Apriori algorithm is an algorithm for sequential processing, it does not operate on a cluster system as is. Even if the Apriori algorithm is modified to operate on a cluster system of a distributed memory type, the resulting program inevitably follows a programming model of a distributed memory type, which involves explicit communication between nodes. This makes it difficult to develop a data mining program. More specifically, a programming model of a shared memory type allows exclusive control using a lock mechanism. In a programming model of a distributed memory type, however, the processors cannot see an identical storage area, because each has its own distributed memory; the algorithm must therefore be fundamentally changed, and the program must be modified.
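The contrast between the two programming models can be made concrete with a small sketch of the count-up step written in the shared memory style. It is only an illustration under assumptions: Python threads within one process stand in for processors that see an identical storage area, each list in `partitions` stands in for the transaction data held by one node, and the name `count_up_shared` is hypothetical. It does not represent any particular distributed shared memory mechanism.

```python
import threading
from collections import Counter


def count_up_shared(partitions, candidates):
    """Count candidate item sets in a table shared by all workers."""
    counts = Counter()        # the storage area every worker can see
    lock = threading.Lock()   # exclusive control over the shared table

    def worker(partition):
        for transaction in partition:
            for candidate in candidates:
                if candidate <= transaction:
                    # Only one worker at a time may update the table.
                    with lock:
                        counts[candidate] += 1

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Because every worker updates the same table, the sequential counting logic is kept almost unchanged and only the lock is added. In a distributed memory model, each node would instead hold private counts and exchange them through explicit messages, which is the restructuring the paragraph above refers to.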