1. Field of the Invention
The present invention relates to processing of a large amount of data normally stored in a database. Specifically, it relates first to a group-by process, and second to a database mining process.
The present invention relates to a process of classifying a large number of records according to the key value of each record and performing a specified operation, such as obtaining an average value, on each group of records having the same key value. Such a process is referred to as a group-by process. The present invention relates more specifically to a group-by processing system based on a hash process, that is, a system of hashing a large volume of records according to a hash function value obtained by applying an appropriate hash function to a key value, generating a list of the hashed records, sorting the hashed records in the list according to the key value, and performing a group-by process on the sorted records of the resultant list.
The present invention also relates to a database mining process for obtaining a rule of the relationship among data stored in a database, and more specifically to a system of counting occurrences of a combination of related data in a large volume of data in the database. According to a count result of the system, a process of generating an association rule is performed as one of the data mining methods using a combination which meets a given condition and the number of occurrences of the combination. The association analysis based on the association rule is actively studied in the U.S. and many other nations.
2. Description of the Related Art
Normally, an operation in the group-by process, that is, an operation on a group of records having the same key value, for example, the same item number, can be a count process for counting records such as item sets, an operation for computing a total or average of specific field values of the group of records, etc. These group-by processes are frequently performed in a relational database process, a statistical process, etc.
A group-by processing system can be based on a sort process or a hash process. A system based on the sort process accesses records having the same key value consecutively by sorting the group of records according to their key values. That is, the group of records is first sorted according to the key values, and the list of resultant sorted records is scanned from the beginning. A specified operation is repeated for the records having the same key value, and when the key value changes, the operation is initialized.
In a hash process, a group of records are read into an input buffer, and hash function values, which respectively correspond to key values of the records to be hashed, are calculated by using a hash function. Then, each of the records in the input buffer is stored in one of record buffers according to the hash function value of the record. If a record buffer, which corresponds to one of the hash function values, is filled with records, the records in the record buffer are output into one of the hashed lists, which respectively correspond to a record buffer, in a secondary storage.
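The buffering scheme described above can be sketched as follows. This is a minimal in-memory Python illustration: the buffer capacity, the number of buckets, the mod hash function, and the use of ordinary lists to stand in for the hashed lists in the secondary storage are all illustrative assumptions, not part of the conventional system itself.

```python
from collections import defaultdict

# Illustrative assumptions: 4 records per record buffer, 3 hash values.
BUFFER_CAPacity = 4
BUFFER_CAPACITY = 4
NUM_BUCKETS = 3

def hash_partition(records, key_of, hashed_lists):
    """Distribute records into hashed lists by hash function value,
    flushing a record buffer to its hashed list whenever it fills."""
    buffers = defaultdict(list)
    for record in records:
        h = key_of(record) % NUM_BUCKETS  # the hash function (mod, for illustration)
        buffers[h].append(record)
        if len(buffers[h]) >= BUFFER_CAPACITY:
            # Stands in for writing the full record buffer to secondary storage.
            hashed_lists[h].extend(buffers[h])
            buffers[h].clear()
    # Flush any partially filled buffers at the end.
    for h, buf in buffers.items():
        hashed_lists[h].extend(buf)
```

For example, with `hashed_lists = {h: [] for h in range(NUM_BUCKETS)}`, every record ends up in the hashed list matching its key value mod 3.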
FIG. 164 is a flowchart showing an example of a conventional group-by processing system based on the sort process. When the process starts as shown in FIG. 164, a group of records to be processed in a group-by process is sorted according to key values in step S201. In step S202, the leading record in the group of records is read. In step S203, a function is initialized. In step S204, an operation of the function is performed on the read record. In step S205, it is determined whether or not any record still exists in the group of the sorted records.
When a record exists, the leading record in the group of the sorted records is read in step S206. In step S207, it is determined whether or not the key value of the record is equal to the key value of the previously read record. When the key values are equal to each other, the processes in and after steps S204 are repeated.
When the key value of a record is not equal to that of the previous record, a termination process is performed on the function in step S208. In step S209, the result of the function process and the record are output as a resultant record, and the processes in and after step S203 are repeated.
If it is determined in step S205 that the group of the sorted records have been read, then the termination process is performed on the function in step S210, and the result of the function process performed on the previous record and the record are output as a resultant record in step S211, and the process terminates.
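The flow of steps S201 through S211 can be sketched as follows. This is an illustrative Python sketch; the `init`, `step`, and `finish` callbacks, which represent the initialization, operation, and termination process of the function, are assumed names.

```python
def group_by_sort(records, key_of, init, step, finish):
    """Group-by via the sort process: sort the records (S201), then scan
    them, re-initializing the function each time the key value changes."""
    sorted_records = sorted(records, key=key_of)   # S201
    results = []
    it = iter(sorted_records)
    record = next(it, None)                        # S202: read the leading record
    while record is not None:
        key = key_of(record)
        state = init()                             # S203: initialize the function
        while record is not None and key_of(record) == key:
            state = step(state, record)            # S204: operate on the record
            record = next(it, None)                # S205/S206: read the next record
        # The key value changed (S207: NO) or the records ran out (S205: NO):
        # perform the termination process and output a resultant record.
        results.append((key, finish(state)))       # S208-S211
    return results
```

For the count process of the example, `init` returns 0, `step` adds 1 per record, and `finish` fixes the count value.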
FIGS. 165A through 165F are practical examples showing the proceedings of the group-by process performed according to the flowchart shown in FIG. 164. As shown in FIG. 165A, a group of records to be processed in a group-by process contains 10 records, and each record comprises only a key value, for example, an item number for simple representation. When the process in step S201 is completed, the state shown in FIG. 165B is realized.
In this example, the operation in the group-by process is a count process for obtaining the number of records having the same key value. When the processes in steps S202 and S203 terminate, the state shown in FIG. 165C is entered. That is, in the initialization of the function shown in step S203, the count value is set to 0.
In step S204, the count value is incremented by 1 to enter the state shown in FIG. 165D. The determination in step S205 is YES, and in step S206, the next record `1` is read and becomes the current record. Since its key value equals that of the previous record, the determination in step S207 is YES, the count value is incremented in step S204, and the state shown in FIG. 165E is entered.
The determination in step S205 is YES again, and a new record is read in step S206. The key value of the record is 2, which differs from the key value 1 of the previous record. Therefore, the termination process is performed on the function in step S208. For a counting operation, the termination process simply fixes the current count value. The fixed value indicates the result of the process of the function and is appended to the previous key value, that is, `1` in this example, to be output as a resultant record in step S209. That is, the output resultant record is `1, 2`. FIG. 165F shows the result.
By repeatedly performing the processes, the following group of records can be finally obtained as a process result.
1, 2
2, 2
3, 3
4, 1
5, 2
The result indicates that the group of records to be processed in the group-by process contains two records having the key value of 1, two records having the key value of 2, three records having the key value of 3, one record having the key value of 4, and two records having the key value of 5.
FIG. 166 is a flowchart showing an example of a conventional group-by processing system based on the hash process. Compared with the flowchart based on the sort process shown in FIG. 164, the group of records to be processed in the group-by process is first stored in one of the record buffers according to the hash function values in step S221. Then, the records in each record buffer are sorted according to key values in step S222, all sorted contents of the record buffers are connected to form a string, and processes almost the same as the processes in and after step S202 shown in FIG. 164 are performed in steps S223 through S232.
FIGS. 167A through 167F show a practical example of the proceedings of the process performed according to the flowchart shown in FIG. 166. The group of records to be processed in the group-by process are the same as those shown in FIGS. 165A through 165F. When the process is based on the hash process, the group of records to be processed in the group-by process are hashed using an appropriate hash function. In this example, mod 3 is used as a hash function. That is, the value of the key is divided by 3, and the record is distributed to a hash bucket depending on the value of the remainder. If the remainder is i, a hash bucket i stores the record. In this example, the value of the remainder can be 0, 1, or 2, and there can be three hash buckets.
FIG. 167B shows the state after the process in step S221 shown in FIG. 166, that is, after the hashing is completed. FIG. 167C shows the result of sorting the group of records in each hash bucket according to key values. FIG. 167D shows the result of integrating the contents of all hash buckets into a string. Thus, the processes up to step S222 terminate. The subsequent processes are performed in the same manner as in the sort process. Finally, the following records are obtained as a resultant group of records.
3, 3
1, 2
4, 1
2, 2
5, 2
Although the order of the records is different from the order shown in FIGS. 165A through 165F, the entire group outputs the same result as the case based on the sort process.
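The hash-based flow of FIG. 166 and FIGS. 167A through 167F can be sketched as follows. This is an illustrative Python sketch using the mod 3 hash function of the example; the in-memory bucket lists stand in for the hashed lists in the secondary storage, and the counting operation stands in for the general group-by function.

```python
def group_by_hash(records, key_of, num_buckets=3):
    """Group-by via the hash process: distribute records into hash buckets
    (key mod num_buckets, as in the example), sort each bucket by key value,
    concatenate the buckets into a string of records, and count each run of
    records having the same key value."""
    buckets = [[] for _ in range(num_buckets)]
    for record in records:                          # S221: hash into buckets
        buckets[key_of(record) % num_buckets].append(record)
    concatenated = []
    for bucket in buckets:                          # S222: sort and concatenate
        concatenated.extend(sorted(bucket, key=key_of))
    results = []
    for record in concatenated:                     # S223-S232: scan and count
        if results and results[-1][0] == key_of(record):
            results[-1][1] += 1
        else:
            results.append([key_of(record), 1])
    return [tuple(pair) for pair in results]
```

Since each key value maps to exactly one bucket, records with equal keys never span a bucket boundary, so counting runs after concatenation is sufficient.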
Since the operation of counting the combinations of data is part of the association rule generation process in the above described database mining, the association rule is first described. The group-by processing system according to the present invention is used as part of the process of counting the combinations of data as described later.
For example, assume that according to 100 customers' receipts collected through POS (point-of-sale) terminals at a retail shop, 20 customers bought product A, and 12 customers bought both products A and B. One product is called an item, and a receipt slip is called a transaction. One transaction normally contains many items. Based on the following definition, the support of the product A is 20%, and the support of the products A and B is 12%.
Support of items = number of transactions containing the items / total number of transactions
Furthermore, based on a simple conditional probability computation, the conclusion that `60% (12%/20%) of the customers who buy the product A also buy the product B` can be obtained. This is represented by `A->B, confidence of 60%, and support of 12%`, and is defined as an association rule. The confidence of the association rule `A->B` is computed by:
Confidence of `A->B` = support of A∧B (both A and B are bought) / support of A
Furthermore, in addition to a simple rule such as `A->B`, a complicated rule such as `A∧B->C∧D∧E` (a customer who buys A and B also buys C, D, and E) can be applied. The confidence in this example is computed as follows.
Confidence of `A∧B->C∧D∧E` = support of A∧B∧C∧D∧E / support of A∧B
The association rule can be applied, for example, to the evaluation of the contribution of a special service product, the optimization of store shelves (arranging products for good combinations), the improvement of a hit ratio of direct mail from the data of credit cards, etc.
The association rule generating process includes the steps of: (1) counting the occurrences of combinations of items which satisfy the condition of a given support in the transactions; and (2) computing the rule, support, and confidence based on the combinations and their number of occurrences obtained in step (1). The present invention is designed to improve the counting process in step (1).
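The computations in step (2) above reduce to simple ratios of the counts obtained in step (1), as the following minimal Python sketch shows; the function names are illustrative.

```python
def support(itemset_count, total_transactions):
    """Support = number of transactions containing the itemset
    divided by the total number of transactions."""
    return itemset_count / total_transactions

def confidence(count_x_and_y, count_x):
    """Confidence of `X->Y` = support of X∧Y / support of X,
    which reduces to a ratio of the two counts."""
    return count_x_and_y / count_x
```

With the receipt example above (100 transactions, 20 containing A, 12 containing both A and B), `support(20, 100)` is 0.2, `support(12, 100)` is 0.12, and `confidence(12, 20)` is 0.6.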
In what follows, a combination of items is referred to as `itemset`, and a combination of items which satisfy the condition of the given support and is found in step (1) is referred to as a `large itemset`. The condition of the support is in the range from the minimum value (0% [=counting any itemset purchased in a transaction] through 100% [=counting itemsets purchased in all transactions]) to the maximum value (minimum value&lt;=maximum value &lt;=100%). Conventionally, the maximum value is fixed to 100% in many cases.
Since the process of counting large itemsets takes a long time, various processing methods have been proposed. In particular, the SETM algorithm based on SQL, and the Apriori algorithm, which is one of the algorithms suggested by IBM, are known as typical algorithms. The SETM algorithm and the Apriori algorithm are described in the following documents 1 and 2.
document 1: Maurice Houtsma and Arun Swami. Set-Oriented Mining for Association Rules in Relational Databases. In Proceedings of the IEEE Data Engineering Conference, pages 25-33, 1995.

document 2: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.
The association rule generation according to the SETM algorithm is based on the SQL language which is a relational database query language, and can be easily implemented. In this process, an SQL join operation and a group-by operation are performed. A self-join operation is performed using a table of transactions containing the large itemsets of length k-1 (the large (k-1)-itemsets) in order to generate potential candidates of large k-itemsets (the candidate k-itemsets). Then, these candidate k-itemsets are counted in a group-by operation, and those satisfying the minimum support condition form the large k-itemsets. Furthermore, the transactions which contain large k-itemsets are generated in a join operation, and are used in generating the candidate (k+1)-itemsets.
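The SETM flow of self-join, group-by counting, and filtering can be sketched in Python as follows. The original formulation is in SQL; the function and variable names and the in-memory representation of the tables Rk and Lk are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def setm_large_itemsets(transactions, min_count):
    """SETM-style passes: for each k, count the candidate k-itemsets per
    transaction with a group-by (GB(k)), keep those meeting the minimum
    count (Lk), restrict the transaction table to surviving itemsets
    (J(k)), and self-join on the transaction id to build the candidate
    (k+1)-itemsets (SJ(k)). transactions maps tid -> set of items.
    Returns {k: {itemset: count}}."""
    # R1: (tid, 1-itemset) rows, itemsets kept as sorted tuples.
    rk = [(tid, (item,)) for tid, items in transactions.items()
          for item in sorted(items)]
    large = {}
    k = 1
    while rk:
        counts = Counter(itemset for _, itemset in rk)          # GB(k)
        lk = {s: c for s, c in counts.items() if c >= min_count}
        if not lk:
            break
        large[k] = lk
        # J(k): keep only rows whose itemset is large.
        by_tid = {}
        for tid, s in rk:
            if s in lk:
                by_tid.setdefault(tid, []).append(s)
        # SJ(k): self-join on tid, extending each itemset by one item.
        rk = [(tid, a + (b[-1],))
              for tid, itemsets in by_tid.items()
              for a, b in combinations(itemsets, 2)
              if a[:-1] == b[:-1] and a[-1] < b[-1]]
        k += 1
    return large
```

Note that, as the text above observes, nothing prunes the self-join output other than transaction membership, which is the source of the large intermediate tables Rk'.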
FIG. 168 shows an illustrative example of SETM. FIG. 169 shows the contents of each function block in the SETM algorithm. Using these figures, the SETM algorithm is described in detail in what follows.
In FIG. 168, a table R1' shows the items contained in each transaction t_x. For example, the transaction with t_x=1 contains items 1, 2, and 3. A GB(1) performs the counting (a group-by operation) of the occurrences of each item. A table L1 contains the items and their numbers of occurrences in the transactions for the case where the minimum support value is 20% and the maximum support value is 100%. Thus, all items which appear in two or more transactions are contained in L1.
A table R1 shows the result obtained in the join process J(1) by extracting from R1' the transactions whose items appear in table L1.
An SJ(1) performs the self-join of table R1. As a result of the operation, all possible 2-itemsets are generated for each transaction and inserted into table R2'.
In the group-by process GB(2) on R2', the occurrences of 2-itemsets are counted. Those whose counts are equal to or larger than 2 compose the large 2-itemsets and are inserted in table L2.
Similarly, a table L3 for large 3-itemsets and a table L4 for large 4-itemsets are generated. However, the table L4 is empty.
In the Apriori algorithm, the candidate k-itemsets are generated using the set of large (k-1)-itemsets. When the whole set of large (k-1)-itemsets is completely stored in memory, the candidate k-itemsets are generated by first joining the large (k-1)-itemsets. For each k-itemset resulting from this join, all (k-1)-subsets are generated and checked against the large (k-1)-itemsets. Only when all the subsets are large (k-1)-itemsets is the join result of length k assumed to be a candidate k-itemset. Thus, in Apriori, the large itemsets of each pass are stored in a hash-table in memory and can be used to efficiently prune the unnecessary candidates. Furthermore, the candidate k-itemsets are stored in a hash-tree in memory, and by scanning the transactions, the k-itemsets which occur within the transactions and are found in the hash-tree have their counters incremented. Since only the itemsets found in the hash-tree are counted, unnecessary k-itemsets contained in the transactions are not counted.
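The join-and-prune candidate generation described above can be sketched as follows. This is an illustrative Python sketch in which the in-memory hash-table of large itemsets is replaced by an ordinary set for brevity; itemsets are sorted tuples.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate the candidate k-itemsets from the large (k-1)-itemsets:
    join pairs that share their first k-2 items, then prune any candidate
    having a (k-1)-subset that is not itself a large (k-1)-itemset."""
    large_prev = set(large_prev)
    candidates = set()
    for a, b in combinations(sorted(large_prev), 2):
        if a[:-1] == b[:-1]:                      # join step
            candidate = a + (b[-1],)
            # Prune step: every (k-1)-subset must be a large (k-1)-itemset.
            if all(sub in large_prev
                   for sub in combinations(candidate, len(candidate) - 1)):
                candidates.add(candidate)
    return candidates
```

For example, from the large 2-itemsets {(1,2), (1,3), (2,3), (2,4)}, the join produces (1,2,3) and (2,3,4), and the prune step removes (2,3,4) because its subset (3,4) is not large.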
FIG. 170 shows an illustrative example of the Apriori algorithm. FIG. 171 shows the contents of each function block in the processing of the Apriori algorithm.
In FIG. 170, the contents of a list TL containing 8 transactions are the same as those shown in FIG. 168. First, each of the items contained in these transactions is input into Subset(1), and the occurrences of each item are counted and inserted into C1. The count result is input into F, the items having a number of occurrences equal to or larger than 2 are selected through a filtering operation, and the set of large 1-itemsets L1, which is organized in a hash-table, is generated.
A self-join of all items in L1 is executed by AG(1), and the result is the set of candidate 2-itemsets C2, which is organized in a hash-tree. If the transaction list TL contains a 2-itemset that is stored in the hash-tree as a candidate itemset, the counter of that candidate itemset is incremented by Subset(2). By filtering the result through F, the candidate 2-itemsets having a number of occurrences equal to or larger than 2 form the set of large 2-itemsets L2, which is stored in a hash-table.
By performing similar processes, 3-itemsets having the occurrence numbers equal to or larger than 2 are obtained as large 3-itemsets L3. The process terminates when, as shown in FIG. 170, it is determined that no 4-itemsets having the occurrence number equal to or larger than 2 exist.
First, the problems with the conventional technology relating to the group-by processing system are described. As described above, the group-by processing system can be based on either the sort process or the hash process. In the sort process, the entire group of records must be sorted as a whole, thereby requiring a long time and a high cost.
On the other hand, the hash process requires a lower cost than the sort process because the sorting operation is performed only on the group of records contained in each record buffer. However, if the entire group of records to be processed in the group-by processing system is too large to be stored in the main memory, then the records in the record buffers must be written to the secondary storage device and read back from the secondary storage device into the record buffers, and the access to the secondary storage device is costly.
That is, in the conventional hash processing system, record buffers are provided in the main memory, and a record is stored in the record buffer corresponding to the hash function value for the record. When a record buffer is full, the records in the record buffer are output to one of the hashed lists. Each hashed list corresponds to a record buffer, and is provided in the secondary storage device.
In the processes described above, each of the output processes is performed individually, and the records stored in one of the record buffers are sequentially output to the corresponding one of the hashed lists in the secondary storage device. Therefore, within each of the hashed lists, each record is stored in a region next to the region in which the previous record is stored.
However, from the viewpoint of a memory region of the secondary storage device in which all of the hashed lists are included, the record output processes are non-sequentially performed, and the records output from each of the record buffers are discontinuously stored in the memory region.
Therefore, according to the conventional hash processing system, many empty regions are formed in the memory region of the secondary storage device, and this reduces the available storage capacity of the secondary storage device and the operating performance of the computer system.
Described below are some problems with the SETM algorithm and the Apriori algorithm for finding large itemsets in the generation of association rules. Concerning the SETM algorithm, since there is no means for pruning unnecessary combinations in the self-join operation SJ(k-1), Rk' becomes too large and thus the subsequent group-by process GB(k) becomes too heavy.
The Apriori algorithm solves this problem of the SETM algorithm. Apriori generates the candidate k-itemsets Ck to be counted in a pass k by joining the large (k-1)-itemsets L(k-1), and pruning those k-itemsets that contain any subset that is not a large (k-1)-itemset. This procedure results in the generation of a much smaller number of candidate itemsets Ck than in SETM. However, this pruning of k-itemsets requires that all the large (k-1)-itemsets be maintained in memory. If they cannot fit in memory, the pruning can no longer be done. Moreover, for AG(1) in Apriori, the join operation is almost a Cartesian product of the items in L1, and thus the generation of C2 is a very heavy process. As illustrated in FIGS. 168 and 170, the number of 2-itemsets in C2 is even greater than the number of 2-itemsets in R2' for SETM, since for the generation of C2 in Apriori there is no restriction that the 2 items should belong to the same transaction.
Furthermore, in the Apriori algorithm, the k-itemsets in the transactions are counted based on the candidate k-itemsets that are stored in a hash-tree. Therefore, if all the k-itemsets in the hash-tree cannot be stored in the memory, they are partitioned so that each partition can fit in the available memory space. For each such partition, the transactions have to be scanned once. This is a problem when the set of transactions is very large and reading it requires a large amount of time.