1. Field of the Invention
The present invention relates to a data mining algorithm, particularly to a fast algorithm for mining high utility itemsets, which is also called the absorptive mining algorithm.
2. Description of the Related Art
Data mining has been extensively applied to many fields, including business, medicine and education. However, the conventional technology of mining frequent itemsets [1] does not consider the profit or purchased quantity of each item; it considers only how frequently each item appears in a transaction database. By mining frequent itemsets, a store can learn its most popular combinations of products. However, the most popular products are not necessarily the highest-profit products, and the highest-profit products are often not popular. For example, milk plus bread may be the most popular combination of products, appearing in 6% of the total transactions but contributing only 1% of the total profit, whereas the transactions that include beverages and instant noodles may account for only 2% of the total transactions but contribute as much as 7% of the total profit. Therefore, it is more favorable for the store to spend its limited marketing budget on the high-profit products than on the popular products. The combinations of high-profit products are called high utility itemsets (HUIs) hereinafter.
Some definitions used in the description of the present invention are introduced below. Let I={i1, i2, . . . , im} be the set of all items. An itemset X is a subset of I, and the length of X is the number of items contained in X. A transaction database D={T1, T2, . . . , Tn} contains a set of transactions, and each transaction has a unique transaction identifier (TID). Each transaction Ti (1≦i≦n) contains an itemset and the purchased quantity of each item in that itemset. The purchased quantity of item ip in a transaction Tq is denoted as o(ip, Tq). The utility of item ip in Tq is u(ip, Tq)=o(ip, Tq)×s(ip), wherein s(ip) is the profit of item ip. The utility of an itemset X in Tq is the sum of the utilities in Tq of the items contained in X, which is shown in Expression (1). If X is not a subset of Tq, u(X, Tq)=0.
The utility of an itemset X in D is the sum of the utilities of X in all the transactions containing X, which is shown in Expression (2). An itemset X is a high utility itemset if the utility of X in D is no less than a specified minimum utility (MU).
$$u(X, T_q) = \sum_{i_p \in X} u(i_p, T_q) \tag{1}$$
$$u(X) = \sum_{X \subseteq T_q \in D} u(X, T_q) \tag{2}$$
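TABLE 1: transaction database (purchased quantity of each item)
TID    A    B    C    D    E    F
T1     1    0    3    0    0    1
T2     0    4    0    5    0    0
T3     7    2    5    7    0    0
T4     0    1    0    0    4    0
T5     2    0    0    9    0    1
T6     0    0    5    0    7    0
T7     0    10   0    3    0    0
T8     0    2    2    0    3    0
T9     8    1    0    5    0    0
T10    0    5    2    3    0    0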
TABLE 2: profit table
Item         A    B    C    D    E    F
Profit ($)   7    2    5    1    10   13
For example, Table 1 is a transaction database in which each number represents the purchased quantity of an item in a transaction, and Table 2 is the profit table, which records the profit of each item in Table 1. Suppose the minimum utility MU is 100. The itemset {C, E} is contained in transactions T6 and T8, so its utility in Table 1 is u({C,E})=(5×5+7×10)+(2×5+3×10)=135≧100. Therefore, the itemset {C, E} is a high utility itemset. In frequent itemset mining [1], all the subsets of a frequent itemset are also frequent itemsets; that is, frequent itemsets have the downward closure property. However, this property does not hold for high utility itemsets, since a subset of a high utility itemset is not necessarily a high utility itemset. For example, the itemset {C, E} is a high utility itemset in Table 1, but its subset {C} is not, because u({C})=(3×5)+(5×5)+(5×5)+(2×5)+(2×5)=85<100.
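The two utilities in this example can be checked mechanically. The following is a minimal sketch of Expressions (1) and (2) in Python, with the data transcribed from Table 1 and Table 2. The names `PROFIT`, `DATABASE`, `utility_in_transaction` and `utility` are illustrative and do not come from the source, and the T7 row is read here as B=10, D=3, which is an assumption that does not affect either value checked below.

```python
# Profit per item, transcribed from Table 2.
PROFIT = {"A": 7, "B": 2, "C": 5, "D": 1, "E": 10, "F": 13}

# Purchased quantities per transaction, transcribed from Table 1 (zeros omitted).
# Assumption: the T7 row is read as B=10, D=3. It contains neither C nor E,
# so it does not influence the two utilities printed below.
DATABASE = {
    "T1": {"A": 1, "C": 3, "F": 1},
    "T2": {"B": 4, "D": 5},
    "T3": {"A": 7, "B": 2, "C": 5, "D": 7},
    "T4": {"B": 1, "E": 4},
    "T5": {"A": 2, "D": 9, "F": 1},
    "T6": {"C": 5, "E": 7},
    "T7": {"B": 10, "D": 3},
    "T8": {"B": 2, "C": 2, "E": 3},
    "T9": {"A": 8, "B": 1, "D": 5},
    "T10": {"B": 5, "C": 2, "D": 3},
}


def utility_in_transaction(itemset, transaction):
    """u(X, Tq) per Expression (1); 0 when X is not contained in Tq."""
    if not set(itemset).issubset(transaction):
        return 0
    return sum(transaction[item] * PROFIT[item] for item in itemset)


def utility(itemset, database=DATABASE):
    """u(X) per Expression (2): sum of u(X, Tq) over the transactions in D."""
    return sum(utility_in_transaction(itemset, t) for t in database.values())


MU = 100  # minimum utility used in the example
print(utility({"C", "E"}), utility({"C", "E"}) >= MU)  # 135 True
print(utility({"C"}), utility({"C"}) >= MU)            # 85 False
```

The second line of output shows the lack of downward closure directly: {C} falls below MU although its superset {C, E} does not.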
Therefore, some researchers proposed a Two-Phase algorithm for mining high utility itemsets. They defined the transaction utility (TU) of a transaction and the transaction weighted utility (TWU) of an itemset X, which are shown in Expressions (3) and (4), respectively.
$$tu(T_q) = \sum_{i_p \in T_q} u(i_p, T_q) \tag{3}$$
$$twu(X) = \sum_{X \subseteq T_q \in D} tu(T_q) \tag{4}$$
If the TWU of an itemset is no less than MU, the itemset is a high transaction weighted utility itemset (HTWUI). According to Expression (4), the TWU of an itemset X must be greater than or equal to the utility of X in D, because the transaction utility of every transaction containing X includes the utility of X in that transaction. Therefore, if X is a high utility itemset, X is also an HTWUI. Moreover, every transaction that contains an HTWUI also contains each of its subsets, so all the subsets of an HTWUI are also HTWUIs; that is, HTWUIs have the downward closure property. In the first phase, the Two-Phase algorithm [4] applies the Apriori algorithm [1] to find all the HTWUIs, which are called candidate high utility itemsets. In the second phase, the Two-Phase algorithm scans the database again to compute the utilities of all the candidate high utility itemsets and find the high utility itemsets among them.
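The two phases described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation of [4]: it recomputes the TWU of each candidate with its own pass over an in-memory list rather than one database scan per level, all function names are hypothetical, and the T7 row of Table 1 is read as B=10, D=3.

```python
from itertools import combinations

# Profit per item (Table 2) and purchased quantities (Table 1, zeros omitted).
# Assumption: the T7 row is read as B=10, D=3.
PROFIT = {"A": 7, "B": 2, "C": 5, "D": 1, "E": 10, "F": 13}
DATABASE = [
    {"A": 1, "C": 3, "F": 1},          # T1
    {"B": 4, "D": 5},                  # T2
    {"A": 7, "B": 2, "C": 5, "D": 7},  # T3
    {"B": 1, "E": 4},                  # T4
    {"A": 2, "D": 9, "F": 1},          # T5
    {"C": 5, "E": 7},                  # T6
    {"B": 10, "D": 3},                 # T7 (assumed reading)
    {"B": 2, "C": 2, "E": 3},          # T8
    {"A": 8, "B": 1, "D": 5},          # T9
    {"B": 5, "C": 2, "D": 3},          # T10
]


def tu(transaction):
    """Transaction utility, Expression (3)."""
    return sum(qty * PROFIT[item] for item, qty in transaction.items())


def twu(itemset, database):
    """Transaction weighted utility, Expression (4)."""
    return sum(tu(t) for t in database if itemset.issubset(t))


def utility(itemset, database):
    """Exact utility u(X), Expression (2)."""
    return sum(
        sum(t[item] * PROFIT[item] for item in itemset)
        for t in database
        if itemset.issubset(t)
    )


def two_phase(database, mu):
    """Return (candidates, high_utility) for the minimum utility mu."""
    # Phase 1: Apriori-style level-wise search that prunes on TWU.
    items = sorted({item for t in database for item in t})
    level = [frozenset([i]) for i in items if twu(frozenset([i]), database) >= mu]
    candidates = []
    while level:
        candidates.extend(level)
        survivors = set(level)
        k = len(level[0]) + 1
        # Join step: unions of two surviving (k-1)-itemsets that have size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must have survived; then check the TWU.
        level = [
            c for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
            and twu(c, database) >= mu
        ]
    # Phase 2: compute the exact utility of every candidate.
    high_utility = {}
    for c in candidates:
        value = utility(c, database)
        if value >= mu:
            high_utility[c] = value
    return candidates, high_utility


candidates, high_utility = two_phase(DATABASE, mu=100)
ce = frozenset({"C", "E"})
print(twu(ce, DATABASE), high_utility[ce])  # 139 135
print(len(candidates), len(high_utility))   # candidates vs. true HUIs
```

For {C, E} the TWU is tu(T6)+tu(T8)=95+44=139, slightly above its exact utility of 135, so the itemset survives phase 1 and is confirmed in phase 2.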
The Two-Phase algorithm does not miss any high utility itemset. However, in the first phase it repeatedly scans the database and searches a large number of candidate HTWUIs to generate the candidate high utility itemsets. Because the TWU of an itemset is likely to overestimate its utility, a huge number of candidate high utility itemsets is generated, especially when the minimum utility is small. In the second phase, the Two-Phase algorithm needs to scan the large database again and search this huge number of candidates, which significantly degrades the mining performance.
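The gap between candidates and true high utility itemsets can be observed even on the small example of Table 1. The sketch below counts both for several minimum utilities by brute-force enumeration, which is feasible only because six items yield 63 non-empty itemsets; it is not how the Two-Phase algorithm searches. The names are hypothetical, and the T7 row is again read as B=10, D=3, which is an assumption that influences the printed counts.

```python
from itertools import combinations

ITEMS = "ABCDEF"
PROFIT = (7, 2, 5, 1, 10, 13)  # Table 2, in the order of ITEMS
# Rows of Table 1 (T1..T10); assumption: T7 is read as B=10, D=3.
QUANTITIES = [
    (1, 0, 3, 0, 0, 1),
    (0, 4, 0, 5, 0, 0),
    (7, 2, 5, 7, 0, 0),
    (0, 1, 0, 0, 4, 0),
    (2, 0, 0, 9, 0, 1),
    (0, 0, 5, 0, 7, 0),
    (0, 10, 0, 3, 0, 0),
    (0, 2, 2, 0, 3, 0),
    (8, 1, 0, 5, 0, 0),
    (0, 5, 2, 3, 0, 0),
]


def scores(columns):
    """Return (twu, utility) of the itemset given as column indices."""
    twu = exact = 0
    for row in QUANTITIES:
        if all(row[c] > 0 for c in columns):  # transaction contains the itemset
            twu += sum(q * p for q, p in zip(row, PROFIT))
            exact += sum(row[c] * PROFIT[c] for c in columns)
    return twu, exact


# Every non-empty itemset over the six items, as tuples of column indices.
ALL_ITEMSETS = [
    cols
    for size in range(1, len(ITEMS) + 1)
    for cols in combinations(range(len(ITEMS)), size)
]


def count_candidates_and_hui(mu):
    """Count HTWUIs (phase-1 candidates) and true high utility itemsets."""
    results = [scores(cols) for cols in ALL_ITEMSETS]
    n_candidates = sum(1 for twu, _ in results if twu >= mu)
    n_hui = sum(1 for _, exact in results if exact >= mu)
    return n_candidates, n_hui


for mu in (200, 150, 100, 50):
    print(mu, count_candidates_and_hui(mu))
```

Because the TWU of an itemset is never smaller than its utility, the first count is never smaller than the second, and the number of candidates cannot decrease as the minimum utility is lowered.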
Some researchers have sought to reduce the number of candidate high utility itemsets generated in the first phase of the Two-Phase algorithm by decreasing the value of the TWU, so that the utility of an itemset is overestimated less while the downward closure property is preserved. This technology indeed generates fewer candidate high utility itemsets than the Two-Phase algorithm. However, it still adopts the two-phase scheme: it still needs to generate candidate high utility itemsets and then scan the whole database to find the high utility itemsets.
In order to solve the problem of scanning the database repeatedly, Ahmed et al. proposed the HUC-Prune algorithm [2], which applies the FP-Growth algorithm [3] to compute the TWU of itemsets and find the candidate high utility itemsets, and then scans the database and searches the candidates to find the high utility itemsets. The HUC-Prune algorithm scans the database only three times. However, it still needs two phases, one to generate the candidate high utility itemsets and one to scan the whole database, in order to find the high utility itemsets. Moreover, the HUC-Prune algorithm still generates too many candidate high utility itemsets.
Accordingly, the present invention proposes a fast algorithm for mining high utility itemsets to solve the problems of the conventional technologies.