1. Field of the Invention
The present invention relates to a data processing system, and in particular to hash-tree operations performed in data mining for discovering unknown rules in databases and to the algorithms used for such data mining.
2. Description of the Related Art
It is well known that association rules are among the kinds of knowledge obtained by data mining of a large database. Association rules usually relate to sets of items that often appear in the same record. The rules are typically used for marketing strategies in retail sales. For example, an association rule is discovered by analyzing customer purchase records or accumulated sales receipt data in order to find a purchase tendency. Based on the purchase tendency, goods or services bought at the same time can be identified, which helps to develop sales and implement marketing strategies. The “record” in this specification means a list of items bought by a customer.
As a method of discovering an association rule relating to itemsets in the records of a database, an algorithm called “Apriori” by R. Agrawal et al. is disclosed in the article “Fast Algorithms for Mining Association Rules” (Proc. of 20th VLDB, 1994) and in Unexamined Japanese Patent Publication 8-287106 (priority is claimed from U.S. priority number 415006; priority date: Mar. 31, 1995).
In “Apriori”, two indexes, “support” and “confidence”, are used as criteria for discovering an association rule. Herein, however, another criterion, “frequency”, is used instead of “support” to explain “Apriori” in view of the present invention.
An example of an association rule will be described. When there are two itemsets [A, B, . . . , X] and [Y], the association rule between the itemsets is expressed in the form
A, B, . . . , X → Y
where the number of records including all of A, B, . . . , X, Y is called the frequency of the rule, and
the ratio of the number of records including A, B, . . . , X, Y to the number of records including A, B, . . . , X is called the confidence.
In “Apriori”, an association rule whose frequency and confidence are greater than predefined lowest values (a minimum frequency and a minimum confidence, respectively) is selected.
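The two criteria defined above can be illustrated by a minimal Python sketch. The function names and the sample records are assumptions made only for illustration; each record is assumed to be a list of integer items.

```python
def frequency(records, items):
    """Number of records that include every item in `items`."""
    wanted = set(items)
    return sum(1 for record in records if wanted <= set(record))


def confidence(records, lhs, rhs):
    """Ratio of records including LHS and RHS to records including LHS."""
    lhs_count = frequency(records, lhs)
    if lhs_count == 0:
        return 0.0  # assumption: a rule whose LHS never appears gets confidence 0
    return frequency(records, set(lhs) | set(rhs)) / lhs_count


# Hypothetical sample data: four purchase records.
records = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 3, 4]]
rule_frequency = frequency(records, [1, 2, 3])    # frequency of the rule 1, 2 -> 3
rule_confidence = confidence(records, [1, 2], [3])
```

For the rule 1, 2 → 3 in this sample, two records contain all three items and three records contain items 1 and 2, so the frequency is 2 and the confidence is 2/3.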
By implementing the association rule discovering system shown in FIGS. 30 and 31, “Apriori” can be realized. The procedure of this method is now explained with reference to FIG. 31, using the database of FIG. 33 as a simple example of association rule discovery. Each record in the database of FIG. 33 has a record ID and includes items expressed by integers equal to or more than 1.
An itemset having k items is called a k-itemset. (k is an integer equal to or more than 2.) A set of k-itemsets whose frequencies are equal to or more than a minimum frequency is called a large-itemset Lk having length k. A set of k-itemsets that are potential elements of the large-itemset Lk is called a candidate-itemset Ck having length k. Namely, the k-itemsets in Ck whose frequencies are equal to or more than the minimum frequency are selected to be elements of Lk.
At a user input step 100 in FIG. 31, a minimum frequency and a minimum confidence are obtained from the user through a user input unit 10. At an L1 generating step 110, a candidate-itemset verifying unit 21 selects records from a database 1 one by one, counts the occurrences of each item in the record, and increments the count (frequency) of the item. If a new item appears, a counting area for the new item is newly provided. After all the items in all the records have been counted, only items whose total counted frequency is more than the minimum frequency are registered in a hash-tree.
FIG. 34 shows the case in which the frequency of each of five items, 1, 2, 3, 4, and 5, is more than the minimum frequency and the five items have been registered in the hash-tree. Both ends of each branch of the hash-tree are called nodes, and item numbers are generally assigned to the nodes. The node at the beginning of the hash-tree, to which no item number is assigned, is called the root. The number of branches from the root to the last node is called the branch length. Therefore, each branch length in FIG. 34 is 1.
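The L1 generating step described above may be sketched as follows. This is an illustrative Python sketch only: the function name is hypothetical, each item is counted at most once per record, and the threshold is applied as "equal to or more than", following the definition of the large-itemset given earlier.

```python
from collections import Counter


def generate_l1(records, min_frequency):
    """Count, for each item, the number of records that include it, and
    keep the items whose count reaches the minimum frequency."""
    counts = Counter()
    for record in records:
        # set() ensures an item is counted once per record (assumption).
        counts.update(set(record))
    return {item: n for item, n in counts.items() if n >= min_frequency}


# Hypothetical sample data.
records = [[1, 2, 3], [1, 2], [2, 3], [4]]
l1 = generate_l1(records, min_frequency=2)
```

In this sample, item 4 appears in only one record and is therefore not registered.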
At a Ck generating step 120, a candidate-itemset Ck is generated from a large-itemset Lk−1 having length k−1 by a candidate-itemset generating unit 22. In the initial state shown in FIG. 34, k equals 2 and C2 is generated from L1.
Now, the case of C3 being generated from L2 will be explained. FIG. 35 illustrates a hash-tree made up to the state of L2. The detail of the Ck generating step of FIG. 31 is shown in a block diagram of FIG. 32 having two-step procedures: a join step 121 and a prune step 122.
Referring to FIGS. 35 and 36A, the join step 121 is described with reference to a node (called an original node here) at the end of a branch of length k−1. New branches are extended in order to make child nodes for the original node. The child nodes are given those item numbers which are assigned to other nodes having the same parent node as the original node and which are larger than the item number assigned to the original node. For instance, the node shown as root→1→3 in FIG. 35 is expressed by [1,3]. (This expression is used hereinafter for representing a node in the hash-tree.) With respect to the node [1,3], the nodes [1,4] and [1,5], which have the same parent node as the node [1,3] and whose item numbers 4 and 5 are larger than 3, are joined to the node [1,3] to make [1,3,4] and [1,3,5]. With respect to the node [1,4], the node [1,5] is joined to make [1,4,5]. No new branch is extended for the node [1,5] because there is no item number larger than 5 under [1]. The above procedure is illustrated in FIG. 36A as the state before pruning.
The prune step 122 will now be explained. At the join step 121, branches indicating itemsets of length k have been made. Then, for every (k−1)-itemset made by deleting one item from a k-itemset, it is checked whether or not the (k−1)-itemset is included in Lk−1. Only when all the (k−1)-itemsets are included in Lk−1 is the k-itemset kept. If there is at least one (k−1)-itemset which is not included in Lk−1, the k-itemset is deleted.
For instance, in the case of checking [1,3,4], the three 2-itemsets [1,3], [1,4], and [3,4] are checked as to whether they are in L2. Referring to FIG. 36A, as all three of these itemsets are included in L2, [1,3,4] is left. In the case of checking [1,3,5], the three 2-itemsets [1,3], [1,5], and [3,5] are checked as to whether they are in L2. Since [3,5] does not exist in L2, [1,3,5] is deleted.
All the k-itemsets made in the join step 121 are checked at the prune step 122. A hash-tree after the pruning is shown in FIG. 36B.
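The join step and the prune step may be sketched in Python as follows. The sketch represents Lk−1 as a set of sorted tuples rather than as a hash-tree, and the contents of L2 below are hypothetical values chosen only to be consistent with the examples in the text; they are not taken from FIG. 35.

```python
from itertools import combinations


def join_step(large_prev):
    """Extend each (k-1)-itemset with the last item of every sibling
    (same first k-2 items) whose last item is larger."""
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates


def prune_step(candidates, large_prev):
    """Keep a k-itemset only if every (k-1)-itemset made by deleting
    one item from it is included in the previous large-itemset."""
    return {
        c for c in candidates
        if all(sub in large_prev for sub in combinations(c, len(c) - 1))
    }


# Hypothetical L2: [3,5] and [4,5] are assumed not to be large.
l2 = {(1, 3), (1, 4), (1, 5), (3, 4)}
before_pruning = join_step(l2)               # state before pruning
after_pruning = prune_step(before_pruning, l2)
```

With this L2, the join step yields [1,3,4], [1,3,5], and [1,4,5]. The prune step deletes [1,3,5] because [3,5] is not in L2 and, under the assumed L2, also deletes [1,4,5] because [4,5] is not in L2.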
After the Ck generating step 120, an Lk generating step 130 is performed by the candidate-itemset verifying unit 21. In the Lk generating step 130, records are selected from the database one by one to count the frequency of each k-itemset in Ck. Then, only k-itemsets whose frequencies are more than the minimum frequency are left to be elements of Lk. In counting the frequencies, matching is performed between each record and the hash-tree.
Now, the matching will be explained. Records are selected from the database one by one, and each selected record is checked along the hash-tree from the root. First, it is checked whether or not an item corresponding to a child node of the root exists in the record. If such an item does not exist in the record, the matching for the record is finished and the next record is checked. When such an item exists in the record, it is checked whether or not an item corresponding to a next child node (a grandchild node of the root) exists in the record.
This operation is repeatedly performed. If a child node does not have any branch at its lower level in the hash-tree, such a child node is called a leaf. In checking a record, when a node's child node is a leaf and an item corresponding to the leaf exists in the record, the frequency count for the leaf node is incremented, which means that the matching for the record is completed. When the matching has been completed for every record, the frequency count value for each leaf node represents the frequency of the corresponding k-itemset (the itemset from the root to the leaf node).
As stated above, the frequency of each k-itemset is counted, and then a large-itemset Lk is generated by selecting k-itemsets having frequencies more than the minimum frequency as the elements of the large-itemset Lk.
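The result of the Lk generating step can be illustrated by the following Python sketch. Note that the sketch deliberately replaces the hash-tree matching with a plain subset test per candidate, which gives the same counts but not the same efficiency; the function name, the sample data, and the use of "equal to or more than" as the threshold are assumptions.

```python
def generate_lk(records, candidates, min_frequency):
    """Count how many records include each candidate k-itemset and keep
    the candidates whose count reaches the minimum frequency."""
    counts = {c: 0 for c in candidates}
    for record in records:
        items = set(record)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_frequency}


# Hypothetical sample data.
records = [[1, 3, 4], [1, 3, 4, 5], [1, 3], [3, 4]]
c3 = {(1, 3, 4), (1, 3, 5)}
l3 = generate_lk(records, c3, min_frequency=2)
```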
If, at the Lk generating step, no k-itemset is selected to be an element of Lk, the process goes to a candidate rule generating step 150. A candidate rule means a rule that may possibly be defined as an association rule. When there is a k-itemset to be included in Lk, one is added to k and the process goes back to the Ck generating step 120.
At the candidate rule generating step 150, a candidate rule generating unit 41 generates candidates for the association rule from the large-itemsets generated through the previous steps. Here, k candidate rules are generated from each k-itemset in Lk. The right hand side (RHS) of each of the k candidate rules consists of one item out of the items in the k-itemset. The left hand side (LHS) consists of the k−1 items made by deleting the item in the RHS from the items in the k-itemset. This candidate rule generating process is performed for all the k-itemsets in Lk for k ≥ 2.
The RHS indicates an itemset to be a conclusion of the association rule and the LHS indicates itemsets to be conditions of the association rule. The definition of the RHS and LHS is also applied to the candidate association rules.
At a rule testing step 160, a confidence calculating unit 42 calculates the confidence of each candidate rule. When the confidence of a candidate rule is larger than the minimum confidence, the candidate rule is added to an association rule set. As stated above, the confidence of the candidate rule A1, A2, . . . , Ak → B is calculated by the following expression:
confidence=s(A1,A2, . . . ,Ak,B)/s(A1,A2, . . . ,Ak)
where s(θ) is the frequency of itemset θ.
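The candidate rule generating step and the rule testing step may be sketched together in Python as follows. The sketch assumes that the frequencies s(θ) counted in the earlier steps are available in a dictionary keyed by sorted tuples; that dictionary, its values, and the function name are hypothetical.

```python
def generate_rules(frequencies, k_itemset, min_confidence):
    """Generate the k candidate rules of a k-itemset (one item as RHS,
    the remaining k-1 items as LHS) and keep those whose confidence
    s(LHS + RHS) / s(LHS) is larger than the minimum confidence."""
    rules = []
    both = frequencies[k_itemset]
    for rhs in k_itemset:
        lhs = tuple(item for item in k_itemset if item != rhs)
        conf = both / frequencies[lhs]
        if conf > min_confidence:
            rules.append((lhs, rhs, conf))
    return rules


# Hypothetical frequency table s(...) built during the counting steps.
s = {
    (1,): 4, (3,): 5, (4,): 3,
    (1, 3): 3, (1, 4): 2, (3, 4): 3,
    (1, 3, 4): 2,
}
rules = generate_rules(s, (1, 3, 4), min_confidence=0.7)
```

In this sample the three candidate rules 3, 4 → 1, 1, 4 → 3, and 1, 3 → 4 have confidences 2/3, 1, and 2/3 respectively, so only 1, 4 → 3 is added to the association rule set.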
Another related art is disclosed by R. Srikant et al. as an improvement of Apriori, entitled “Mining Generalized Association Rules”, in Proceedings of the 21st VLDB Conference, 1995. The following procedures are described in this improved art:
(1) obtaining association rules using Apriori; and
(2) eliminating statistically meaningless rules from the association rules obtained at (1), by using a chi-square test. Consequently, an association rule whose confidence is larger than a minimum confidence and which also has statistical meaning can be selected.
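How a chi-square test might be applied to a single rule can be sketched as follows. The cited article's exact procedure is not reproduced here; the sketch assumes the standard chi-square statistic for a 2×2 contingency table (LHS present or absent against RHS present or absent) and the usual 5% critical value of 3.841 for one degree of freedom. The function names and counts are hypothetical.

```python
def chi_square(n, n_lhs, n_rhs, n_both):
    """Chi-square statistic of the 2x2 table built from n records,
    n_lhs records with the LHS, n_rhs with the RHS, n_both with both."""
    a = n_both                          # LHS and RHS
    b = n_lhs - n_both                  # LHS only
    c = n_rhs - n_both                  # RHS only
    d = n - n_lhs - n_rhs + n_both      # neither
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # assumption: degenerate tables are treated as not significant
    return n * (a * d - b * c) ** 2 / denominator


CRITICAL_VALUE_5_PERCENT = 3.841  # one degree of freedom, 5% significance level


def is_significant(n, n_lhs, n_rhs, n_both):
    return chi_square(n, n_lhs, n_rhs, n_both) > CRITICAL_VALUE_5_PERCENT
```

For example, with 100 records of which 50 contain the LHS, 50 contain the RHS, and 40 contain both, the statistic is 36.0 and the rule would be kept; with only 25 records containing both, the statistic is 0 and the rule would be eliminated.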
The operation of the hash-tree storing itemsets is very significant for the algorithm used in discovering association rules. When the number of itemsets is large, the hash-tree may be too large to fit in the computer memory. When there are this many kinds of itemsets, paging of hash-tree node data happens at the time of matching between a record and the hash-tree, which enormously decreases data processing speed.
In calculating frequencies at the Lk generating step, a record is selected from the database one by one to perform matching between each record and the hash-tree. The matching process between a record and the hash-tree, which is performed by recursively utilizing a matching function, will now be explained.
As shown in FIG. 56, a hash-tree node and a partial sequence (p) of a record are used as parameters of the matching function. When the matching function is first called, the root is used as the node and the whole record is used as the partial sequence. The partial sequence is the set of items located at and after a specific position in the record. For instance, {2,3} is the partial sequence at and after the second item in the record {1,2,3}.
FIG. 56 is a block diagram showing a conventional matching function. At a step 2100, it is checked whether or not the input hash-tree node is at the end of a branch having a leaf node at its other end; in other words, it is checked whether or not the input hash-tree node is one level above a leaf node.
If the input hash-tree node is not one level above a leaf node, at a step 2110, a hash function is applied to the first item (i) of the partial sequence in order to examine whether or not a node corresponding to the item i (nodei) exists at the lower level. When the corresponding node (nodei) exists at the lower level, the matching function is recursively called at a step 2120, using as parameters the node (nodei) and another partial sequence made by deleting the item (i) from the original partial sequence (p). The original partial sequence (p) is then updated at a step 2130 by deleting the item (i) from it. If the corresponding node (nodei) does not exist at the lower level at the step 2110, only the partial sequence is updated at the step 2130, without recursively calling the matching function.
The process from the step 2110 to 2130 is repeatedly performed until it is judged at a step 2140 that all the items in the original partial sequence have been deleted, meaning no item exists in the partial sequence (p).
When the node input at the step 2100 is one level above a leaf node, the hash function is applied to each item (i) in the parameter partial sequence (p) at a step 2150. Then, if the corresponding leaf node (nodei) exists, the frequency of the leaf node is increased by one.
FIG. 57 shows the case that the matching function is applied to the hash-tree root and a partial sequence {1,2,3} of a record {1,2,3}. As the height of the hash-tree is 2, the matching function is recursively utilized.
[1] exists at the lower level of the root. After the steps 2100 and 2110, the matching function where the node [1] and the partial sequence {2,3} are used as parameters is recursively utilized at the step 2120. The partial sequence {1,2,3} is updated to {2,3} at the step 2130. Similarly, at the step 2120, after the steps 2100 and 2110, the matching function where the node [2] and the partial sequence {3} are used as parameters is recursively utilized. At the step 2130, the partial sequence {2,3} is updated to {3}. Then, at the step 2120, after the steps 2100 and 2110, the matching function where the node [3] and the partial sequence { } are used as parameters is recursively utilized. At the step 2130, the partial sequence {3} is updated to { }. As no item exists in the partial sequence at the step 2140, the loop process is finished.
The operation of the matching function where the node [1] and the partial sequence {2,3} are used will now be explained. This node is judged at the step 2100 to be one level above a leaf node. Then, at the step 2150, the hash function is applied to each of the items 2 and 3 in the partial sequence. At a step 2160, it is checked whether or not there are branches to the nodes [1,2] and [1,3], and, as there are, the process goes to the next step. At a step 2170, the frequency of each of the nodes [1,2] and [1,3] is increased by one. Similarly, when the recursively called matching function is applied to the node [2] and the partial sequence {3}, the frequency of the node [2,3] is increased by one. When the matching function is applied to the node [3] and the partial sequence { }, the process is finished without increasing any frequency.
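The conventional matching function walked through above can be sketched in Python as follows. This is an illustrative reading of the steps 2100 to 2170, not the implementation of FIG. 56: the `Node` class and helper names are hypothetical, a dictionary lookup stands in for the hash function at each node, and the tree contents are assumed from the description of FIG. 57.

```python
class Node:
    """Hash-tree node; `children` maps an item number to a child node."""
    def __init__(self):
        self.children = {}
        self.count = 0


def insert(root, itemset):
    node = root
    for item in itemset:
        node = node.children.setdefault(item, Node())
    return node


def is_one_level_above_leaf(node):
    # Step 2100: the node has children and none of them has children.
    return bool(node.children) and all(
        not child.children for child in node.children.values())


def match(node, partial):
    if is_one_level_above_leaf(node):
        for item in partial:                   # step 2150
            leaf = node.children.get(item)     # step 2160
            if leaf is not None:
                leaf.count += 1                # step 2170
        return
    while partial:                             # step 2140
        child = node.children.get(partial[0])  # step 2110
        rest = partial[1:]
        if child is not None:
            match(child, rest)                 # step 2120
        partial = rest                         # step 2130


# Tree assumed from the text: [1,2], [1,3], [2,3], and a childless node [3].
root = Node()
for itemset in [(1, 2), (1, 3), (2, 3), (3,)]:
    insert(root, itemset)
match(root, (1, 2, 3))
```

After the record {1,2,3} has been matched, the counts of the nodes [1,2], [1,3], and [2,3] are each 1, and the call for the node [3] with the empty partial sequence changes nothing, as in the walk-through above.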
In the conventional art, candidates for the association rules are generated based on large-itemsets having frequencies more than the minimum frequency. Therefore, if a large-itemset has a both sides (BS) frequency, meaning a frequency of itemset made of all the items in RHS and LHS, less than a minimum frequency, it is impossible to obtain an association rule for the case. Namely, a negative association rule showing a tendency, for instance, that an item and another item rarely appear in the same record, can not be obtained.
Negative association rules sometimes provide, depending upon the database, information as significant as that provided by positive association rules. For instance, the following information can be obtained by negative association rules: the information, based on machine maintenance data, that machines given treatment A seldom have failure B, and the information, based on product manufacturing data, that products made of material C seldom have defect D. Since BS frequencies of negative association rules are very low, it is impossible to obtain the negative association rules by using only large-itemsets. Conventionally, an association rule for which the frequencies of the RHS and LHS are more than a minimum frequency but the BS frequency is less than the minimum frequency cannot be obtained, even if the association rule has statistical significance.
The conventional art “Apriori” discovers association rules based on the criterion “minimum confidence” and sometimes discovers statistically useless rules. Namely, association rules obtained by Apriori are not always of good quality. In the other conventional art by Srikant et al., statistically useless rules are removed from the results obtained by Apriori by performing a chi-square test, which needs more processing than Apriori alone. In addition, some rules which are statistically significant and would be judged significant by the chi-square test are not discovered because their confidences are lower than the minimum confidence.
In the conventional art, it is impossible to effectively and respectively discover a positive association rule and a negative association rule.
In the conventional art, it is impossible to effectively obtain association rules because users need to input a minimum frequency and tests may be performed even for candidate rules having useless frequencies.
In the conventional art, there is no means for effectively appointing candidate items to be included in RHS or LHS of an association rule. For instance, the rule, in which 2 or 4 is included in RHS and 1 or 3 is included in LHS, can not be appointed. Therefore, the needed association rule is obtained only after all the steps being completed, which contains many useless processes.
In the conventional art, there is no means for effectively appointing items to be certainly included in RHS and LHS of an association rule. For instance, the rule, in which 1 and 4 are certainly included in RHS and 2 is certainly included in LHS, can not be appointed. Therefore, the needed association rule is obtained only after all the steps being completed, which contains many useless processes.
In the conventional art, it is impossible to restrict a domain in the database for obtaining association rules. For instance, rules cannot be obtained from only those records which include an item 3.
In the conventional art, numbers such as integers corresponding to items are assigned regardless of frequencies of the items in the database. Therefore, a hash-table size at each hash-tree node is indefinite, which makes the hash function complicated.
In the conventional art, the matching between each record and a hash-tree is performed for all the combinations of items in the record with the hash-tree. Therefore, when the record is long, matching efficiency is extremely deteriorated.
In the conventional art, there is no means for effectively handling a request that, for instance, 2 and 4 should not appear at the same time in association rules. Therefore, the requested association rules can be obtained by deleting non-requested association rules (rules including 2 and 4 at the same time, in this case) only after all the steps have been completed, which involves many useless processes. An example of why such a request arises will now be explained. If customer purchasing record data at a retail shop includes an item “male” and another item “female”, there is a possibility of discovering a useless negative association rule such as “the case of male never means the case of female”. In addition, in discovering positive association rules, an itemset including both the items “male” and “female” may be retrieved and verified because the itemset is included in a candidate itemset, which means useless processes are executed.
In the conventional art, when a hash-tree becomes too large to be included in a memory, paging of hash-tree node data would happen at the time of matching the record with the hash-tree, which enormously decreases processing speed. In the case of the hash-tree being unbalanced, the paging still may happen even after dividing the tree, because the divided hash-tree may be still too large to be within the memory.
In the conventional art, it is necessary to retrieve database at the rule generating step, which makes the process speed slow.
In the conventional art, there are a lot of candidate-itemsets, which also makes the process speed slow.
In the conventional art, a record in the database is selected one by one to perform matching between each record and the hash-tree, in calculating frequencies at the Lk generating step. Therefore, the processes of recursively utilizing the matching function are frequent, which makes the matching process speed slow.
The present invention is provided to solve the above-mentioned problems. It is an object of the present invention to provide an association rule discovering method by which association rules can be effectively obtained.
It is another object of the present invention to provide an association rule discovering method by which association rules including negative association rules can be obtained regardless of a BS (both sides of the left hand side and the right hand side of an association rule) frequency.
It is another object of the present invention to provide an association rule discovering method by which only statistically significant association rules are obtained.
It is another object of the present invention to provide an association rule discovering method by which only positive association rules or only negative association rules are effectively selected.
It is another object of the present invention to provide an association rule discovering method in which the user does not need to input a minimum frequency and no candidate rule having a useless frequency is tested. Accordingly, the association rules can be effectively obtained.
It is another object of the present invention to provide an association rule discovering method by which association rules regarding specific items can be effectively obtained.
It is another object of the present invention to provide an association rule discovering method by which association rules, in the case of the domain being restricted, can be effectively selected.
It is another object of the present invention to provide an association rule discovering method in which paging in the hash-tree is suppressed in order to perform high-speed processing.
It is another object of the present invention to provide an association rule discovering method by which a set of items not to appear in the association rule at the same time is appointed.
It is another object of the present invention to provide an association rule discovering method by which it is not necessary to retrieve the database. Accordingly, the processing time can be reduced.
It is another object of the present invention to provide an association rule discovering method by which the number of candidate-itemsets is lessened. Accordingly the processing time can be reduced.
It is another object of the present invention to provide an association rule discovering method by which the number of times the hash function is recursively utilized is reduced. Accordingly, high speed matching can be performed.
It is another object of the present invention to provide an association rule discovering apparatus by which association rules including negative association rules can be effectively obtained regardless of a BS frequency.
It is another object of the present invention to provide an association rule discovering apparatus by which paging in the hash-tree is suppressed in order to perform high-speed processing.
It is another object of the present invention to provide an association rule discovering apparatus by which the number of times the hash function is recursively utilized is reduced. Accordingly, high speed matching can be performed.
According to one aspect of the present invention, a method for discovering an association rule existing between itemsets composed of one or more than one items, from a database storing a plurality of records composed of one or more than one items,
where k is an integer equal to or more than 2, and n indicates an integer from 1 to k,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as a large-itemset Ln,
the method comprises the steps of:
(a) user's inputting a parameter necessary for obtaining the association rule,
(b) generating a large-itemset, this step includes the steps of:
(b1) generating a large-itemset L1 by counting a frequency meaning a number of records including each item, and defining an itemset composed of items made of the each item having a frequency equal to or more than the lower limit value Smin as the large-itemset L1;
(b2) generating a candidate-itemset Ck by using a large-itemset Lk−1 and the large-itemset L1; and
(b3) generating a large-itemset Lk by selecting the large-itemset Lk from the candidate-itemset Ck; and
(c) generating and testing a hypothesis, this step includes the steps of:
(c1) generating a candidate association rule by using the large-itemset Lk−1 and the large-itemset L1, where the large-itemset Lk−1 is defined as a condition itemset called a left hand side (LHS) and the large-itemset L1 is defined as a conclusion itemset called a right hand side (RHS); and
(c2) testing a rule for testing the candidate association rule to be one of applied and not-applied as the association rule.
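Step (c1) may be pictured by the following Python sketch, in which each (k−1)-itemset in Lk−1 is paired as the LHS with each item in L1 as the RHS. The sketch is an assumption-laden illustration: the function name and sample values are hypothetical, and the exclusion of an RHS item already contained in the LHS is an assumption not stated in the text. Because the pairing does not require the combined itemset to be a large-itemset, a candidate rule can be generated regardless of its BS frequency.

```python
def candidate_rules(large_prev, large_1):
    """Pair every (k-1)-itemset (LHS) with every single item (RHS)."""
    return [
        (lhs, rhs)
        for lhs in sorted(large_prev)
        for rhs in sorted(large_1)
        if rhs not in lhs  # assumption: an item is not both condition and conclusion
    ]


# Hypothetical large-itemsets.
l2 = {(1, 3), (1, 4)}
l1 = {1, 3, 4, 5}
rules = candidate_rules(l2, l1)
```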
According to another aspect of the present invention, a method for discovering an association rule existing between itemsets composed of one or more than one items, from a database storing a plurality of records composed of one or more than one items,
where k is an integer equal to or more than 2, and n indicates an integer from 1 to k,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as a large-itemset Ln,
the method comprises the steps of:
(a) generating a large-itemset L1 by counting a frequency meaning a number of records including each item, and defining an itemset composed of items made of the each item which has a frequency equal to or more than the lower limit value Smin as the large-itemset L1;
(b) generating a candidate-itemset Ck by extending a branch of a hash-tree which stores a large-itemset Lk−1;
(c) dividing the hash-tree into partial trees to be within a specific amount;
(d) generating a large-itemset Lk by selecting the large-itemset Lk based on a matching between each divided partial tree and the database;
(e) generating a candidate association rule; and
(f) testing a rule for testing the candidate association rule to be one of applied and not-applied as the association rule.
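One conceivable way of carrying out step (c) is sketched below in Python. The text does not specify the dividing strategy at this point, so everything in the sketch is an assumption: the hash-tree is modeled as nested dictionaries, the "specific amount" is measured as a node count, and the subtrees under the root are grouped greedily in order. A single subtree larger than the limit is left as its own partial tree and would need further division.

```python
def count_nodes(tree):
    """Number of nodes below the given (sub)tree root."""
    return sum(1 + count_nodes(child) for child in tree.values())


def divide_tree(tree, max_nodes):
    """Group the root's subtrees into partial trees of at most
    max_nodes nodes each, where possible."""
    parts, current, size = [], {}, 0
    for item, subtree in tree.items():
        n = 1 + count_nodes(subtree)
        if current and size + n > max_nodes:
            parts.append(current)
            current, size = {}, 0
        current[item] = subtree
        size += n
    if current:
        parts.append(current)
    return parts


# Hypothetical hash-tree: branches [1,3,4], [1,4], [2,3], [3,4].
tree = {1: {3: {4: {}}, 4: {}}, 2: {3: {}}, 3: {4: {}}}
parts = divide_tree(tree, max_nodes=4)
```

Each partial tree would then be matched against the database separately in step (d).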
According to another aspect of the present invention, a method for discovering an association rule existing between itemsets composed of one or more than one items, from a database storing a plurality of records composed of one or more than one items,
where k is an integer equal to or more than 2, and n indicates an integer from 1 to k,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn,
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as a large-itemset Ln, and
where a memory amount that can be used for generating the association rule in a computer is defined as an allowable amount, and a large-itemset file is used for storing data on the large-itemset Ln,
the method comprises the steps of:
(a) generating a large-itemset L1 by counting a frequency meaning a number of records including each item, defining an itemset composed of items made of the each item having a frequency equal to or more than the lower limit value Smin as the large-itemset L1, assigning optional continuous numbers to the each item in the large-itemset L1, and storing data on the large-itemset L1 in the large-itemset file;
(b) reading the large-itemset file for reading data on (k−1)-itemsets in a large-itemset Lk−1 from the large-itemset file; and storing the data in a hash-tree;
(c) generating a candidate-itemset Ck by extending a branch of the hash-tree;
(d) checking an amount by comparing an amount of the hash-tree which stores the candidate-itemset Ck with a specific amount less than the allowable amount, going back to the step of reading the large-itemset file when the amount of the hash-tree is less than the specific amount, and going to a next step when the amount of the hash-tree is equal to or more than the specific amount;
(e) generating a large-itemset Lk by selecting the large-itemset Lk based on a matching between the candidate-itemset Ck and the database; and
(f) generating a rule by generating a candidate association rule and testing the candidate association rule to be one of applied and not-applied as an association rule.
According to another aspect of the present invention, a method for discovering an association rule existing between itemsets composed of one or more than one items, from a database storing a plurality of records composed of one or more than one items,
where k is an integer equal to or more than 2, and n indicates an integer from 1 to k,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as a large-itemset Ln,
the method comprises the steps of:
(a) generating a large-itemset L1 by counting a frequency meaning a number of records including each item, and defining an itemset composed of items made of the each item having a frequency equal to or more than the lower limit value Smin as the large-itemset L1;
(b) generating a candidate-itemset Ck;
(c) generating a large-itemset Lk by performing a matching between a set of records in the database and a hash-tree storing the candidate-itemset Ck, and selecting the large-itemset Lk;
(d) generating a candidate association rule; and
(e) testing a rule for testing the candidate association rule to be one of applied and not-applied as the association rule.
According to another aspect of the present invention, an apparatus for discovering an association rule existing between itemsets composed of one or more than one items, where k is an integer equal to or more than 2 and n indicates an integer from 1 to k, the apparatus comprises:
(a) a database storing a plurality of records composed of one or more than one items;
(b) a user input unit for inputting a parameter necessary for obtaining the association rule;
(c) a memory area for storing a large-itemset Ln,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as the large-itemset Ln;
(d) a large-itemset generating unit, this unit includes:
(d1) a candidate-itemset verifying unit for
counting a frequency meaning a number of records including each item,
defining, as a large-itemset L1, an itemset composed of each item which has a frequency equal to or more than the lower limit value Smin, and
selecting a large-itemset Lk from a candidate-itemset Ck; and
(d2) a candidate-itemset generating unit for generating the candidate-itemset Ck by using a large-itemset Lk−1 and the large-itemset L1;
(e) a hypothesis generating and testing unit, this unit includes:
(e1) a candidate rule generating unit for generating a candidate association rule by using the large-itemset Lk−1 and the large-itemset L1, in the candidate association rule the large-itemset Lk−1 is defined as a condition itemset called a left hand side (LHS) and the large-itemset L1 is defined as a conclusion itemset called a right hand side (RHS); and
(e2) a rule testing unit for testing the candidate association rule to be one of applied and not-applied as the association rule.
According to another aspect of the present invention, an apparatus for discovering an association rule existing between itemsets composed of one or more than one items, where k is an integer equal to or more than 2 and n indicates an integer from 1 to k, the apparatus comprises:
(a) a database storing a plurality of records composed of one or more than one items;
(b) a memory area for storing a large-itemset Ln,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as the large-itemset Ln;
(c) a large-itemset generating unit, this unit includes:
(c1) a hash-tree operating unit for dividing a hash-tree into partial trees to be within a specific amount;
(c2) a candidate-itemset verifying unit for
counting a frequency meaning a number of records including each item,
defining, as a large-itemset L1, an itemset composed of each item which has a frequency equal to or more than the lower limit value Smin, and
selecting a large-itemset Lk based on a matching between each divided partial tree and the database;
(c3) a candidate-itemset generating unit for generating a candidate-itemset Ck by extending a branch of the hash-tree which stores a large-itemset Lk−1;
(d) a hypothesis generating and testing unit, this unit includes:
(d1) a candidate rule generating unit for generating a candidate association rule, and
(d2) a rule testing unit for testing the candidate association rule to be one of applied and not-applied as the association rule.
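By way of illustration only, the division of the hash-tree into partial trees and the matching of each partial tree with the database may be sketched as follows. The sketch assumes that the specific amount is a maximum number of candidates per partial tree and that a sorted list of candidates stands in for each partial tree; the names split_into_partial_trees, count_with_partial_trees, and max_size are hypothetical.

```python
def split_into_partial_trees(candidates, max_size):
    """Divide the candidates into portions of at most max_size entries.
    Sorting first keeps candidates that share a prefix (a branch of the
    tree) next to each other."""
    ordered = sorted(candidates)
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]


def count_with_partial_trees(candidates, records, smin, max_size):
    """Match each partial tree against the database in turn and collect
    the candidates whose frequency is >= smin."""
    large = {}
    for partial in split_into_partial_trees(candidates, max_size):
        counts = dict.fromkeys(partial, 0)
        for record in records:  # one database pass per partial tree
            present = set(record)
            for cand in partial:
                if present.issuperset(cand):
                    counts[cand] += 1
        large.update({c: n for c, n in counts.items() if n >= smin})
    return large
```

Only one partial tree is held at a time in this sketch, so the amount in use stays within the chosen bound at the cost of one database pass per partial tree.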
According to another aspect of the present invention, an apparatus for discovering an association rule existing between itemsets composed of one or more than one items, where k is an integer equal to or more than 2 and n indicates an integer from 1 to k, the apparatus comprises:
(a) a database storing a plurality of records composed of one or more than one items;
(b) a large-itemset file for storing data on a large-itemset Ln,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as the large-itemset Ln;
(c) a large-itemset generating unit, this unit includes:
(c1) a candidate-itemset verifying unit for
counting a frequency meaning a number of records including each item,
defining, as a large-itemset L1, an itemset composed of each item which has a frequency equal to or more than the lower limit value Smin, and
selecting a large-itemset Lk based on a matching between a candidate-itemset Ck and the database;
(c2) a candidate-itemset generating unit for generating the candidate-itemset Ck by extending a branch of a hash-tree; and
(c3) a hash-tree operating unit for
assigning arbitrary consecutive numbers to each item,
storing data on the large-itemset L1 in the large-itemset file,
reading data on (k−1)-itemsets in a large-itemset Lk−1 from the large-itemset file,
storing the data in a hash-tree,
comparing an amount of the hash-tree which stores the candidate-itemset Ck with a specific amount less than an allowable memory amount of a computer, and
going back to a step of reading the large-itemset file when the amount of the hash-tree is less than the specific amount, and going to a next step when the amount of the hash-tree is equal to or more than the specific amount; and
(d) a hypothesis generating/testing unit for generating a candidate association rule and testing the candidate association rule to be one of applied and not-applied as the association rule.
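The numbering of items and the use of the large-itemset file by the hash-tree operating unit may be sketched, by way of illustration only, as follows. The sketch assumes a text file with one itemset per line in the form "numbers : frequency"; this layout and the names number_items, write_large_itemsets, and read_large_itemsets are hypothetical and are not taken from the specification.

```python
def number_items(item_frequencies):
    """Assign consecutive numbers 1, 2, 3, ... to the large items.
    Alphabetical order is used here; any fixed order would serve."""
    return {item: n for n, item in enumerate(sorted(item_frequencies), start=1)}


def write_large_itemsets(fileobj, itemsets, numbering):
    """Store each large-itemset as its sorted item numbers and frequency."""
    for itemset, freq in itemsets.items():
        ids = sorted(numbering[item] for item in itemset)
        fileobj.write(" ".join(map(str, ids)) + f" : {freq}\n")


def read_large_itemsets(fileobj):
    """Read the itemsets back one line at a time, so that only the part
    currently being stored in the hash-tree needs to be in memory."""
    for line in fileobj:
        ids, freq = line.split(":")
        yield tuple(int(x) for x in ids.split()), int(freq)
```

Because the reader is a generator, a caller may stop reading once the hash-tree reaches the specific amount and resume from the same position afterwards.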
According to another aspect of the present invention, an apparatus for discovering an association rule existing between itemsets composed of one or more than one items, where k is an integer equal to or more than 2 and n indicates an integer from 1 to k, the apparatus comprises:
(a) a database storing a plurality of records composed of one or more than one items;
(b) a memory area for storing a large-itemset Ln,
when an itemset is composed of n items and a frequency, meaning a number of records including the n items, of the itemset has not been checked, the itemset is defined as a candidate-itemset Cn, and
when a frequency of the candidate-itemset Cn is equal to or more than a lower limit value Smin, the candidate-itemset Cn is defined as the large-itemset Ln;
(c) a large-itemset generating unit, this unit includes:
(c1) a candidate-itemset verifying unit for
counting a frequency meaning a number of records including each item,
defining, as a large-itemset L1, an itemset composed of each item which has a frequency equal to or more than the lower limit value Smin, and
selecting a large-itemset Lk based on a matching between a set of records in the database and a hash-tree which stores a candidate-itemset Ck; and
(c2) a candidate-itemset generating unit for generating the candidate-itemset Ck;
(d) a hypothesis generating and testing unit, this unit includes:
(d1) a candidate rule generating unit for generating a candidate association rule; and
(d2) a rule testing unit for testing the candidate association rule to be one of applied and not-applied as the association rule.
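By way of illustration only, the matching between a record and a tree storing the candidate-itemset Ck may be sketched as follows. A simplified prefix tree, whose nodes look up child branches through a dictionary, stands in for the hash-tree; the names Node, insert, match, and collect are hypothetical, and each record is assumed to be sorted and free of duplicate items before matching.

```python
class Node:
    def __init__(self):
        self.children = {}  # item -> child Node
        self.count = None   # frequency counter, set only where a candidate ends


def insert(root, candidate):
    """Store one candidate (a sorted tuple of items) in the tree."""
    node = root
    for item in candidate:
        node = node.children.setdefault(item, Node())
    node.count = 0


def match(node, record, start=0):
    """Walk a sorted record through the tree and increment the counter
    of every stored candidate that the record contains."""
    if node.count is not None:
        node.count += 1
    for i in range(start, len(record)):
        child = node.children.get(record[i])
        if child is not None:
            match(child, record, i + 1)


def collect(node, prefix=()):
    """Yield (candidate, frequency) pairs for all stored candidates."""
    if node.count is not None:
        yield prefix, node.count
    for item, child in node.children.items():
        yield from collect(child, prefix + (item,))
```

A caller would invoke match(root, sorted(set(record))) once per record and then keep the pairs from collect(root) whose frequency is equal to or more than Smin.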
Further scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.