1. Field of the Invention
The invention relates to a data mining apparatus for analyzing a large body of data stored in a data base and discovering association rules existing between attributes of the stored data.
2. Description of the Prior Art
A data mining apparatus discovers rules or causal relationships between data items from a large body of data stored in a data base. A typical example is the technology for mining association rules, which express relationships between the stored data items. As a specific example, the rule or association "when a data item A (subset) and a data item B (subset) exist in the same transaction, a data item C (subset) also commonly exists" is expressed as "A, B → C". A typical application of association rule mining is called basket analysis. Basket analysis determines associations among the goods that customers put in their baskets (or shopping bags) during a trip to a retail shop. In basket analysis, for example, the association rule "bread → milk" (a customer who buys bread also buys milk at the same time) can be obtained by association analysis of the accumulated sales receipt data.
The fundamental processing of the association analysis in the data mining system generates and verifies association rule candidates. In other words, the analysis generates association rule candidates from the combination of stored data items and verifies whether each candidate is interesting or not by counting the number of records satisfying the rule. Since it is not efficient to output every association rule, however, the conventional data mining system narrows the number of association rule candidates based on the criteria of support and confidence so that the useful association rules are found efficiently.
The support is a criterion signifying the generality of the association rule, and the confidence is a criterion signifying the accuracy of the association rule. Association rules are generally expressed by a logical formula of the form "A → B" accompanied by support and confidence values. Where it is assumed that A and B are non-empty, independent sets of data items, the support is expressed as the percentage, out of the total number of records, of records that include "A ∪ B", that is, records that contain every element of both A and B. The confidence is expressed as the ratio of records simultaneously including A and B to records including A. In the above-mentioned example of "bread → milk", if the percentage of customers who purchase bread is 20% and the percentage of customers who purchase both bread and milk is 12% out of all sales receipts (the total number of records), the support of the association rule "bread → milk" is 12% and the confidence thereof is 60% (= 12% / 20%).
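The two definitions above can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the receipt data are hypothetical, chosen only to reproduce the 20% and 12% figures of the bread and milk example; they are not taken from any cited publication.

```python
from fractions import Fraction


def support(records, items):
    """Fraction of all records that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for record in records if items <= set(record))
    return Fraction(hits, len(records))


def confidence(records, antecedent, consequent):
    """Records containing both sides of the rule, relative to records
    containing the antecedent. Assumes the antecedent occurs at least once."""
    both = support(records, set(antecedent) | set(consequent))
    return both / support(records, antecedent)


# Hypothetical receipts: 100 records, 20 containing bread,
# 12 of which also contain milk.
records = (
    [{"bread", "milk"}] * 12
    + [{"bread"}] * 8
    + [{"milk"}] * 30
    + [{"eggs"}] * 50
)

rule_support = support(records, {"bread", "milk"})
rule_confidence = confidence(records, {"bread"}, {"milk"})
print(float(rule_support), float(rule_confidence))  # 0.12 0.6
```

Exact fractions are used here so that the 12% / 20% = 60% arithmetic is reproduced without floating-point rounding.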
The conventional data mining apparatus sets lower threshold limits for support and confidence values when generating association rules, and discovers all association rules which exceed the lower threshold limits of both the support and confidence. A method for discovering the association rules is disclosed in detail, for example, in Laid-open Japanese patent publication No. 8-263346 or in Laid-open Japanese patent publication No. 8-287106. In the former patent publication, the apparatus initially generates association rule candidates which exceed the lower threshold limit of the support. This association rule generating step is disclosed in the latter patent publication No. 8-263346 in detail. Then, the apparatus examines the confidence of the association rule candidates, uses the candidates which exceed the lower threshold limit and outputs them as final association rules. In other words, the association rules obtained by this method are discovered based only on support and confidence. Therefore, other evaluation criteria, for example, contribution to sales or other user goals are not considered.
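The two-stage procedure described above (first keep candidates that satisfy the support threshold, then keep those that satisfy the confidence threshold) might be sketched as follows. This is a brute-force illustration only, not the candidate generation method of the cited publications; the function name `mine_rules`, the sample records, and the threshold values are hypothetical, and the thresholds are treated as inclusive lower limits by assumption.

```python
from itertools import combinations


def mine_rules(records, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples.

    Brute-force sketch: enumerates every item combination, so it is
    only practical for a handful of distinct items.
    """
    n = len(records)
    items = sorted(set().union(*records))

    def count(itemset):
        # Number of records containing every item of `itemset`.
        return sum(1 for record in records if itemset <= record)

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            both = count(itemset)
            # Stage 1: discard candidates below the support threshold.
            if both == 0 or both / n < min_support:
                continue
            # Stage 2: split the itemset into antecedent and consequent
            # and keep only splits that reach the confidence threshold.
            for k in range(1, size):
                for ante in combinations(combo, k):
                    antecedent = frozenset(ante)
                    conf = both / count(antecedent)
                    if conf >= min_confidence:
                        rules.append(
                            (antecedent, itemset - antecedent, both / n, conf)
                        )
    return rules


# Hypothetical sales receipts.
records = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread", "milk", "eggs"}, {"bread"}, {"milk"},
    {"eggs"}, {"eggs"}, {"milk", "eggs"}, {"bread", "eggs"},
]

rules = mine_rules(records, min_support=0.3, min_confidence=0.6)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), sup, round(conf, 2))
```

Note that the output is determined entirely by the two thresholds; nothing in the procedure reflects what the rules are worth to the user.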
The number of association rules obtained as a result of such a data mining system is generally large. Further, most of the obtained association rules are either not the rules which the user wishes to find or are meaningless. Therefore, the user has to pick out the useful rules which fit the user's purpose from the large number of association rules.
In order to solve this problem and to discover only the association rules useful to the user, it is necessary to use criteria for evaluating the usefulness of the association rules. For example, in Laid-open Japanese patent publication No. 8-77010, the evaluation criterion of an association rule is calculated from a cover ratio (corresponding to the above-mentioned support), expressed by the number of records in which the association rule holds, and a hit ratio (corresponding to the above-mentioned confidence), expressed by the correct answer ratio of the association rule.
"A Visualization Method for Association Rules" by Takeshi Fukuda and Shinichi Morishita, technical report of The Institute of Electronics, Information and Communication Engineers, 1995-05, pp. 41-48, discloses a method to eliminate "uninteresting association rules", namely, a method to filter out association rules that are not useful by statistically evaluating the support and the confidence.
The conventional data mining apparatus uses the support and the confidence as the evaluation criteria for the usefulness of the association rules. In other words, the association rules which have high generality (high support) and high accuracy (high confidence) are deemed useful association rules. Such evaluation criteria are effective for assessing the value of an association rule when the goal is simply to accurately express features of the stored data.
However, data mining is not used only for such a purpose; it is usually used for purposes such as decision-making and strategy. If the association rules obtained by data mining are applied to a certain purpose, for example, if the association rules obtained by the basket analysis stated above are applied to a sales promotion strategy, the association rules with high support and confidence are not always highly useful for the user's purpose (i.e., increasing sales). In this case, the association rule which is highly useful for the user's purpose is, for example, an association rule that can be relied upon to increase sales.
In this way, generally speaking, the value of an association rule may vary depending on how the user intends to use it. The uniform evaluation criteria of support and confidence used in the conventional data mining system do not always accurately evaluate the association rules relative to the user's purpose. In the conventional art, the value of an association rule is evaluated based only on the support and the confidence. Consequently, if the data mining is carried out to learn how much sales promotion can be achieved by using an association rule, or to highlight the association rules which could be used to predict large profits, there occurs a problem that the association rules cannot be evaluated for such purposes, since the support and the confidence have little to do with anticipating income.