1. Field of the Invention
The present invention is generally related to systems and processes of analyzing transactional database information to mine data item association rules and, in particular, to a system and method of backlinking reinforcement analysis of transactional data to establish emergent weighted association rules.
2. Description of the Related Art
Data mining systems and tools are utilized to determine associative relationships within data as contained in typically large-scale information databases. Where the source information represents, for example, commercial transactions conducted with respect to discrete items, association relationships between different items can be determined by analysis with relative degrees of accuracy and confidence. These association relationships can then be utilized for various purposes including, in particular, predicting likely consumer behaviors with respect to the set of items covered by the transaction data. In practical terms, the presentation and substance of product designs, marketing campaigns and the like can then be tailored efficiently to reflect consumer interest and demand.
Conventionally, the relationships mined from transactional information databases are collected as association rules within a reference database, generally referred to as an expert database. Each association rule is qualified, relative to the items in the relation, with a weight representing the significance or strength of the association between the items. A collected set of association rules can then be used to provide solutions to various problems presented as query assertions against the expert database. In conventional implementation, a relational trace through the expert database, discriminating between various relationship branches based on the associated relative weightings, allows a query to be resolved to a most highly correlated solution set of related items. The query itself may be represented as an identified item, item set, or attributes that are associated with the items identified within the expert database.
Automated association mining techniques, as opposed to manual processes of knowledge engineering used to create expert databases, are preferred particularly where the volume of data to be evaluated is large and where the usefulness of the mined associations degrades rapidly over time. Conventional automated association mining analysis techniques, however, are subject to a variety of limitations. In particular, the number of associations identified by the automated techniques tends to grow exponentially with the number of items identified within the transaction data. The performance of queries against an expert database naturally degrades with increases in the database size. Furthermore, many of the association rules generated may be irrelevant to the defined or even likely queries that will be asserted against the expert database.
Another problem is that variations in the underlying transactional data may affect the relative quality of the potential associations. The strength of the associations, as determined by the analysis, may be distorted by the number of times particular items are identified in the transactional data and by the distribution of the items within the larger set of transactions. Thus, the confidence in the determined strengths of the relationships identified by the automated analysis can vary significantly.
In conventional systems, association rules are generated through an algorithmic processing of a transaction data record set representing, for example, a series of commercial transactions. Depending on the nature of the source transactional data, item associations are initially identified based on the rate of occurrence of unique item pairings or, where a transaction involves multiple items, sets of items. The occurrence rate for a specific item set within the set of transaction data records is conventionally referred to as the item set support. As described in "Mining Association Rules between Sets of Items in Large Databases" by Agrawal, Imielinski and Swami, Proc. of the 1993 ACM SIGMOD Conf. on Management of Data, May 1993, pp. 207-216, a minimum support threshold can be established to discriminate out insignificant item sets. As described there, the threshold support value is empirically selected to represent a statistical significance determined from business reasons. In the example provided, the threshold minimum support value was set at 1%. Association rules having a support less than the threshold support value, representing associations of less than minimal significance, are discarded.
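The support calculation and minimum support threshold described above can be illustrated with a minimal sketch. The function names, the sample transactions, and the limit on item set size are assumptions made for illustration only; they are not drawn from the Agrawal article or from the present invention.

```python
from collections import Counter
from itertools import combinations


def item_set_support(transactions, max_size=2):
    """Return {frozenset(items): support} for item sets of up to max_size items.

    Support is the fraction of transaction records containing the item set.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter()
    for record in transactions:
        items = sorted(set(record))  # ignore duplicate items within a record
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    total = len(transactions)
    return {item_set: count / total for item_set, count in counts.items()}


def filter_by_min_support(supports, min_support=0.01):
    """Discard item sets whose support is below the threshold (1% by default)."""
    return {s: v for s, v in supports.items() if v >= min_support}


# Illustrative transaction records (assumed data).
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]

supports = item_set_support(transactions)
frequent = filter_by_min_support(supports, min_support=0.5)
```

In this sketch the threshold is raised to 50% only because the sample contains four records; with realistic data volumes a value such as the 1% cited above would be more typical.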
The Agrawal article also describes the use of syntactic constraints to reduce the size of the generated expert database. The items that are of interest for queries or, conversely, the items that are not of interest may be known in advance of rule generation. A corresponding constraint on the generation of association rules is implemented in the algorithmic examination of transaction data records with the result that only association rules of interest are generated and stored to the expert database.
Finally, the Agrawal article describes a technique for assessing the confidence of the strength of association rules. The technique presumes that, in discovering the solution set for a query, the relative validity of rule strengths in the solution paths can be normalized based on the relative representation of association rules within the transaction data set. The conventional calculation of confidence for a given association rule, as presented by Agrawal, is the fraction of the source transaction data records containing the antecedent item set that also support the association rule. That is, the confidence C of an association rule X → I, where X is an item set identified within a transaction data set T and I is a single item not in X, is the ratio of the support of X taken together with I to the support of X.
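As a worked illustration of the confidence ratio just described, the following minimal sketch computes the support of an item set and the confidence of a rule having an item set antecedent and a single-item consequent. The function names and the sample transactions are assumptions made for illustration only.

```python
def support(transactions, item_set):
    """Fraction of transaction records that contain every item in item_set."""
    wanted = set(item_set)
    if not transactions:
        return 0.0
    hits = sum(1 for record in transactions if wanted <= set(record))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the antecedent together with the consequent item,
    divided by the support of the antecedent alone."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule is undefined when the antecedent never occurs
    return support(transactions, set(antecedent) | {consequent}) / base


# Illustrative transaction records (assumed data).
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]

c_bread_milk = confidence(transactions, {"bread"}, "milk")
c_milk_butter_bread = confidence(transactions, {"milk", "butter"}, "bread")
```

Here bread appears in three of the four records and bread with milk in two, so the confidence of the first rule is two thirds; every record containing both milk and butter also contains bread, so the confidence of the second rule is one.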
The confidence determined for an association rule, as described in the Agrawal article, can be used as a threshold value for qualifying generated association rules for inclusion in the expert database. Association rules with a confidence level exceeding some defined minimum value are, in effect, deemed minimally reliable. The determination of the threshold confidence level is again empirical, based generally on an evaluation of the statistical insignificance of the rules excluded.
The support and confidence values determined for the minimally relevant and reliable association rules are conventionally stored with the corresponding rules within the expert database. Subsequent evaluation of queries against the expert database can utilize these support and confidence values, in part, to determine the optimal solution sets. U.S. Pat. No. 6,272,478, issued to Obata et al., describes the generally similar application of assigned evaluation values for association rules. Specifically, cost and sales values are assigned as attributes to association rules to permit evaluation of additional criteria in determining an optimal set of association rules to use in reaching a solution set for an applied query. The evaluation of these additional criteria permits, for example, selection of solution sets that optimize profitability. Where multiple items are specified in the antecedent and consequent terms of an association rule, mathematical formulas corresponding to the included item sets are used in the evaluation of the association rule. While the evaluation values and formulas may be stored in an item dictionary provided with the expert database, the evaluation values and formulas are derived independently of the support and confidence values.
The generation of an expert database with associations having defined minimum relevancy and reliability enables broad query assertions to be adequately resolved to solution sets of at least equal minimum relevancy and reliability. Any progressive evaluation of the support and confidence values of association rules applied in determining a solution set can be used to raise and change the minimum relevancy and reliability of the solution set reached. Furthermore, the additional consideration of independent evaluation criteria enables targeted factors to be considered in determining the ultimate solution set for an applied query.
The evaluation of additional, separately supplied information correlated to the transaction items thus permits the generated association rules set to be evaluated for a specific purpose. The accuracy and reliability of any solution sets generated, however, remain limited largely to the accuracy and reliability of the underlying association rules as a whole. Relationships that are potentially reflected in transaction data and that meet the minimum support and confidence criteria used by conventional mining techniques may not be substantially differentiated by conventionally derived association rules in any meaningful manner. Conventionally generated expert databases are therefore limited in the quality and extent of the information that can be derived from the databases.
Consequently, there is a need to provide for the automated generation of expert databases that supports degrees of accuracy and reliability well discriminated beyond the limits of the minimum support and confidence criteria used by conventional mining techniques.
Thus, a general purpose of the present invention is to provide an efficient system and methods of generating expert databases that can be used to support decision processes with a high and well-discriminated degree of accuracy and reliability.
This is achieved in the present invention by a system and methods that provide for the evaluation of transaction data records to first determine forward link associations between items as reference and related items identified by corresponding "expert" users as a basis for establishing expert database item association rules. The forward link associations are then evaluated to identify back link associations between reference and related items. Back link weights, corresponding to the respective back link associations and reflecting the depth and strength of the back linked associations, are determined and associated with the forward link associations to provide an augmented basis for subsequently evaluating the association rules collected into an expert database.
An advantage of the present invention is that the association rules used to construct an expert database provide a greater degree of reliability and accuracy in the solution sets obtained from queries against the expert database. Experts are preferably identified on a per-reference item basis, directly enabling identification of associations of high predictive significance.
Another advantage of the present invention is that identification of back linked relationships through chains of relevant "expert" user sub-populations enables a direct reinforcing of associations of high predictive significance. The relative reinforcement of associations is used to increase the predictive weight of the corresponding association rules of the expert database.
A further advantage of the present invention is that the system and processes of generating sets of association rules for expert databases are autonomous based on an established set of analysis parameters. Preferably, these analysis parameters set thresholds for the detailed analysis procedure largely based on relatively non-critical empirical examinations of the source transaction data records, the nature of the items transacted, and the number of users identified in the pool of transaction data records.