1. Field of the Invention
The present invention relates to data mining technology. More particularly, it relates to the area of mining for association rules and/or sequential patterns within data assets.
2. Description and Disadvantages of Prior Art
Over the past two decades there has been a huge increase in the amount of data being stored in databases as well as in the number of database applications in business and the scientific domain. This explosion in the amount of electronically stored data was accelerated by the success of the relational model for storing data and by the development and maturing of data retrieval and manipulation technologies. While technology for storing the data developed fast to keep up with the demand, little attention was paid to developing software for analyzing the data until recently, when companies realized that hidden within these masses of data was a resource that was being ignored. Database Management Systems used to manage these data sets at present only allow the user to access information explicitly present in the databases, i.e. the data. The data stored in the database is only a small part of the 'iceberg of information' available from it. Contained implicitly within this data is knowledge about a number of aspects of a company's business, waiting to be harnessed and used for more effective business decision support. This extraction of knowledge from large data sets is called Data Mining or Knowledge Discovery in Databases and is defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data. The obvious benefits of Data Mining have resulted in a lot of resources being directed towards its development.
Data mining involves the development of tools that analyze large databases to extract useful information from them. As an application of data mining, customer purchasing patterns may be derived from a large customer transaction database by analyzing its transaction records. Such purchasing habits can provide invaluable marketing information. For example, retailers who know consumer purchase patterns can create more effective store displays and control inventory more effectively than would otherwise be possible. As a further example, catalog companies can conduct more effective mass mailings if they know that, given that a consumer has purchased a first item, the same consumer can be expected, with some degree of probability, to purchase a particular second item within a particular time period after the first purchase.
Data mining uses several techniques to find pieces of knowledge in large amounts of data. Two of these techniques are the so-called mining for association rules and the mining for sequential patterns.
Identifying association rules from a large database of transactions is an essential part of data mining. An association rule is an expression of the form X→Y, where X and Y are sets of items. In the retail domain, the data to be mined typically consist of transactions, where each transaction is characterized by a set of items. For example, the database may contain customers' sales transactions on shoes and jackets. A possible association rule may be of the form "30 percent of transactions that contain jackets also contain shoes; 10 percent of all transactions contain both shoes and jackets". The 30 percent value is referred to as the confidence of the rule, while the 10 percent value is the support of the rule. The task of mining association rules involves finding all the association rules from the transactions that satisfy certain user-specified minimum support and confidence constraints.
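The support and confidence definitions above can be illustrated with a minimal sketch. The function names and the sample data are assumptions made for illustration only; they do not come from any particular mining product, and the data set is constructed so that it reproduces the 10 percent and 30 percent figures of the jackets-and-shoes example.

```python
def count_containing(transactions, itemset):
    """Count the transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, x, y):
    """Fraction of all transactions that contain both X and Y."""
    return count_containing(transactions, set(x) | set(y)) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        return 0.0  # rule X -> Y is undefined when X never occurs
    return count_containing(transactions, set(x) | set(y)) / n_x


# Hypothetical data: 30 transactions, 10 of them with jackets,
# and 3 of those 10 also with shoes.
transactions = (
    [frozenset({"jacket", "shoes"})] * 3
    + [frozenset({"jacket"})] * 7
    + [frozenset({"hat"})] * 20
)

print(support(transactions, {"jacket"}, {"shoes"}))
print(confidence(transactions, {"jacket"}, {"shoes"}))
```

For the rule jacket→shoes this sample yields a support of 3/30 and a confidence of 3/10. A mining run would keep the rule only if both values reach the user-specified minimum thresholds.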
Conceptually, the problem may be viewed as finding the association rules from a relational table of records. Each record may represent a transaction, as in the case of a retail transaction database, or other data items in the database. Each record has one or more attributes where each attribute corresponds to an item of the transaction.
Another essential part of data mining relates to the identification of sequential patterns. This involves rules that are based on temporal data. Suppose we have a database of natural disasters. If we conclude from such a database that whenever there was an earthquake in Los Angeles, Mt. Kilimanjaro erupted the next day, such a rule would be a sequence rule. Such rules are useful for making predictions, which in turn can be used for making market gains or for taking preventive action against natural disasters. The factor that differentiates sequence rules from other rules is the temporal factor.
Other applications of data mining include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns, among many others. Typically the databases involved in these applications are very large. It is imperative, therefore, to have fast algorithms for this task.
Although several methods of mining for association rules and mining for sequential patterns have been proposed, only methods derived from the so-called APRIORI approach (see R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proceedings of the 20th VLDB Conference, 1994) have been proven to be efficient enough to process large data volumes.
The APRIORI approach depends on a special format of the data called transaction format. In the case of associations the transaction format conceptually consists of only two columns, namely a "transaction identifier" and an "item identifier". In the case of sequential patterns it conceptually consists of three columns, namely a "transaction group identifier", a "transaction identifier", and an "item identifier". A much more serious drawback of the APRIORI approach according to the current state of the art is that it requires all of the "item identifiers" to relate to the same item type. As a result the APRIORI approach is only capable of deriving association rules or sequences between items of the same type. If, for instance, the item identifier relates to a certain product bought by a certain customer, the APRIORI technique would be capable of deriving only rules of the form: if a customer buys PRODUCT1 then he will also buy PRODUCT2 with a probability of X%. The APRIORI approach would not be able to include in its generated rules items of other types, such as the gender, the age, the profession, the place of residence or other aspects of the customers. It can be expected that once a multitude of different item types can be included in the derivation of rules, the importance of the derived rules can be significantly increased, as they would be much more selective in nature.
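The two-column and three-column transaction formats described above may be pictured as follows. This is a minimal sketch only: the row values, the function names, and the assumption that transaction identifiers sort into temporal order within a group are all hypothetical and serve purely as illustration.

```python
from collections import defaultdict

# Two-column transaction format for associations:
# (transaction identifier, item identifier)
association_rows = [
    (1, "PRODUCT1"),
    (1, "PRODUCT2"),
    (2, "PRODUCT1"),
]

# Three-column transaction format for sequential patterns:
# (transaction group identifier, transaction identifier, item identifier)
sequence_rows = [
    ("CUSTOMER_A", 2, "PRODUCT2"),
    ("CUSTOMER_A", 1, "PRODUCT1"),
    ("CUSTOMER_B", 1, "PRODUCT1"),
]


def group_by_transaction(rows):
    """Collect the item identifiers of each transaction into one set."""
    baskets = defaultdict(set)
    for transaction_id, item_id in rows:
        baskets[transaction_id].add(item_id)
    return dict(baskets)


def group_by_transaction_group(rows):
    """Build, per group, the list of item sets ordered by transaction identifier.

    Assumption: sorting by transaction identifier yields the temporal order.
    """
    groups = defaultdict(lambda: defaultdict(set))
    for group_id, transaction_id, item_id in rows:
        groups[group_id][transaction_id].add(item_id)
    return {
        group_id: [transactions[tid] for tid in sorted(transactions)]
        for group_id, transactions in groups.items()
    }


print(group_by_transaction(association_rows))
print(group_by_transaction_group(sequence_rows))
```

Note that every item identifier in these rows is a product, which mirrors the single-item-type restriction discussed above.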
The invention is based on the objective of providing a computerized method for data mining for association rules and/or sequential patterns of a multitude of records, wherein the multitude of records comprises transaction-items of different item-types.
The objectives of the invention are solved by the independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims.
The invention relates to a computerized method for data mining for association rules and/or sequential patterns of a multitude of records. The invention is applicable to records comprising a transaction-identification and at least one transaction-item with a corresponding item-type, wherein said multitude of records comprises transaction-items of different item-types. The proposed method further comprises a preprocessing-step for transforming each record into one or more transaction-records of transaction-format. According to said transaction-format, a transaction-record is generated for each transaction-item in said record, and said transaction-record comprises at least the transaction-identification of said record and an encoded transaction-item encoding said transaction-item and its corresponding item-type into one value. Finally, said method comprises a mining-step wherein a state-of-the-art data-mining technique is applied to said transaction-records for data mining for association rules and/or sequential patterns.
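The preprocessing-step described above might be sketched as follows. The text only requires that the transaction-item and its item-type be encoded into one value; the concrete encoding used here (item-type, a separator, then the item value, as a single string), the field names, and the function name are assumptions chosen for illustration and are not prescribed by the description.

```python
def to_transaction_records(record, id_field="transaction_id", separator="="):
    """Transform one record into transaction-records of transaction-format.

    One (transaction-identification, encoded transaction-item) pair is
    produced per transaction-item. The 'type=value' string encoding is a
    hypothetical choice; any scheme that merges item-type and item into a
    single value would serve the same purpose.
    """
    transaction_id = record[id_field]
    return [
        (transaction_id, f"{item_type}{separator}{item}")
        for item_type, item in record.items()
        if item_type != id_field and item is not None
    ]


# Hypothetical record with transaction-items of three different item-types.
record = {
    "transaction_id": 17,
    "product": "PRODUCT1",
    "gender": "F",
    "profession": "engineer",
}

for transaction_record in to_transaction_records(record):
    print(transaction_record)
```

Because each encoded value is an ordinary item identifier in the two-column transaction format, the resulting transaction-records can be handed unchanged to an existing mining technique, which may then produce rules that mix item-types, for example relating a profession to a product.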
The current invention extends data mining technology according to the current state of the art so that it also supports mining for association rules and/or sequential patterns based on data assets comprising items of a multitude of item types. While current activities in this area of technology concentrate on the search for new and advanced mining algorithms, the current invention achieves this goal by features pointing in a completely different and surprising direction. Instead of proposing a new mining algorithm, the current invention suggests a new pre-processing step which transforms the data to be mined into a new encoding scheme. Multiple fields can thus be defined as item fields for efficient mining for association rules and/or sequential patterns, and no new algorithm needs to be introduced merely because the data is not in transaction format. Mining algorithms that have proven very efficient and have been optimized over recent years therefore remain applicable.