The present invention relates in general to data mining. In particular, the present invention relates to an input data structure for transactional information.
Data mining refers in general to data-driven approaches for extracting information from input data. Other approaches for extracting information from input data are typically hypothesis driven, where a set of hypotheses is proven true or false in view of the input data.
The amount of input data may be huge, and therefore data mining techniques typically need to consider how to effectively process large amounts of data. Consider manufacturing of products as an example. There, the input data may include various pieces of data relating to the origin and features of components, the processing of the components in a manufacturing plant, and how the components have been assembled together. The aim of data mining in the context of manufacturing may be to resolve problems relating to quality analysis and quality assurance. Data mining may be used, for example, for root cause analysis, for early warning systems within the manufacturing plant, and for reducing warranty claims. As a second example, consider various information technology systems. There, data mining may further be used for intrusion detection, system monitoring and problem analyses. Data mining also has various other uses, for example, in retail and services, where typical customer behavior can be analyzed, and in medicine and life sciences for finding causal relations in clinical studies.
Pattern detection is a data mining discipline, where the input data are sets of transactions where each transaction includes a set of items. The transactions may additionally be ordered. The ordering may be based on time, but alternatively any ordering can be defined. For example, each transaction may have been given a sequence number. Association rules are patterns describing how items occur within transactions. Sequence rules, on the other hand, refer to a certain sequence of item sets in sequential transactions.
Consider a set of items I={I1, I2, ..., Im}. Let D be a set of transactions, where each transaction T is a set of items belonging to I, T⊂I. A transaction T thus contains a set A of items in I if A⊂T. An association rule is an implication of the form A⇒B, where A⊂I, B⊂I, and A∩B=∅; A is called the body and B the head of the rule. The association rule A⇒B holds true in the transaction set D with a confidence c, if c% of the transactions in D that contain A also contain B. In other words, the confidence c is the conditional probability p(B|A), where p(S) is the probability of finding S as a subset of a transaction T in D. The rule A⇒B has support s in the transaction set D, when s% of the transactions in D contain A∪B. In other words, the support s is the probability of the union of items in set A and in set B occurring in a transaction.
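As an illustration only, the support and confidence defined above can be computed directly from a small transaction set. The following Python sketch is not taken from any cited method; the function names `support` and `confidence` and the sample items are hypothetical, and both values are returned as fractions between 0 and 1 rather than as percentages.

```python
def support(transactions, body, head):
    """Fraction of transactions that contain every item of body and head."""
    if not transactions:
        return 0.0
    union = frozenset(body) | frozenset(head)
    return sum(1 for t in transactions if union <= t) / len(transactions)


def confidence(transactions, body, head):
    """Fraction of the transactions containing body that also contain head."""
    body = frozenset(body)
    union = body | frozenset(head)
    body_count = sum(1 for t in transactions if body <= t)
    if body_count == 0:
        return 0.0  # rule body never occurs; confidence is taken as 0 here
    return sum(1 for t in transactions if union <= t) / body_count


# Hypothetical transaction set D with four transactions.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

s = support(D, {"bread"}, {"milk"})     # 2 of 4 transactions contain both
c = confidence(D, {"bread"}, {"milk"})  # 2 of 3 bread transactions contain milk
```

In this sample, the rule with body {bread} and head {milk} has a support of 50% and a confidence of about 67%.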
The aim in data mining is in general to accurately find all association rules and sequence rules meeting user-defined criteria. The user may define a minimum support or confidence for the rules, as very rare or loosely correlated events may not be of importance for some applications. The user may also be interested only in particular items and may want to search only for patterns containing at least one of these interesting items.
There are many techniques for determining association rules and sequence rules based on input data. Typically, the search for association rules and sequence rules is based on the generation of candidate patterns, which are then evaluated with respect to the input data. Those candidate patterns that are found to be suitable are then extended by adding new items to the rule, resulting in new, more complex candidate patterns.
As the amount of input data may be huge and the patterns may be complex, there is a need to efficiently organize the search through the candidate pattern space and the evaluation of candidate patterns in view of the data. The existing techniques may be classified into two classes of algorithms based on the way these techniques proceed through the candidate pattern space. Some filter criteria can be applied immediately, for example a defined minimum support: a candidate pattern that does not reach the minimum support can be discarded at once, because this filter criterion is inherited by its child patterns. Others, such as the minimum confidence, can only be applied to complete rules, which impedes their early application.
The first class of algorithms is the breadth-first search. In these algorithms, the search through the candidate pattern space is started from simple patterns having two items. All two-item patterns are first generated and evaluated with respect to the input data. Then all three-item patterns are generated and evaluated with respect to the input data, and so on. Typically each candidate pattern is evaluated against the input data transactions. Unevaluated candidate patterns are typically stored in memory. The input data, on the other hand, is typically not stored in memory but is read from the data source. An example of breadth-first search can be found in “Fast Algorithms for Mining Association Rules” by Rakesh Agrawal and Ramakrishnan Srikant, Proc. 20th Int. Conf. Very Large Data Bases (VLDB), 1994.
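The level-wise procedure described above may be sketched as follows. This is a simplified illustration and not the algorithm of the cited paper: the function name `frequent_itemsets` is hypothetical, only item sets (not complete rules or sequences) are generated, and the transactions are held in a Python list, whereas the techniques discussed above typically re-read the input data from the data source at each level.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Breadth-first search: all k-item patterns are evaluated before k+1."""
    n = len(transactions)
    if n == 0:
        return []

    def is_frequent(candidate):
        # One pass over all transactions per candidate pattern.
        return sum(1 for t in transactions if candidate <= t) / n >= min_support

    items = sorted({item for t in transactions for item in t})
    frequent_items = [i for i in items if is_frequent(frozenset([i]))]

    # Level 2: all two-item patterns built from frequent single items.
    level = {frozenset(pair) for pair in combinations(frequent_items, 2)}
    level = {c for c in level if is_frequent(c)}

    result = []
    k = 2
    while level:
        result.extend(sorted(level, key=sorted))
        k += 1
        # Extend surviving patterns by one item to get the next level.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Minimum support is inherited: drop candidates with an infrequent subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if is_frequent(c)}
    return result


# Hypothetical sample data.
sample = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "abc")]
patterns = frequent_itemsets(sample, 0.5)
```

With the sample data and a minimum support of 50%, the three two-item patterns survive, while the single three-item candidate occurs in only two of five transactions and is discarded.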
The second class of algorithms is the depth-first search. In these algorithms, sets of candidate patterns are evaluated by starting from a first seed candidate pattern and evaluating all its siblings before turning to the other candidate patterns. As an example of a depth-first search algorithm, consider the algorithm described in “Sequential Pattern Mining Using a Bitmap Representation” by Jay Ayres et al., Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 429-435. In this algorithm, the input data is converted into binary format and stored in memory. Active-data-record histories, which are used to maintain information about which data records (transactions) are relevant for a certain pattern, are also kept in memory.
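To illustrate the general idea of a depth-first search over input data held in binary form, consider the following Python sketch. It is a strong simplification made for illustration and is not the algorithm of the cited paper: it handles unordered item sets rather than sequences, the names `build_bitmaps` and `depth_first` are hypothetical, and each item is assumed to be represented by one integer whose bit j is set when transaction j contains the item.

```python
def build_bitmaps(transactions):
    """Map each item to an integer bitmap over the transaction indices."""
    bitmaps = {}
    for j, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << j)
    return bitmaps


def depth_first(transactions, min_count):
    """Return {pattern tuple: number of transactions containing it}."""
    bitmaps = build_bitmaps(transactions)
    items = sorted(bitmaps)
    found = {}

    def extend(prefix, prefix_bits, start):
        for idx in range(start, len(items)):
            # AND of bitmaps marks the transactions containing the longer pattern.
            bits = prefix_bits & bitmaps[items[idx]]
            count = bin(bits).count("1")
            if count >= min_count:
                pattern = prefix + (items[idx],)
                found[pattern] = count
                # Follow this pattern to its extensions before trying others.
                extend(pattern, bits, idx + 1)

    all_transactions = (1 << len(transactions)) - 1
    extend((), all_transactions, 0)
    return found


# Hypothetical sample data.
sample = [set("abc"), set("ab"), set("ac"), set("bc"), set("abc")]
found = depth_first(sample, 3)
```

Because every bitmap stays in memory, evaluating a candidate reduces to one bitwise AND and a bit count, at the price of the memory consumption discussed for this class of algorithms.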
These known data mining algorithms have drawbacks in certain situations. Depending on the amount of input data (especially on the average size of transactions) and on the size of the candidate pattern space, the breadth-first search may be slow, since many scans of the original data source are needed and since each candidate pattern needs to be evaluated against all transactions. The depth-first search, on the other hand, may run out of memory for large amounts of input data, or it may be slow when the input data is swapped to disk, because of the large number of evaluations against the input data.
Evaluation of candidate patterns with respect to the input data forms the core of data mining techniques designed to find patterns. The input data is accessed repeatedly for the evaluation of candidate patterns. Some existing solutions do not perform any pre-processing of the input data; this means that candidate patterns are evaluated with respect to the original input data. An example of this approach is the A-Priori algorithm, discussed in “Fast Algorithms for Mining Association Rules” by Rakesh Agrawal and Ramakrishnan Srikant mentioned above. Some methods pre-process the input data, for example, by replacing original item names, which may be text strings or many-digit integers, by smaller integers. An example of this approach is discussed in “Sequential Pattern Mining Using a Bitmap Representation” by Jay Ayres et al. mentioned above. Input data processed in this way consumes somewhat less storage than raw input data.
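The replacement of original item names by smaller integers can be illustrated with a short Python sketch. The function name `encode_transactions` and the sample item names are hypothetical; identifiers are assumed to be assigned consecutively in order of first appearance, which is only one of several possible numbering schemes.

```python
def encode_transactions(transactions):
    """Replace item names by small consecutive integers.

    Returns the encoded transactions and the name-to-integer dictionary,
    which is needed to translate found patterns back to item names.
    """
    item_to_id = {}
    encoded = []
    for transaction in transactions:
        ids = []
        for item in transaction:
            if item not in item_to_id:
                item_to_id[item] = len(item_to_id)  # next unused integer
            ids.append(item_to_id[item])
        encoded.append(sorted(ids))
    return encoded, item_to_id


# Hypothetical raw input with long textual item names.
raw = [
    ["component-4711", "supplier-A"],
    ["supplier-A", "component-0815"],
]
encoded, dictionary = encode_transactions(raw)
```

Each item name is stored only once, in the dictionary, and every occurrence in a transaction is reduced to a small integer.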
Currently, there are no efficient solutions for compressing input data. Compressed input data would require less storage space, and could thus allow larger amounts of input data to be subject to data mining. However, compression of input data may cause difficulties in the evaluation of the candidate patterns. There is thus a need for an input data format that overcomes at least some of the above-mentioned problems.