The present invention relates in general to data mining, and in particular to finding deviations in data.
Data mining refers in general to data-driven approaches for extracting information from input data. Other approaches for extracting information from input data are typically hypothesis-driven, where a set of hypotheses is proven true or false in view of the input data.
The amount of input data may be huge, and therefore data mining techniques typically need to consider how to effectively process large amounts of data. Consider manufacturing of products as an example. There, the input data may include various pieces of data relating to the origin and features of components, the processing of the components in a manufacturing plant, and how the components have been assembled. The aim of data mining in the context of manufacturing may be to resolve problems relating to quality analysis and quality assurance. Data mining may be used, for example, for root cause analysis, for early warning systems within the manufacturing plant, and for reducing warranty claims. As a second example, consider various information technology systems. There, data mining may be used for intrusion detection, system monitoring and problem analysis. Data mining also has various other uses, for example, in retail and services, where typical customer behavior can be analyzed, and in medicine and life sciences for finding causal relations in clinical studies.
Pattern detection is a data mining discipline, where the input data consists of sets of transactions and each transaction consists of a set of items. The transactions may additionally be ordered. The ordering may be based on time, but alternatively any ordering can be defined. For example, each transaction may have been given a sequence number. Association rules are patterns describing how items occur within transactions. Sequence rules, on the other hand, refer to a certain sequence of item sets in sequential transactions.
Consider a set of items I={I1, I2, . . . Im}. Let D be a set of transactions, where each transaction T is a set of items belonging to I, T⊂I. A transaction T thus contains a set A of items in I if A⊂T. An association rule is an implication of the form A ⇒ B, where A⊂I, B⊂I, and A∩B=Ø; A is called the body and B the head of the rule. The association rule A ⇒ B holds true in the transaction set D with a confidence c, if c % of the transactions in D that contain A also contain B. In other words, the confidence c is the conditional probability p(B|A), where p(S) is the probability of finding S as a subset of a transaction T in D. The rule A ⇒ B has support s in the transaction set D, when s % of the transactions in D contain A∪B. In other words, the support s is the probability of the union of items in set A and in set B occurring in a transaction.
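As an illustration of the definitions above, the following minimal Python sketch computes the support and confidence of a rule A ⇒ B over a small transaction set. The function names `support` and `confidence` and the sample transactions are assumptions made for this example only. Values are returned as fractions rather than percentages, and the subset test is non-strict.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               body: FrozenSet[str],
               head: FrozenSet[str]) -> float:
    """Fraction of the transactions containing `body` that also contain `head`."""
    with_body = [t for t in transactions if body <= t]
    if not with_body:
        return 0.0
    hits = sum(1 for t in with_body if head <= t)
    return hits / len(with_body)


# Hypothetical transaction set D with four transactions.
D = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]

A = frozenset({"bread"})   # rule body
B = frozenset({"butter"})  # rule head

rule_support = support(D, A | B)        # 2 of 4 transactions contain A∪B
rule_confidence = confidence(D, A, B)   # 2 of the 3 transactions with A contain B
```

Here the support is computed on the union of body and head, matching the definition of s, while the confidence restricts the denominator to the transactions that contain the body, matching p(B|A).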
The aim in data mining is in general to accurately find all association rules and sequence rules meeting user-defined criteria. Often the user defines a minimum support or confidence for the rules, as very rare or loosely correlated events may not be of importance for some applications. The user may also be interested only in particular items and may want to search only for patterns containing at least one of these interesting items.
In some cases, however, it is important to find irregularities or deviations in input data. For example, finding irregularities or deviations is needed for cleansing data or for detecting unusual behavior, which can be an indicator of fraud. The search for irregularities is typically based on finding regularities first, and then detecting deviations from the regularities. For example, patterns that have a very high confidence are interpreted as regularities. Data records that are not in accordance with the high confidence patterns are then interpreted as exceptions. A further example of finding irregularities is disclosed in U.S. Pat. No. 6,954,756, where a known technique for generating classification trees is used. A classification tree and underlying classification rules are generated, for example based on training data. The data to be analyzed is then classified using the classification tree. A purity value is a measure of the degree of conformity of all records associated with a leaf node (certain class). Records that are associated with a leaf node having high purity but that are not in accordance with the underlying classification rule are then interpreted as exceptions.
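The confidence-based approach described above can be sketched as follows. This is a minimal illustration, not the method of any cited patent: the function name `find_exceptions`, the threshold values, and the sample data are assumptions chosen for the example. A rule is treated as a regularity only if its confidence reaches the given minimum, and the transactions that contain the body but not the head are then reported as exceptions.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def find_exceptions(transactions: List[Transaction],
                    body: FrozenSet[str],
                    head: FrozenSet[str],
                    min_confidence: float) -> List[Transaction]:
    """Return transactions violating body => head, but only if the rule
    is a regularity, i.e. its confidence is at least `min_confidence`."""
    matching = [t for t in transactions if body <= t]
    if not matching:
        return []
    conforming = sum(1 for t in matching if head <= t)
    if conforming / len(matching) < min_confidence:
        # Rule is not regarded as a regularity, so no exceptions are reported.
        return []
    return [t for t in matching if not head <= t]


# Hypothetical data: nine transactions follow the pattern, one deviates.
D = [frozenset({"card", "pin"})] * 9 + [frozenset({"card"})]

body = frozenset({"card"})
head = frozenset({"pin"})

# Confidence is 9/10, so the rule counts as a regularity at threshold 0.8.
exceptions = find_exceptions(D, body, head, min_confidence=0.8)

# With a stricter threshold the rule is discarded and nothing is reported.
none_found = find_exceptions(D, body, head, min_confidence=0.95)
```

The second call shows the dependence on the threshold: the deviating transaction is reported only when the rule itself passes the minimum confidence.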
Validation of candidate patterns against the input data is very resource intensive. In both examples above, a minimum threshold for confidence or purity is needed for limiting the number of pattern candidates among which the high confidence/purity patterns are searched for, in order to limit the amount of computing resources and/or time needed for the calculations. It is thus not possible to find exceptions to patterns whose confidence/purity is not above the threshold. Furthermore, exceptions to patterns with disjunctive rule heads cannot be tracked using the above methods based on high purity/confidence. A disjunctive rule head means that a disjunction of several items appears on the right-hand side of the rule, e.g. A ⇒ (B1 or B2 or B3).
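The limitation regarding disjunctive rule heads can be illustrated with a small Python sketch. It assumes, for the purpose of this example only, that the confidence of a rule with a disjunctive head is the fraction of transactions containing the body that contain at least one of the head alternatives. The function name `confidence_disjunctive` and the sample data are likewise hypothetical.

```python
from typing import FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def confidence_disjunctive(transactions: List[Transaction],
                           body: FrozenSet[str],
                           alternatives: Sequence[str]) -> float:
    """Fraction of transactions containing `body` that contain at least
    one item from `alternatives` (assumed reading of a disjunctive head)."""
    matching = [t for t in transactions if body <= t]
    if not matching:
        return 0.0
    hits = sum(1 for t in matching if any(item in t for item in alternatives))
    return hits / len(matching)


# Hypothetical data: A occurs with B1, B2 or B3 three times each, once with C.
D = (
    [frozenset({"A", "B1"})] * 3
    + [frozenset({"A", "B2"})] * 3
    + [frozenset({"A", "B3"})] * 3
    + [frozenset({"A", "C"})]
)

body = frozenset({"A"})

# Each single-item head has low confidence (3 of 10 transactions).
single = [confidence_disjunctive(D, body, [b]) for b in ("B1", "B2", "B3")]

# The disjunctive head covers 9 of the 10 transactions.
combined = confidence_disjunctive(D, body, ["B1", "B2", "B3"])
```

In this data set none of the rules with a single-item head would pass a high confidence threshold, so the transaction containing C would not be flagged, although it deviates from the disjunctive pattern.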
The existing methods for finding deviations or irregularities thus cannot detect all possible deviations or irregularities. It would therefore be desirable to provide improved mechanisms for efficiently detecting irregularities in data.