This invention relates generally to data mining, and more specifically, to methods and a framework for constraint-based activity mining (CMAP).
In many applications, including military surveillance, scientific data analysis, manufacturing processes, and business intelligence, human and/or machine activities have been recorded and analyzed. Often, discovery of recurrent patterns from such activities will provide invaluable insights and enable effective actions in these application domains.
Early studies, such as association rule mining and its variations, often assume that all data is stored in a single data table and hence ignore any complex structures among the data. Recently, the data mining community has recognized the need to discover patterns from multiple relational tables, since many datasets have been saved in relational databases for a long time.
Different approaches have been taken to discover such patterns. A generic approach is to consider a pattern as a logic clause. However, it is very challenging to develop an efficient algorithm to discover patterns of such generic form. Researchers have addressed this challenge by restricting the allowed forms of patterns and designing special algorithms to discover patterns of the restricted form.
A very common technique is to use a “mode” declaration, first introduced in PROGOL. This often reduces the valid pattern space significantly, for example, by up to a few orders of magnitude. However, a mode satisfaction test often depends on the order of pattern elements, which is usually not significant in determining the semantics of a pattern. For example, a mode on a close predicate can require any atom on close to use only variables introduced in predicates preceding it. This mode is satisfied by Example 2 but not by Example 3 (see Table 1 below), even though the two patterns are equivalent. Hence, the mode restriction must be carefully taken into account when designing the mining algorithms.
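The order dependence of such a mode test can be illustrated with a minimal sketch. The atom representation and the function name satisfies_close_mode are assumptions made for illustration only; they are not taken from PROGOL or from any tool discussed here. The two bodies are those of Example 2 and Example 3 in Table 1, with the inequality omitted.

```python
from typing import List, Tuple

# An atom is (predicate name, argument variables); a body is an ordered list of atoms.
Atom = Tuple[str, Tuple[str, ...]]


def satisfies_close_mode(body: List[Atom]) -> bool:
    """Return True if every `close` atom uses only variables that were
    introduced by atoms appearing earlier in the body (hypothetical mode)."""
    seen = set()
    for predicate, args in body:
        if predicate == "close" and not set(args) <= seen:
            return False
        seen.update(args)
    return True


# Body of Example 2: close comes after the two fly atoms.
example_2 = [
    ("fly", ("P", "A1", "S1", "A2", "E1")),
    ("fly", ("P", "A2", "S2", "A3", "E2")),
    ("close", ("E1", "S2")),
]

# Body of Example 3: close is listed first, before its variables are introduced.
example_3 = [
    ("close", ("T1", "T2")),
    ("fly", ("P", "A1", "T2", "A2", "T3")),
    ("fly", ("P", "A3", "T4", "A1", "T1")),
]

print(satisfies_close_mode(example_2))  # True
print(satisfies_close_mode(example_3))  # False
```

Moving the close atom of Example 3 to the end of the list makes the same test succeed, which is why a mining algorithm that relies on modes has to control the order in which atoms are generated.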
Often, the mode declaration also specifies the data type for predicate arguments. This data type specification further reduces the pattern search space. For example, the same variable in a pattern cannot assume two different types. In addition, a constant parameter in a pattern should have the correct type as specified by the corresponding predicate.
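As a rough sketch of how a type specification prunes candidates, the following check rejects a pattern in which one variable would need two different types. The SIGNATURES table and the type names are hypothetical and chosen only to match the fly and close predicates used in Table 1; constants are not handled in this sketch.

```python
from typing import Dict, List, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]

# Hypothetical type signatures: one type name per argument position.
SIGNATURES: Dict[str, Tuple[str, ...]] = {
    "fly": ("person", "airport", "time", "airport", "time"),
    "close": ("time", "time"),
}


def infer_types(body: List[Atom]) -> Optional[Dict[str, str]]:
    """Return a variable-to-type map, or None if the body is ill-typed
    (wrong arity, or a variable used at two different types)."""
    types: Dict[str, str] = {}
    for predicate, args in body:
        signature = SIGNATURES[predicate]
        if len(args) != len(signature):
            return None
        for var, expected in zip(args, signature):
            # setdefault records the first type seen and returns it afterwards.
            if types.setdefault(var, expected) != expected:
                return None
    return types


well_typed = [("fly", ("P", "A1", "S1", "A2", "E1")), ("close", ("E1", "S1"))]
ill_typed = [("fly", ("P", "A1", "S1", "A2", "E1")), ("close", ("A2", "S1"))]

print(infer_types(well_typed) is not None)  # True
print(infer_types(ill_typed))               # None: A2 would be both airport and time
```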
One known data mining tool, WARMR, adapts a mode constraint to discover frequent patterns in first-order logic form. Specifically, a pattern is a conjunction of positive literals. Even though WARMR exploits the mode constraints, which are sometimes referred to as bias, it does not perform or scale well due to the generic pattern formulation. On the other hand, there are some other specialized algorithms, e.g., to discover sequential patterns which are similar to Example 5 (if the starting and ending time of each action are collapsed into a single time point), or to discover sub-graphs which are similar to Example 1 (if each airport is considered a node and each fly atom an edge in a graph). Such specialized algorithms may be reasonably efficient, but the forms of restrictions are built into the algorithms and cannot be extended to handle other forms.
WARMR first introduces the concept of a multi-relational activity pattern, called “query”, as an extension to association rules. It uses a level-wise refinement framework similar to the APRIORI algorithm used in association rule mining. The major difference in the WARMR algorithm lies in the generation of new candidate patterns from existing ones. It uses the typical logic refinement operations: unify two variables, replace one variable by a constant, or add a new atom (a pattern element).
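The three refinement operations named above can be sketched as follows. This is an illustrative sketch only, not WARMR's actual implementation: the list-of-atoms representation, the function names, and the convention that constants are written as quoted lowercase strings are all assumptions.

```python
from typing import List, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Pattern = List[Atom]


def substitute(pattern: Pattern, old: str, new: str) -> Pattern:
    """Return a copy of the pattern with every occurrence of `old` replaced by `new`."""
    return [(pred, tuple(new if a == old else a for a in args))
            for pred, args in pattern]


def unify_variables(pattern: Pattern, keep: str, drop: str) -> Pattern:
    """Refinement 1: unify two variables by renaming `drop` to `keep`."""
    return substitute(pattern, drop, keep)


def bind_constant(pattern: Pattern, var: str, constant: str) -> Pattern:
    """Refinement 2: replace a variable by a constant."""
    return substitute(pattern, var, constant)


def add_atom(pattern: Pattern, atom: Atom) -> Pattern:
    """Refinement 3: add a new atom; the input pattern is left unchanged."""
    return pattern + [atom]


base = [("stop", ("F", "A1")), ("airport", ("A1", "R1"))]
step1 = add_atom(base, ("region", ("R1", "K")))
step2 = bind_constant(step1, "K", "'metro'")
print(step2[-1])  # ('region', ('R1', "'metro'"))

events = [("occur", ("E1", "T1")), ("occur", ("E2", "T2"))]
print(unify_variables(events, "T1", "T2"))
# [('occur', ('E1', 'T1')), ('occur', ('E2', 'T1'))]
```

In a level-wise search, each candidate at one level is refined by these operations to produce the candidates of the next level, which is where the large number of generated patterns comes from.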
WARMR requires atoms to satisfy mode constraints, which in turn restrict the use of constants or variables in each argument. This helps to reduce the search space significantly. Unfortunately, due to the intrinsically large search space (note that an association rule is a degenerate pattern using atoms of a single predicate with a single constant argument), WARMR can only handle very small data sets (with a small number of predicates and data records).
Another data mining tool, FARMER, improves WARMR with a more efficient algorithm. However, its assumption of unique object identifiers (OIs) in patterns prevents FARMER from discovering many interesting patterns. An improved version of FARMER relaxes the OI assumption (to a weak OI assumption) but sacrifices the efficiency of the original algorithm to some extent. However, even the weak OI assumption is overly restrictive. For instance, and again referring to Table 1, Example 7 cannot be discovered by FARMER under the OI assumption. Example 7 and Example 8 cannot be discovered at the same time using FARMER under the weak OI assumption (since R has to be specified as OI to discover Example 8, and the specification has to be reversed to discover Example 7). In addition, neither WARMR nor FARMER considers other constraints.
FARMER improves WARMR by significantly reducing the number of generated candidate patterns, under the assumption that variables in a pattern are object identities (OIs). Two variables in the same pattern cannot take the same value assignment, and a variable cannot be assigned to a constant in the same pattern. For example, in the pattern occur(E1, T1), occur(E2, T2), close(T1, T2), the occurrence times T1 and T2 of the two events cannot be the same (even though they must be close to each other) due to the OI requirement. With this OI assumption, FARMER uses a much simpler refinement operation, that is, FARMER always adds a new atom to the existing pattern in order to obtain a candidate pattern. Combined with the mode constraints, the refinement step in FARMER will generate far fewer redundant patterns than the WARMR algorithm. In addition, although not explicitly mentioned, FARMER assumes that all atoms in the pattern are “connected” through common variables shared by atoms. It actually exploits this assumption by adding only atoms that use at least one variable in the existing pattern.
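The effect of the OI assumption on what can be matched may be sketched with a small backtracking matcher. This is an illustrative sketch, not FARMER's algorithm: the function find_matches, the convention that variables start with an uppercase letter, and the fact data (flight f1 stopping at two airports in the same metropolitan region) are all made up for the example. The pattern is the body of Example 7 in Table 1 without the inequality.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]


def is_variable(term: str) -> bool:
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def find_matches(pattern: List[Atom], facts: List[Atom], oi: bool) -> List[Dict[str, str]]:
    """Return all variable bindings that satisfy the pattern. With oi=True, two
    variables may not share a value and no variable may take the value of a
    constant that appears in the pattern."""
    constants = {t for _, args in pattern for t in args if not is_variable(t)}

    def bind(term: str, value: str, trial: Dict[str, str]) -> bool:
        if not is_variable(term):
            return term == value
        if term in trial:
            return trial[term] == value
        if oi and (value in trial.values() or value in constants):
            return False
        trial[term] = value
        return True

    def extend(remaining: List[Atom], binding: Dict[str, str]):
        if not remaining:
            yield binding
            return
        (pred, args), rest = remaining[0], remaining[1:]
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            trial = dict(binding)
            if all(bind(t, v, trial) for t, v in zip(args, fact_args)):
                yield from extend(rest, trial)

    return list(extend(pattern, {}))


# Made-up data: one flight stopping at two airports of the same region.
facts = [
    ("stop", ("f1", "jfk")), ("stop", ("f1", "lga")),
    ("airport", ("jfk", "nyc")), ("airport", ("lga", "nyc")),
    ("region", ("nyc", "metro")),
]
example_7 = [
    ("stop", ("F", "A1")), ("airport", ("A1", "R1")), ("region", ("R1", "metro")),
    ("stop", ("F", "A2")), ("airport", ("A2", "R2")), ("region", ("R2", "metro")),
]

print(len(find_matches(example_7, facts, oi=False)))  # 4
print(len(find_matches(example_7, facts, oi=True)))   # 0
```

Under OI, R1 and R2 cannot both be bound to the same region, so this flight is never counted as support for the pattern, which illustrates why a pattern such as Example 7 is out of reach under that assumption.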
TABLE 1

Example 1. A person takes a round-trip flight:
    person(P) → fly(P, A1, T1, A2, T2), fly(P, A2, T3, A1, T4), A1 ≠ A2.
    Note: Here, fly provides one hop flight information including the person who flies, the departure airport and time, as well as the arrival airport and time.

Example 2. A person takes a connected flight:
    person(P) → fly(P, A1, S1, A2, E1), fly(P, A2, S2, A3, E2), close(E1, S2), A1 ≠ A3.

Example 3. A person takes a connected flight:
    person(P) → close(T1, T2), fly(P, A1, T2, A2, T3), fly(P, A3, T4, A1, T1), A1 ≠ A3.
    Note: This pattern is equivalent to Example 2, since the order of the pattern elements or the name of the variables does not change the meaning of the pattern.

Example 4. A person takes a connected flight:
    person(P) → fly(P, A2, S2, A3, E2), fly(P, A1, S1, A2, E2), close(S2, E1), A1 ≠ A3.
    Note: This pattern is equivalent to Example 2, since close is a symmetric relation, i.e., close(S2, E1) is the same as close(E1, S2).

Example 5. A person takes a sequence of actions:
    person(P) → fly(P, A1, S1, A2, E1), drive(P, L1, S2, L2, E2), lodge(P, L3, S3, E3), before(E1, S2), before(E2, S3).
    Note: Here L is the location in drive and lodge.

Example 6. A frequent flier:
    person(P) → fly(P, S, E), count(S)/(max(S) − min(S) + 1) > 6.
    Note: Here max and min return the maximum and minimum year for all the flights (starting time) a person takes, and count returns the number of distinct flights (each has a different starting time).

Example 7. A flight stops in two metropolitan airports:
    flight(F) → stop(F, A1), airport(A1, R1), region(R1, ‘metro’), stop(F, A2), airport(A2, R2), region(R2, ‘metro’), A1 ≠ A2.
    Note: The two metropolitan regions can be the same or different in this pattern.

Example 8. A flight stops in two airports of two different metropolitan regions:
    flight(F) → stop(F, A1), airport(A1, R1), region(R1, ‘metro’), stop(F, A2), airport(A2, R2), region(R2, ‘metro’), A1 ≠ A2, R1 ≠ R2.
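The aggregate condition of Example 6 can be worked through with a short sketch. The function name is_frequent_flier and the use of calendar dates for the starting time S are assumptions for illustration; the flight dates below are made-up data.

```python
from datetime import date
from typing import Iterable


def is_frequent_flier(start_dates: Iterable[date], threshold: float = 6) -> bool:
    """Evaluate count(S) / (max(S) - min(S) + 1) > threshold, where count is the
    number of distinct starting times and max/min are taken over their years."""
    distinct = set(start_dates)
    if not distinct:
        return False
    years = [d.year for d in distinct]
    span = max(years) - min(years) + 1
    return len(distinct) / span > threshold


# 14 distinct flights over 2 calendar years: 14 / 2 = 7 > 6.
busy = [date(2005, m, 1) for m in range(1, 8)] + [date(2006, m, 1) for m in range(1, 8)]
# 12 distinct flights over 2 calendar years: 12 / 2 = 6, which is not > 6.
moderate = [date(2005, m, 1) for m in range(1, 7)] + [date(2006, m, 1) for m in range(1, 7)]

print(is_frequent_flier(busy))      # True
print(is_frequent_flier(moderate))  # False
```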