The invention relates to a method for identifying patterns, and in particular for identifying repeating patterns in sequential event streams.
Basic Terminology
We consider a stream of events, which are ordered according to some numerical criterion such as time, spatial position or relative proximity. Events are typed (i.e. each event is of a predetermined type), so that the event stream may be represented as a sequence of events each identified by a type identifier, or type-id, and a coordinate identifying the position of the event in the sequence; the event is said to occur at that position, which may for example be a time or a spatial position, or simply represent the ordering of the event relative to other events in the stream.
The following example represents a sequence of seven ordered events of four distinct types A, B, C and D, in which each event is represented as a (<type-id>, <position>) pair.
((A, 6), (B, 7), (C, 9), (D, 29), (A, 44), (B, 47), (C, 48))
Notice that events of types A, B and C each occur twice in the sequence, and that on both occasions their coordinates indicate that they are relatively close together. In the example, we use an integer to represent position, but any numerical value may be used.
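By way of illustration only, the example stream might be held in memory as follows. The names Event, type_id and position are assumptions made for this sketch and do not appear in the specification.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One typed event; the field names are illustrative assumptions."""
    type_id: str
    position: float


# The seven-event example stream from the text.
stream = [
    Event("A", 6), Event("B", 7), Event("C", 9), Event("D", 29),
    Event("A", 44), Event("B", 47), Event("C", 48),
]

# Events are ordered by their coordinate.
stream.sort(key=lambda e: e.position)

# Group the positions by type to see which types repeat.
positions_by_type = {}
for event in stream:
    positions_by_type.setdefault(event.type_id, []).append(event.position)
```

Here positions_by_type shows types A, B and C each occurring at two positions and type D at one.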
A repeating pattern, or simply a pattern, is defined as a set of types whose corresponding events repeatedly occur at around the same position. When these events do occur at around the same position, we say that the pattern occurs or is instantiated at that position, and that the events form an occurrence or instance of the pattern. Referring again to the previous example, if the three types A, B, and C constitute a pattern, then the event stream can be said to illustrate two instances of that pattern.
Previous approaches to the problem of searching data for repeating patterns have generally aimed to identify repeating temporal patterns, and have used the principle of moving a time window stepwise through the data and identifying events in the window at each position to reduce the problem to one of itemset counting (reference 1). Each position of the time window is thus used to define an itemset, in which each item is an event type. The itemsets are then processed through a standard itemset counting algorithm (reference 2). The frequent itemsets (itemsets occurring more than a specified number of times) are then declared to be the frequent repeating patterns. This approach has a number of drawbacks:
1. Successive positions of the moving time window overlap, leading to multiple counting of neighbouring event types. (If successive window positions do not overlap, pattern instances are too easily missed.)
2. The frequent itemsets generally exhibit a complex pattern of nesting and overlap, which can conceal the true nature of the patterns in the data stream.
3. The itemset counting algorithms require the user to specify a minimum frequency for the patterns to be found. Setting this parameter is extremely difficult in practice. Setting it too high results in patterns being missed. More importantly, setting this parameter too low results in a huge increase in the amount of computer time and memory required to perform the count.
4. Event streams containing several comparatively infrequently repeating patterns present particular problems, because of the need to use a small frequency threshold to catch such patterns. Itemset counting algorithms perform poorly when searching for low frequency patterns.
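The first drawback can be illustrated with a small sketch of the prior window-and-count approach applied to the example stream given earlier. The function name pair_counts, the half-open window [start, start + width), the integer positions and the chosen width and step values are all assumptions made for this illustration, and only pairs of types are counted rather than full itemsets.

```python
from collections import Counter
from itertools import combinations

stream = [("A", 6), ("B", 7), ("C", 9), ("D", 29),
          ("A", 44), ("B", 47), ("C", 48)]


def pair_counts(events, width, step):
    """Count type pairs over the itemsets of a stepped window.

    Each window covers the half-open range [start, start + width).
    Integer positions are assumed so that range() can be used.
    """
    counts = Counter()
    last = max(position for _, position in events)
    for start in range(0, last + 1, step):
        itemset = {t for t, p in events if start <= p < start + width}
        counts.update(combinations(sorted(itemset), 2))
    return counts


# Overlapping windows: the same A/B co-occurrence is seen several times.
overlapping = pair_counts(stream, width=5, step=1)

# Non-overlapping windows: the second A/B co-occurrence is split and missed.
disjoint = pair_counts(stream, width=5, step=5)
```

The stream contains two A/B co-occurrences, but with these settings the overlapping scan counts the pair six times, while the non-overlapping scan counts it once.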
The invention provides a method and a computer program for identifying repeating patterns in an event stream as defined in the appended independent claim, to which reference should now be made. Preferred or advantageous features of the invention are set out in dependent subclaims.
In a preferred aspect, the invention therefore provides a method for the detection of repeating patterns in sequential typed event streams, which advantageously makes use of the recognition that repeating patterns can be identified as clusters in an edge-weighted graph derived from a sliding or moving window scheme, together with the manner in which the edge weights are derived from the windows themselves.
Event types that form a repeating pattern tend to occur more frequently within a short distance of one another than with other event types. In a preferred embodiment, each event type is represented by a vertex in an edge-weighted graph. Our method then detects repeating patterns by examining the contents of a narrow fixed width window as it is moved, preferably in a single pass, through the event stream, and calculating edge weights in the graph to reflect the frequency with which types (vertices) co-occur in the moving window. Groups of types (vertices) that co-occur frequently emerge as clusters of higher edge weights in the graph, and can be detected by any standard graph clustering method.
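A minimal sketch of the single-pass edge weighting is given below. The specification does not fix the weighting rule at this point, so the sketch assumes one simple possibility: the window is anchored at each event in turn, and the weight of an edge is incremented by one for every pair of events of different types lying within the window width of each other. The function name edge_weights is hypothetical.

```python
from collections import defaultdict


def edge_weights(events, width):
    """Accumulate co-occurrence weights between event types.

    Assumed rule: each pair of events of different types whose positions
    differ by at most `width` adds 1 to the edge joining their types.
    """
    events = sorted(events, key=lambda e: e[1])
    weights = defaultdict(int)
    for i, (type_i, pos_i) in enumerate(events):
        for type_j, pos_j in events[i + 1:]:
            if pos_j - pos_i > width:
                break  # later events are even further away
            if type_i != type_j:
                weights[frozenset((type_i, type_j))] += 1
    return dict(weights)


stream = [("A", 6), ("B", 7), ("C", 9), ("D", 29),
          ("A", 44), ("B", 47), ("C", 48)]
weights = edge_weights(stream, width=5)
```

For the example stream with a width of 5, the edges A-B, A-C and B-C each receive weight 2, and type D acquires no edges.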
One option would be to use a cluster detection method based on a graph connectivity criterion, such as the method set out in Appendix 1 herein. See reference 3 for a definition of connectivity and any other graph-theoretic terms used in this document.
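Purely as a stand-in for illustration, and not the method of Appendix 1, clusters might be read off a weighted graph by discarding edges below a chosen weight and taking the connected components of what remains. The function name clusters_from_weights and the parameter min_weight are assumptions of this sketch. A single global threshold of this kind is cruder than a connectivity criterion and is used here only to keep the example short.

```python
def clusters_from_weights(weights, min_weight):
    """Return connected components of the graph restricted to heavy edges.

    `weights` maps a frozenset of two type-ids to an edge weight.
    This is a simplified stand-in, not the clustering of Appendix 1.
    """
    adjacency = {}
    for pair, weight in weights.items():
        if weight >= min_weight:
            a, b = tuple(pair)
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(adjacency[vertex] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


example_weights = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 2,
    frozenset(("B", "C")): 2,
    frozenset(("A", "D")): 1,
}
clusters = clusters_from_weights(example_weights, min_weight=2)
```

With these illustrative weights, the light A-D edge is dropped and the single cluster (A, B, C) remains.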
Once clusters, or patterns, of types have been identified, pattern instances may advantageously be located by re-scanning the event stream for each pattern in turn. For the rescanning process, the event stream is preferably edited to retain only those events corresponding to types in the pattern under investigation, to reduce processing time. As the window passes through these events, the pattern instances emerge as non-overlapping sets of contiguous pattern events.
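The rescanning step might be sketched as follows. The sketch assumes that a new window is opened at the first pattern event not yet assigned to an instance, and that every subsequent pattern event within the window width joins that instance. The function name locate_instances is hypothetical.

```python
def locate_instances(events, pattern, width):
    """Split the pattern's events into non-overlapping windowed groups."""
    # Edit the stream so that only events of types in the pattern remain.
    pattern_events = sorted(
        (e for e in events if e[0] in pattern), key=lambda e: e[1]
    )
    instances, current = [], []
    for event in pattern_events:
        # Close the current instance once the event falls outside its window.
        if current and event[1] - current[0][1] > width:
            instances.append(current)
            current = []
        current.append(event)
    if current:
        instances.append(current)
    return instances


stream = [("A", 6), ("B", 7), ("C", 9), ("D", 29),
          ("A", 44), ("B", 47), ("C", 48)]
instances = locate_instances(stream, pattern={"A", "B", "C"}, width=5)
```

For the example stream this yields two instances, one around positions 6 to 9 and one around positions 44 to 48, with the type D event edited out.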
A pattern instance is said to be complete if every type in the pattern is represented by a corresponding event. A pattern instance is said to be partial if some of the types are not so represented. The coverage of a pattern instance is defined as the percentage of the types in the pattern represented by events. The frequency of a pattern is defined as the number of times a pattern is instantiated in a sequence.
It will be appreciated that the method described above may advantageously detect partial instances of patterns. In a preferred embodiment of the invention, a coverage threshold may be specified, for example by a user, and pattern instances with a coverage below the threshold may be rejected. This enables a user to search for and identify partial patterns in a controlled way.
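The coverage definition and the coverage threshold might be expressed as in the following sketch. The function names coverage and filter_by_coverage are assumptions made for illustration.

```python
def coverage(instance, pattern):
    """Percentage of the pattern's types represented in the instance."""
    present = {type_id for type_id, _ in instance} & set(pattern)
    return 100.0 * len(present) / len(pattern)


def filter_by_coverage(instances, pattern, threshold):
    """Keep only instances whose coverage meets the threshold (in percent)."""
    return [i for i in instances if coverage(i, pattern) >= threshold]


pattern = {"A", "B", "C"}
complete = [("A", 6), ("B", 7), ("C", 9)]   # every type represented
partial = [("A", 44), ("B", 47)]            # type C is missing

kept = filter_by_coverage([complete, partial], pattern, threshold=75)
```

Here the complete instance has a coverage of 100% and the partial instance roughly 67%, so a threshold of 75% rejects the partial instance.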
The choice of window width is preferably under the user's control, and may depend on the nature of the event data being analysed. For example, for telephone call data where a search is being made to identify telephone calls between individuals which may be linked, a window width of half an hour to an hour might be appropriate. For financial data, where a search might be made for financial transactions which may be linked, the width might be greater, perhaps one to two days.
In some applications, events of different types may occur at markedly different rates. Consider, for example, an event stream consisting of a mixture of telephone call events and postal delivery events. Typically the telephone calls will be occurring frequently, whereas postal deliveries occur only once or perhaps twice a day. In order to capture a pattern of the form (Package Delivered to A, A calls B, A calls C), the following preferred embodiment of the invention may be used.
This embodiment allows sub-windows of different widths to be assigned to different types of event. The sub-window widths are chosen to match the characteristics of the event types to which they are applied. Separate event streams are also prepared, each containing only events of predetermined types, but the corresponding edge-weighted graph contains vertices corresponding to all event types. The sub-windows are then moved simultaneously, in parallel, through their own event streams. The contents of all the sub-windows are treated as if they came from a single window for the purposes of weighting the edges in the graph.
With the example of a mixture of telephone call events and postal delivery events mentioned above, we might choose to use two sub-windows, one set to a width of one day to capture postal delivery events and the other set to a width of one hour to capture telephone call events. If the telephone window at a particular position contains events of types A and B, while the postal sub-window at the same position contains an event of type C, then edge weighting in the graph would be applied to indicate not only a link between A and B but also links between A and C and between B and C.
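The two-sub-window example might be sketched as follows, with positions measured in hours. The sketch makes several assumptions not stated in the text: one window position is taken per distinct event position, the sub-windows are aligned at their start, each type is assigned its own sub-window width through the hypothetical mapping width_by_type, and each window position adds one to the weight of every pair of types it contains.

```python
from collections import defaultdict
from itertools import combinations


def sub_window_weights(events, width_by_type):
    """Weight edges using a per-type sub-window width, start-aligned."""
    weights = defaultdict(int)
    for start in sorted({p for _, p in events}):
        # A type is in the window if one of its events lies in that
        # type's own sub-window [start, start + width].
        in_window = {
            t for t, p in events if start <= p <= start + width_by_type[t]
        }
        for a, b in combinations(sorted(in_window), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Telephone call events of types A and B, and a postal event of type C.
events = [("A", 9.0), ("B", 9.5), ("C", 20.0)]

mixed = sub_window_weights(events, {"A": 1, "B": 1, "C": 24})
uniform = sub_window_weights(events, {"A": 1, "B": 1, "C": 1})
```

With a one-day postal sub-window, links A-C and B-C are recorded in addition to A-B. With a uniform one-hour window, only the A-B link is found.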
The sub-windows of a window may be aligned in various configurations. One possibility is a central alignment, in which sub-windows share a common mid-position. Alternatively sub-windows may be aligned at either end, or at some other position. Sub-window alignment may be selected depending on the type of data being investigated. For instance, in the example given above it may be desirable to search for patterns involving postal events following telephone events (Mr A phones Mr B, to tell him to post a package to Mr C). Aligning the start times of the two sub-windows may best achieve this.
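The alignment options might be captured by a small helper that computes the extent of a sub-window from a shared anchor position. The function name sub_window_bounds and the alignment labels "start", "centre" and "end" are assumptions made for this sketch.

```python
def sub_window_bounds(anchor, width, alignment):
    """Return (low, high) bounds of a sub-window relative to a shared anchor."""
    if alignment == "start":    # all sub-windows begin at the anchor
        return (anchor, anchor + width)
    if alignment == "centre":   # all sub-windows share a mid-position
        return (anchor - width / 2, anchor + width / 2)
    if alignment == "end":      # all sub-windows finish at the anchor
        return (anchor - width, anchor)
    raise ValueError(f"unknown alignment: {alignment!r}")


# A one-hour telephone sub-window and a 24-hour postal sub-window,
# both anchored at hour 12 and aligned at their start.
phone = sub_window_bounds(12, 1, "start")
postal = sub_window_bounds(12, 24, "start")
```

With start alignment, the postal sub-window extends 24 hours forward from the anchor, so postal events following a telephone event fall within it.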
It will be appreciated that in this embodiment it is not necessary to prepare multiple event streams as described, but that an equivalent effect would be obtained by moving two or more parallel sub-windows through the same event stream and counting within each sub-window in each position only events of types appropriate to that sub-window.
In summary, a preferred embodiment of the invention may advantageously use the following steps.
1. Select events for analysis, and sort into sequential order.
2. Identify the unique set of event types defined by the events.
3. Classify the event types for sub-window assignment (multiple sub-window case only).
4. Select a window width (or widths for the multiple sub-window case) appropriate to the event data under analysis.
5. Create a graph with vertices representing each event type, and with no edges (or edge weights) initially.
6. Move the window through the event stream, and add weighted edges to the graph accordingly.
7. Locate clusters in the final edge-weighted graph. The types in these clusters define the patterns found by the method.
8. For each pattern in turn, edit the event stream by selecting the events whose underlying types are in the pattern. These edited event streams are termed pattern event sets.
9. Locate pattern instances by moving the window through the pattern event sets and identifying non-overlapping windows where pattern events are present.
10. Reject instances where the coverage is below the requested threshold level.
11. Output the final set of patterns and pattern instances.
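The steps above might be combined, for the single-window case, as in the following sketch. It is offered under stated assumptions rather than as a definitive implementation: the edge weight is a simple count of nearby event pairs, and the clustering is a stand-in that merges types joined by an edge of at least min_weight, not the method of Appendix 1. The names find_patterns, min_weight and min_coverage are hypothetical.

```python
from collections import defaultdict


def find_patterns(events, width, min_weight, min_coverage):
    events = sorted(events, key=lambda e: e[1])      # step 1
    types = sorted({t for t, _ in events})           # step 2
    weights = defaultdict(int)                       # step 5: no edges yet
    for i, (ti, pi) in enumerate(events):            # step 6
        for tj, pj in events[i + 1:]:
            if pj - pi > width:
                break
            if ti != tj:
                weights[frozenset((ti, tj))] += 1

    # Step 7 (stand-in clustering): union types joined by heavy edges.
    parent = {t: t for t in types}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for pair, weight in weights.items():
        if weight >= min_weight:
            a, b = pair
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for t in types:
        groups[find(t)].add(t)
    patterns = [frozenset(g) for g in groups.values() if len(g) > 1]

    results = {}
    for pattern in patterns:
        pattern_events = [e for e in events if e[0] in pattern]   # step 8
        instances, current = [], []
        for event in pattern_events:                              # step 9
            if current and event[1] - current[0][1] > width:
                instances.append(current)
                current = []
            current.append(event)
        if current:
            instances.append(current)
        results[pattern] = [                                      # step 10
            inst for inst in instances
            if 100 * len({t for t, _ in inst}) / len(pattern) >= min_coverage
        ]
    return results                                                # step 11


stream = [("A", 6), ("B", 7), ("C", 9), ("D", 29),
          ("A", 44), ("B", 47), ("C", 48)]
results = find_patterns(stream, width=5, min_weight=2, min_coverage=100)
```

Applied to the example stream, the sketch reports the single pattern (A, B, C) with two complete instances.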
Pattern Properties
Our method is driven by the occurrence of event types close together, and is not sensitive to the precise order in which the events occur in the event stream. Hence a pattern of the form (A, B) does not necessarily imply that A events always precede B events in the pattern. The only implication is that these event types are seen close together. This is a significant benefit because it provides a degree of stability to the pattern finding method. The precise order, and any variations from this order, can be studied by viewing the pattern instances in an event sequence display, i.e. in the results generated by the method.
The patterns found by the method are derived from a clustering of the position-independent weighted graph generated from the sequentially-ordered event stream. Because of this, a certain amount of smoothing is imparted to the final patterns. Suppose, for example, that event type pairs (A, B), (B, C) and (A, C) are captured by several distinct windows during the initial scan of the event stream. The edge weighting scheme will attach high weight to the edges linking the vertices A, B, and C so that the graph clustering method will identify (A, B, C) as a cluster, or pattern. Nevertheless, there may be few if any positions in the event stream where A, B, and C all co-occur in close proximity. This smoothing is a significant benefit of the method. A conventional combinatorial method based on the direct counting of co-occurrence patterns would not be able to perform this smoothing. Because of this smoothing, some pattern instances may exhibit less than 100% coverage.
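The smoothing effect can be seen in a small constructed stream in which the pairs (A, B), (B, C) and (A, C) each occur twice but the three types never occur together. The stream, the width of 5 and the simple pair-count weighting are assumptions chosen for this illustration.

```python
from collections import defaultdict

# Constructed data: only two of the three types are ever close together.
events = [("A", 0), ("B", 1), ("B", 20), ("C", 21), ("A", 40), ("C", 41),
          ("A", 60), ("B", 61), ("B", 80), ("C", 81), ("A", 100), ("C", 101)]
WIDTH = 5
PATTERN = {"A", "B", "C"}

# Edge weights: count pairs of differently typed events within WIDTH.
weights = defaultdict(int)
for i, (ti, pi) in enumerate(events):
    for tj, pj in events[i + 1:]:
        if pj - pi > WIDTH:
            break
        if ti != tj:
            weights[frozenset((ti, tj))] += 1

# Instances: non-overlapping windows of contiguous pattern events.
instances, current = [], []
for event in events:
    if current and event[1] - current[0][1] > WIDTH:
        instances.append(current)
        current = []
    current.append(event)
if current:
    instances.append(current)

# Coverage of each instance, as a percentage of the pattern's types.
coverages = [100 * len({t for t, _ in inst}) / len(PATTERN)
             for inst in instances]
```

All three edges receive the same weight of 2, so a clustering of the graph would group A, B and C together, yet each of the six instances contains only two of the three types and therefore has a coverage of about 67%.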
Because our method uses a connectivity-based graph clustering technique to identify patterns (clusters), it is able to identify relatively infrequent patterns as well as frequently occurring ones. This is because even relatively infrequent patterns generate regions of locally raised connectivity in the weighted graph, and these are recognised as clusters by the clustering method.
The method of the invention may advantageously be applied directly to any form of sequential event stream. Specific examples include:
1. Telecommunication switch alarm logs (events may be types of faults occurring in the switch).
2. Alarm messages on medical monitoring equipment. Patterns here may provide diagnostically significant information to clinicians.
3. Analysis of intelligence surveillance logs.
In a transactional context, examples include:
1. Telephone call data.
2. Bank account transfer data.
3. Email message data.
4. Internet traffic data.
5. Movements of goods/people between countries/towns.
The method may also be applied directly to any combination of the above. For example, fraud investigators might wish to search for repeating patterns in movements recorded in a surveillance log, telephone call data and bank transfer data. A typical pattern might then be: Mr Pink visits Mr Blue, Mr Blue telephones Mr Big and cash flows from Account 1 to Account 2.