1. Field of the Invention
The present invention relates generally to data matching and, more particularly, to event sequence matching.
2. Prior Art
Monitoring a large telecommunication network can result in an extensive log of alarms or other events of different types that occurred in the system. Similar log files may also be produced in mobile commerce systems, in web applications, and in mobile services. Such logs, or event sequences, generally consist of pairs (e,t), where e is an event type and t is the occurrence time of the event.
The sequences of events in a data flow can be, for example, sequences of events (alarms) with their corresponding occurrence times in a telecommunications network. The purpose of finding similar situations in these sequences of events, as in many other data analysis applications, is to predict events and to understand the dynamics of the process producing the sequence. In these applications, similarity finding can help to customise individual services or interfaces by means of predictions and regularities derived from previous behaviour.
The problem of finding similar situations can be described as follows. Given a sequence of events S=<(e1,t1), . . . ,(en,tn)>, a time t and a window width w, find another time s such that the subsequences of S consisting of the events occurring in the half-open intervals (t−w,t] and (s−w,s], respectively, are similar. These subsequences are from here on called the slices S(t,w) and S(s,w) of S. The slices are themselves sequences of events. The similarity between two slices can be defined using an edit distance notion, i.e. the distance is defined as the cost of the cheapest possible sequence of operations that transforms one slice into the other. The operations are insertion of an event, deletion of an event, and moving an event in time, and each operation has an associated cost. The edit distance can be computed using a known dynamic programming algorithm.
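By way of illustration only, the slice definition and the edit distance notion described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm of any cited reference: the function names slice_events and edit_distance, the choice to measure event times relative to the start of each window, and the cost parameters indel_cost and move_cost are all hypothetical and chosen for the example.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (event type e, occurrence time t)


def slice_events(seq: List[Event], t: float, w: float) -> List[Event]:
    """Return the slice S(t, w): events of seq in the half-open interval (t - w, t].

    Assumption: times are re-expressed relative to the window start t - w,
    so that slices taken at different times can be compared directly.
    """
    start = t - w
    return [(e, u - start) for (e, u) in seq if start < u <= t]


def edit_distance(a: List[Event], b: List[Event],
                  indel_cost: float = 1.0, move_cost: float = 0.1) -> float:
    """Cheapest cost of transforming slice a into slice b.

    Operations (costs are illustrative assumptions):
      - insert or delete one event: indel_cost
      - move an event in time:      move_cost * |time difference|
    Both slices are assumed to be ordered by occurrence time.
    """
    n, m = len(a), len(b)
    # d[i][j] = cost of transforming the first i events of a into the first j of b
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost          # delete all i events
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost          # insert all j events
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + indel_cost,   # delete a[i-1]
                       d[i][j - 1] + indel_cost)   # insert b[j-1]
            if a[i - 1][0] == b[j - 1][0]:
                # same event type: align the two events by moving one in time
                shift = abs(a[i - 1][1] - b[j - 1][1])
                best = min(best, d[i - 1][j - 1] + move_cost * shift)
            d[i][j] = best
    return d[n][m]


# Hypothetical alarm log: the pattern A, B recurs ten time units later, followed by C.
S = [("A", 1), ("B", 2), ("A", 11), ("B", 12), ("C", 13)]
first = slice_events(S, t=3, w=3)     # events in (0, 3]
second = slice_events(S, t=13, w=3)   # events in (10, 13]
distance = edit_distance(first, second)
```

In this sketch the table d has (n+1)(m+1) entries, so comparing one pair of slices of lengths n and m takes time proportional to n·m, and that cost is incurred again for every candidate time s that is examined.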
Prior art solutions for finding similar situations using known dynamic programming algorithms are slow because of the high computational complexity of these algorithms. Furthermore, assigning costs to the edit operations is quite problematic, as disclosed in “Pirjo Moen. Attribute, Event Sequence and Event Type Similarity Notions for Data Mining. PhD thesis, University of Helsinki, Department of Computer Science, Finland, February 2000”. In prior art practices, there has also been considerable interest in defining intuitive and easily computable measures of similarity between complex objects and in using abstract similarity notions in querying databases, as disclosed in: [1] Gautam Das, Heikki Mannila and Pirjo Ronkainen, “Similarity of attributes by external probes”, in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), pages 23-29, 1998; [2] E.-H. Han, G. Karypis, V. Kumar and B. Mobasher, “Clustering based on association rule hypergraphs”, in Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997; [3] H. V. Jagadish, A. O. Mendelzon and T. Milo, “Similarity-based queries”, in Proceedings of the 14th Symposium on Principles of Database Systems (PODS), pages 36-45, 1995; [4] A. J. Knobbe and P. W. Adriaans, “Analyzing binary associations”, in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), pages 311-314, 1996; [5] Y. Karov and S. Edelman, “Similarity-based word sense disambiguation”, in Computational Linguistics, 24(1):41-59, 1998; and [7] D. A. White and R. Jain, “Algorithms and strategies for similarity retrieval”, in Technical Report VCL-96-101, Visual Computing Laboratory, UC Davis, 1996.
With ever-increasing amounts of information surrounding us in everyday life, and with numerous applications and services whose quality relies on data processing, faster and more reliable methods for retrieving information and for deriving added value from data are needed to make better, or even entirely new, applications and services possible. In many fields of application, time series or ordered sets of data are an advantageous way of modelling data.
Therefore, it is desirable to provide a method and system to efficiently analyze large amounts of data.