1. Field of the Invention
The present invention relates to databases and, more particularly, to indexing weighted-sequences in large databases.
2. Description of the Related Art
Fast sequence indexing is essential to many applications, including time series analysis, multimedia database management, network intrusion detection, and the like. Recently, the field of molecular genetics has received increasing attention and is widely recognized as being one of the key technologies today.
Consider a domain of event management for complex networks, where events or messages are generated when special conditions arise. Each event, as well as the environment in which it occurs, is logged into a database. Given a large data set of event sequences, a typical type of query (i.e., an event sequence match) is illustrated below.

Event                   Timestamp
...                     ...
CiscoDCDLinkUp          19:08:01
MLMSocketClose          19:08:07
MLMStatusUp             19:08:21
...                     ...
MiddleLayerManagerUp    19:08:37
CiscoDCDLinkUp          19:08:39
...                     ...

Among other possible attributes of the data set (e.g., Host, Severity, etc.), the attributes Event and Timestamp are shown. The event sequence match shown above can result from the following query: Find all occurrences of CiscoDCDLinkUp that are followed by MLMStatusUp that are followed, in turn, by CiscoDCDLinkUp, under the constraint that the interval between the first and second events is 20±2 seconds, and the interval between the first and third events is 40±3 seconds. Answering such queries efficiently is important to understanding temporal causal relationships among events, which often provide actionable insights for determining problems in system management.
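By way of illustration only, the query above can be answered by a brute-force scan of the log, as in the following minimal Python sketch. The names LOG, QUERY, to_seconds, and match are hypothetical and are introduced solely for this example; the sketch assumes that all timestamps fall within a single day and represents each query event as an (event, offset, tolerance) triple, where the offset is measured in seconds from the first event of the query. It is not an indexing method, and is shown only to make the semantics of the query concrete.

```python
from datetime import datetime

# Hypothetical in-memory log; the rows mirror the example table above.
LOG = [
    ("CiscoDCDLinkUp", "19:08:01"),
    ("MLMSocketClose", "19:08:07"),
    ("MLMStatusUp", "19:08:21"),
    ("MiddleLayerManagerUp", "19:08:37"),
    ("CiscoDCDLinkUp", "19:08:39"),
]

# Each entry is (event, offset in seconds from the first event, tolerance).
QUERY = [
    ("CiscoDCDLinkUp", 0, 0),
    ("MLMStatusUp", 20, 2),
    ("CiscoDCDLinkUp", 40, 3),
]


def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    t = datetime.strptime(hms, "%H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second


def match(log, query):
    """Return tuples of log positions that satisfy the query (exhaustive scan)."""
    events = [(name, to_seconds(ts)) for name, ts in log]
    results = []

    def extend(start_time, last_idx, k, chosen):
        # All query events matched: record the positions.
        if k == len(query):
            results.append(tuple(chosen))
            return
        name, offset, tol = query[k]
        # Try every later log entry as a candidate for query event k.
        for i in range(last_idx + 1, len(events)):
            ev, t = events[i]
            if ev == name and abs((t - start_time) - offset) <= tol:
                extend(start_time, i, k + 1, chosen + [i])

    first_event = query[0][0]
    for i, (ev, t) in enumerate(events):
        if ev == first_event:
            extend(t, i, 1, [i])
    return results


print(match(LOG, QUERY))
```

For the five rows shown, the scan reports the positions of the first, third, and fifth rows, whose intervals of 20 seconds and 38 seconds fall within the stated tolerances. Because every candidate start position is examined, the cost of such a scan grows with the size of the log, which is the inefficiency an index is intended to avoid.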
A query can involve any number of events, and each event has an approximate weight, which, as described herein, is the elapsed time between the occurrence of the event and the occurrence of the first event (CiscoDCDLinkUp) in the query sequence. Event sequences generally exhibit two characteristic issues (i.e., the weighted-sequence problem): (1) in real-life data sets, more often than not, certain events occur more frequently than others, which may affect query performance; and (2) it is unlikely that two causally related events are separated by a very large time gap. Currently known solutions do not address the weighted-sequence problem.
There has been much research on indexing substrings. A suffix tree, for example, is a very useful data structure that embodies a compact index to all the distinct, non-empty substrings of a given string. The suffix tree is described in greater detail in E. M. McCreight. A space-economical suffix tree construction algorithm, Journal of the ACM, 23(2):262-272, April 1976.
The suffix tree, however, is not adequate to solve the problem of event sequence matching, as described above, because it only provides fast access for searching contiguous subsequences in a string database. More specifically, in string matching, the relative positions of two elements in a string are also used to embody the distance between them, while in the example provided above, the distance between two elements is expressed explicitly by another dimension (i.e., the weight).
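The limitation can be illustrated with the following minimal Python sketch, which uses a sorted list of suffixes as a simplified stand-in for a suffix tree (it is not McCreight's construction). The names EVENTS, build_suffix_index, and find_contiguous are hypothetical and are introduced solely for this example. The index locates runs of adjacent events, but it reports no match for the three events of the example query, because other events occur between them in the log.

```python
from bisect import bisect_left

# Event names from the example table, with the timestamps dropped.
EVENTS = [
    "CiscoDCDLinkUp",
    "MLMSocketClose",
    "MLMStatusUp",
    "MiddleLayerManagerUp",
    "CiscoDCDLinkUp",
]


def build_suffix_index(seq):
    """Sorted (suffix, start) pairs: a naive stand-in for a suffix tree."""
    return sorted((tuple(seq[i:]), i) for i in range(len(seq)))


def find_contiguous(index, pattern):
    """Return start positions where pattern occurs as a contiguous run."""
    pattern = tuple(pattern)
    # Suffixes that begin with the pattern form one block in sorted order.
    pos = bisect_left(index, (pattern,))
    hits = []
    while pos < len(index) and index[pos][0][:len(pattern)] == pattern:
        hits.append(index[pos][1])
        pos += 1
    return sorted(hits)


index = build_suffix_index(EVENTS)

# Adjacent events are found.
print(find_contiguous(index, ["MLMSocketClose", "MLMStatusUp"]))

# The events of the example query are not adjacent, so nothing is found.
print(find_contiguous(index, ["CiscoDCDLinkUp", "MLMStatusUp", "CiscoDCDLinkUp"]))
```

In addition, the index carries no timestamp information, so a constraint such as 20±2 seconds cannot be expressed against it at all.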
Similarity based subsequence matching has been a research focus for applications such as time series databases. Similarity based subsequence matching is described in greater detail in C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases, In SIGMOD, pages 419-429, 1994. The basic idea is to map each data sequence into a small set of multidimensional rectangles in the feature space. Traditional spatial access methods (e.g., R-tree) are then used to index and retrieve these rectangles. Here, retrieval is based on similarity of the time-series within a continuous time interval. The method cannot be applied to solve the weighted-sequence problem because the pattern to retrieve is usually a non-contiguous subsequence in the original sequence.
Recently, the problem of exact matching for multidimensional strings has been addressed in H. V. Jagadish, N. Koudas, and D. Srivastava, On effective multi-dimensional indexing for strings, In SIGMOD, pages 403-414, 2000. Strings are mapped to real numbers based on their lexical order. Then these multidimensional points are indexed using R-trees. This technique works efficiently for queries such as “find a person whose name begins with Sri and telephone number begins with 973”. However, this technique does not address how to find objects that match a given pattern instead of exact values.
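The general idea of an order-preserving mapping can be illustrated with the following minimal Python sketch. It is a simplified illustration under stated assumptions and is not the method of the cited reference: strings are assumed to be 7-bit ASCII without NUL characters, each string is read as a base-128 fraction, and a linear filter stands in for the R-tree. The names BASE, RECORDS, to_real, prefix_interval, and prefix_query, as well as the sample records, are hypothetical.

```python
from fractions import Fraction

BASE = 128  # assumption: 7-bit ASCII strings with no NUL characters

# Made-up records of the form (name, telephone number).
RECORDS = [
    ("Srinivasan", "9735550100"),
    ("Smith", "9735550101"),
    ("Sridhar", "2125550102"),
]


def to_real(s):
    """Map a string to an exact rational in [0, 1) preserving lexical order."""
    return sum(Fraction(ord(c), BASE ** (i + 1)) for i, c in enumerate(s))


def prefix_interval(prefix):
    """Half-open interval [lo, hi) covering every string with this prefix."""
    lo = to_real(prefix)
    return lo, lo + Fraction(1, BASE ** len(prefix))


def prefix_query(records, name_prefix, phone_prefix):
    """Two prefix conditions become one rectangle in the 2-D point space."""
    n_lo, n_hi = prefix_interval(name_prefix)
    p_lo, p_hi = prefix_interval(phone_prefix)
    points = [(to_real(n), to_real(p), (n, p)) for n, p in records]
    # A linear filter stands in for the spatial index (e.g., an R-tree).
    return [rec for x, y, rec in points
            if n_lo <= x < n_hi and p_lo <= y < p_hi]


print(prefix_query(RECORDS, "Sri", "973"))
```

The sketch shows why prefix conditions suit this representation: each prefix corresponds to one interval per dimension. A pattern over a sequence of weighted events has no comparable single rectangle.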
There has been little research in fast retrieval of numerical patterns in relational tables. The techniques described above cannot be applied directly to solve the weighted-sequence problem, largely because they only handle one-dimensional series. On the other hand, much research has been devoted to finding frequent patterns in large databases (e.g., Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In VLDB, Zurich, Switzerland, September 1995.). These methods typically scan a data set multiple times in order to find patterns whose occurrence level is beyond a threshold. That is, finding frequent patterns is a clustering problem that requires repeated passes over the data set. Therefore, the complexity of these algorithms is at least O(N), and some are even of exponential complexity.
Accordingly, there exists a need for an efficient solution for searching large databases to find objects that exhibit a given pattern or sequence of events.