The present invention relates to a subsequence matching method in time-series databases, and particularly to such a method that improves performance by using duality in constructing windows.
First, we define some terminology needed for the further description of the present invention.
A “sequence” of length n is an array of n entries. “Time-series data” are sequences of real numbers representing values at specific time points. A “time-series database” is a database that stores time-series data.
The time-series data stored in a time-series database are called “data sequences.” The sequences given by a user are called “query sequences.” Finding data sequences in the database that are similar to the query sequence is called “similar sequence matching.”
In the above definition, two sequences are said to be “similar” if the distance between them is less than or equal to the user-specified “tolerance” ε. We say that two sequences X and Y are in “ε-match” if the distance between X and Y is less than or equal to ε. We define “n-dimensional distance computation” as the operation that computes the distance between two sequences of length n.
The present invention is independent of the specific distance computation method. For ease of understanding, however, we describe it based on the Euclidean distance. Given two sequences $X = \{x_0, x_1, \ldots, x_{n-1}\}$ and $Y = \{y_0, y_1, \ldots, y_{n-1}\}$ of the same length n, the Euclidean distance between X and Y is defined as

$$\sqrt{\sum_{i=0}^{n-1} (x_i - y_i)^2}.$$
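The definitions of Euclidean distance and ε-match can be illustrated with a minimal Python sketch. The function names `euclidean_distance` and `is_eps_match` are hypothetical and chosen only for this illustration.

```python
import math

def euclidean_distance(x, y):
    """n-dimensional distance computation for two sequences of the same length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_eps_match(x, y, eps):
    """True if x and y are in eps-match, i.e. their distance is at most eps."""
    return euclidean_distance(x, y) <= eps
```

For example, `is_eps_match([0, 0], [3, 4], 5)` holds because the distance between the two sequences is exactly 5.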
If a sequence S includes a sequence A (i.e., A is a part of S), A is called a “subsequence” of S. Similar sequence matching can be classified into the following two categories:
Whole matching: Given N data sequences S1, S2, ..., SN, a query sequence Q, and the tolerance ε, we find those data sequences that are in ε-match with Q. Here, the data and query sequences must have the same length.
Subsequence matching: Given N data sequences S1, S2, ..., SN of varying lengths, a query sequence Q, and the tolerance ε, we find all the sequences Si, one or more subsequences of which are in ε-match with Q, and the offsets in Si of those subsequences.
A “window” is a unit into which sequences are divided. According to the dividing method, windows are classified into sliding windows and disjoint windows. Windows starting from every possible offset in a sequence are called “sliding windows.” FIG. 1a shows an example of dividing a sequence into sliding windows of size 4. In FIG. 1a, reference no. 201 is a sequence, and reference no. 202 denotes the sliding windows of size 4. Windows starting from offsets that are multiples of the window size are called “disjoint windows.” FIG. 1b shows an example of dividing a sequence into disjoint windows of size 4. In FIG. 1b, reference no. 203 is a sequence, and reference no. 204 denotes the disjoint windows of size 4.
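The two ways of dividing a sequence can be sketched as follows. This is an illustration only: the function names are hypothetical, offsets are 0-based, and dropping a trailing partial window in the disjoint case is an assumption of this sketch.

```python
def sliding_windows(seq, w):
    """Windows of size w starting at every possible offset."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def disjoint_windows(seq, w):
    """Windows of size w starting at offsets that are multiples of w.

    A trailing remainder shorter than w is dropped (assumption of this sketch).
    """
    return [(i, seq[i:i + w]) for i in range(0, len(seq) - w + 1, w)]
```

For a sequence of length 10 and a window size of 4, the sliding windows start at offsets 0 through 6, while the disjoint windows start at offsets 0 and 4 only.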
In subsequence matching, “false dismissals” are the subsequences that are in ε-match with the given query sequence but are erroneously missed, and “false alarms” are the subsequences that are not in ε-match with the query sequence but are selected as similar subsequences. Neither false dismissals nor false alarms should occur in the final result of similar sequence matching.
The function used to extract f features (f < n) from a sequence of length n is called the “feature extraction function.” To be used in similar sequence matching, the function must guarantee no false dismissals. To do so, it must satisfy the conditions presented in Agrawal, R., Faloutsos, C., and Swami, A., “Efficient Similarity Search in Sequence Databases,” In Proc. the 4th Int'l Conf. on Foundations of Data Organization and Algorithms, Chicago, Ill., pp. 69-84, October 1993 [Reference 1] and Faloutsos, C., Ranganathan, M., and Manolopoulos, Y., “Fast Subsequence Matching in Time-Series Databases,” In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Minneapolis, Minn., pp. 419-429, May 1994 [Reference 2].
We also define some notation needed for the further description of the present invention.
Len(S) is the length of sequence S. S[k] is the k-th entry of sequence S, S[i:j] is the subsequence consisting of the entries from the i-th through the j-th, and S[i:j] can be represented as S[i:k]S[k+1:j]. When S is divided into disjoint windows, si represents the i-th disjoint window of sequence S. Lastly, ω is the length of the sliding or disjoint window.
Recently, large amounts of time-series data have been generated in various areas, such as stock prices, growth rates of companies, exchange rates, biomedical measurements, and weather data. Owing to faster computing speeds and larger storage devices, there have been a number of efforts to utilize these data. In particular, similar sequence matching in time-series data has become an important research topic in data mining, one of the new database applications.
Below, we explain the previous similar sequence matching methods in time-series databases.
In the previous method of [Reference 1], the authors introduced a solution for the whole matching problem. The outline of the solution is as follows.
First, each data sequence of length n is transformed into an f-dimensional point by using the feature extraction function, and this point is indexed using an f-dimensional index. Only a small number of features are extracted because high-dimensional sequences are difficult to store in a multidimensional index owing to the dimensionality problem of such indexes (called the “dimensionality curse”). Next, the query sequence is similarly transformed into an f-dimensional point, and a range query is constructed using the point and the given tolerance ε. Then, the multidimensional index is searched to evaluate the query, and a candidate set is constructed consisting of the feature points that are in ε-match with the query point. This method guarantees no false dismissals, but may cause false alarms because it uses only f features instead of n.
Thus, for each candidate sequence, the actual data sequence is accessed from the disk, its distance from the query sequence is computed, and the candidate is discarded if it is a false alarm. This last step, which eliminates false alarms, is called the “post-processing step.”
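The whole matching outline above (feature extraction, range query, post-processing) can be sketched in Python under several assumptions. The feature extraction shown here, scaled segment means, is a stand-in chosen because it never overestimates the Euclidean distance; it is not taken from [Reference 1]. A plain list scanned linearly stands in for the f-dimensional index, and all names are hypothetical.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def extract_features(seq, f):
    """Hypothetical feature extraction: f segment sums scaled by 1/sqrt(segment length).

    By the Cauchy-Schwarz inequality the distance between two feature points
    never exceeds the distance between the original sequences, so filtering
    on features causes no false dismissals.
    """
    n = len(seq)
    if n % f != 0:
        raise ValueError("this sketch assumes f divides the sequence length")
    m = n // f
    return [sum(seq[j * m:(j + 1) * m]) / math.sqrt(m) for j in range(f)]

def whole_match(data, query, eps, f):
    # Index stand-in: one (feature point, sequence id) pair per data sequence.
    index = [(extract_features(s, f), sid) for sid, s in enumerate(data)]
    q_point = extract_features(query, f)
    # Range query: candidates may still contain false alarms.
    candidates = [sid for point, sid in index if dist(point, q_point) <= eps]
    # Post-processing step: check the actual n-dimensional distance.
    return [sid for sid in candidates if dist(data[sid], query) <= eps]
```

With `data = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10]]`, the query `[1, 2, 3, 4]`, a tolerance of 1.0, and f = 2, the first two sequences are returned and the third is filtered out by the range query.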
In the previous method of [Reference 2], the authors proposed a subsequence matching method as a generalization of the whole matching method of [Reference 1]. In the present invention, we simply call this method “FRM,” after the authors' initials. The outline of the method is as follows.
In subsequence matching, subsequences similar to the query sequence can be found anywhere in a data sequence. To find all possible subsequences, FRM uses a sliding window of size ω starting from every possible offset in the data sequence. It then divides the query sequence into disjoint windows of size ω and retrieves similar subsequences by using those disjoint windows. Each sliding window is transformed to a point in a lower-dimensional space. Since too many points are generated to be stored individually in an index, FRM uses a heuristic method to construct minimum bounding rectangles (MBRs) that contain hundreds or thousands of points, and then stores those MBRs in a multidimensional index. Lastly, it performs subsequence matching on query sequences of various lengths.
For subsequence matching on query sequences of various lengths, FRM presents and uses the following two theorems.
Theorem 1
When two sequences S and Q of the same length are divided into p disjoint windows si and qi (1 ≤ i ≤ p) respectively, if S and Q are in ε-match, then at least one of the pairs (si, qi) is in ε/√p-match.
Theorem 2
If two sequences S and Q of the same length are in ε-match, then any pair of subsequences (S[i:j], Q[i:j]) is also in ε-match.
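Both theorems follow directly from the definition of the Euclidean distance. The derivation below is a short sketch; D(·,·) is notation introduced here for the Euclidean distance, and Theorem 1 is taken under the assumption that the p disjoint windows cover S and Q exactly.

```latex
% D(X, Y) denotes the Euclidean distance (notation assumed for this sketch).
\begin{align*}
\text{Theorem 1:}\quad
  & D(S,Q)^2 = \sum_{i=1}^{p} D(s_i, q_i)^2 \le \varepsilon^2 . \\
  & \text{If } D(s_i, q_i) > \varepsilon/\sqrt{p} \text{ for every } i, \text{ then }
    \sum_{i=1}^{p} D(s_i, q_i)^2 > p \cdot \frac{\varepsilon^2}{p} = \varepsilon^2, \\
  & \text{a contradiction; hence } D(s_i, q_i) \le \varepsilon/\sqrt{p} \text{ for some } i. \\[4pt]
\text{Theorem 2:}\quad
  & D(S[i:j], Q[i:j])^2 = \sum_{k=i}^{j} (S[k]-Q[k])^2
    \le \sum_{k} (S[k]-Q[k])^2 = D(S,Q)^2 \le \varepsilon^2 .
\end{align*}
```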
By using Theorems 1 and 2 above, FRM divides the query sequence into p disjoint windows, transforms each window to an f-dimensional point, makes a range query using the point and the tolerance ε/√p, and constructs a candidate set by searching the multidimensional index. Lastly, it performs the post-processing step to eliminate false alarms by accessing the data sequence and executing a Len(Q)-dimensional distance computation for each candidate.
In the subsequence matching, the more false alarms are included in the candidate set constructed by searching the index, the more disk accesses and CPU operations for Len(Q)-dimensional distance computations are incurred in the post-processing step. Thus, false alarms are the main cause of performance degradation.
In FRM, the main reason why false alarms occur is that it does not store individual points directly in the multidimensional index, but stores only MBRs that contain multiple points. That is, for the same range query, many subsequences that would not become candidates if individual points were stored do become candidates when only MBRs are stored.
If, however, FRM stored every individual point in the index, it would generate too many f-dimensional points (almost as many as the sum of the lengths of all data sequences), and the index would need f times more storage than the original data sequences require. Moreover, the search performance may degrade significantly due to the excessive height of the multidimensional index (refer to [Reference 2]). Accordingly, because it stores only MBRs, FRM cannot obtain the “point-filtering effect,” which reduces false alarms by storing individual points directly in the index and using them for point-to-point comparison. Thus, FRM suffers from a large number of false alarms and significant performance degradation.
The present invention is devised to solve the problems of the previous method discussed above. A purpose of the present invention is to provide a subsequence matching method in time-series databases, called “Dual Match” (Duality-based subsequence Matching), which reduces false alarms drastically and improves performance significantly by using duality in constructing windows, that is, by dividing data sequences into disjoint windows and the query sequence into sliding windows.
Another purpose of the present invention is to provide a subsequence matching method in time-series databases that reduces false alarms drastically and improves performance significantly by storing individual points directly in the index and thereby exploiting the point-filtering effect.
Another purpose of the present invention is to provide a subsequence matching method in time-series databases that creates the index faster than the previous method by reducing the number of calls to the feature extraction function, which is a major part of CPU overhead in index creation.
As the first characteristic to accomplish the purposes, the present invention provides a subsequence matching method in time-series databases that consists of the following four steps: the first step that uses duality in constructing windows; the second step that divides data sequences into disjoint windows based on the first step; the third step that divides the query sequence into sliding windows based on the first step; and the fourth step that performs subsequence matching using the windows constructed in the second and third steps.
As an additional characteristic of the above fourth step, to exploit the point-filtering effect and reduce false alarms, the present invention includes the following two steps: storing individual points, which represent the disjoint windows of the data sequences, directly in the multidimensional index; and using individual points, which represent the sliding windows of the query sequence, directly in the range queries.
Here, to reduce the number of range queries, the present invention provides a step that uses MBRs containing multiple points, rather than the individual points that represent the sliding windows of the query sequence, for the range queries that construct the candidate set.
Moreover, as an additional characteristic of the fourth step, the present invention includes a step that divides data sequences into disjoint windows rather than sliding windows, which speeds up index creation by reducing the number of calls to the feature extraction function needed in index creation.
Meanwhile, as the second characteristic to accomplish the purposes, the present invention provides a subsequence matching method in time-series databases that includes the following index building process to create a multidimensional index for subsequence matching.
The index building process consists of the following eight steps: the first step that creates and initializes an f-dimensional index; the second step that reads a data sequence from the database to the main memory; the third step that divides the data sequence, which is read in the second or eighth step, into disjoint windows; the fourth step that transforms the disjoint window to an f-dimensional point; the fifth step that constructs a record <the transformed point, the data sequence identifier, the start offset of the window>; the sixth step that inserts the record into the f-dimensional index; the seventh step that checks whether there is any more sequence to read from the database or not, after repeating from the third step to the fifth step for all disjoint windows; and the eighth step that ends the index building process if there is no more data sequence to read, or continues the process by returning to the third step after reading a data sequence if there is a data sequence to read.
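The index building process can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a plain list stands in for the f-dimensional index, the data sequences are already in memory, and `extract_features` is a hypothetical lower-bounding feature extraction (scaled segment means) that is not prescribed by the text.

```python
import math

def extract_features(window, f):
    """Hypothetical feature extraction: scaled segment sums (assumes f divides the window size)."""
    m = len(window) // f
    return [sum(window[j * m:(j + 1) * m]) / math.sqrt(m) for j in range(f)]

def build_index(data_sequences, w, f):
    index = []  # stand-in for a newly created, empty f-dimensional index
    for seq_id, seq in enumerate(data_sequences):      # read each data sequence
        for offset in range(0, len(seq) - w + 1, w):   # divide it into disjoint windows
            point = extract_features(seq[offset:offset + w], f)
            record = (point, seq_id, offset)           # <point, sequence id, start offset>
            index.append(record)                       # insert the record into the index
    return index
```

Because only disjoint windows are transformed, a data sequence of length n contributes roughly n/ω records and calls to the feature extraction function, rather than one per offset.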
Moreover, as the third characteristic to accomplish the purposes, the present invention provides a subsequence matching method in time-series databases that includes the following subsequence matching process to find subsequences similar to the user-specified query sequence by using the multidimensional index and the time-series database.
The subsequence matching process consists of the following seven steps: the first step that calculates the minimum number of disjoint windows included in a subsequence; the second step that divides a query sequence into sliding windows; the third step that transforms each sliding window to an f-dimensional point by using the feature extraction function; the fourth step that constructs a range query using the transformed point, the number of disjoint windows obtained from the first step, and the user-specified tolerance; the fifth step that evaluates the range query, which is made in the fourth step, and constructs a candidate set by using the search result; the sixth step that reads a candidate subsequence from the database to the main memory after completing the construction of the candidate set by repeating from the third step to the fifth step for all sliding windows; and the seventh step that checks whether the candidate subsequences are false alarms or not by calculating the distances between them and the query sequence.
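The seven steps can be sketched in Python under several assumptions. A linear scan over a list stands in for the range query on the multidimensional index. The feature extraction (scaled segment means) is a hypothetical lower-bounding function. The range-query radius is taken as ε/√p following Theorem 1, and the minimum number p of disjoint windows is found by brute force over all alignments rather than by a closed formula. All names are illustrative.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def extract_features(window, f):
    """Hypothetical feature extraction (assumes f divides the window size)."""
    m = len(window) // f
    return [sum(window[j * m:(j + 1) * m]) / math.sqrt(m) for j in range(f)]

def build_index(data, w, f):
    """One record (point, sequence id, offset) per disjoint window."""
    return [(extract_features(s[o:o + w], f), sid, o)
            for sid, s in enumerate(data)
            for o in range(0, len(s) - w + 1, w)]

def min_disjoint_windows(qlen, w):
    """Step 1: fewest complete disjoint windows inside any subsequence of length qlen."""
    counts = []
    for a in range(w):                    # start offset modulo the window size
        first = -(-a // w) * w            # first window boundary at or after a
        counts.append(max(0, (a + qlen - first) // w))
    return min(counts)

def dual_match(data, index, query, eps, w, f):
    p = min_disjoint_windows(len(query), w)
    if p < 1:
        raise ValueError("query too short to contain a complete disjoint window")
    radius = eps / math.sqrt(p)
    candidates = set()
    for i in range(len(query) - w + 1):                     # step 2: sliding windows
        q_point = extract_features(query[i:i + w], f)       # step 3
        for point, sid, offset in index:                    # steps 4-5: range query
            if dist(point, q_point) <= radius:
                start = offset - i                          # align subsequence with query
                if start >= 0 and start + len(query) <= len(data[sid]):
                    candidates.add((sid, start))
    # Steps 6-7: read each candidate subsequence and discard false alarms.
    return sorted((sid, start) for sid, start in candidates
                  if dist(data[sid][start:start + len(query)], query) <= eps)
```

The result is a sorted list of (sequence identifier, start offset) pairs. Each index record is compared point-to-point with each query window, which is where the point-filtering effect comes from in this sketch.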
Moreover, as the fourth characteristic to accomplish the purposes, the present invention provides a subsequence matching method in time-series databases that includes the following enhanced subsequence matching process, which finds subsequences similar to the user-specified query sequence by using the multidimensional index and the time-series database while reducing the number of range queries.
The enhanced subsequence matching process consists of the following seven steps: the first step that calculates the minimum number of disjoint windows included in a subsequence; the second step that divides a query sequence into sliding windows, transforms each sliding window to an f-dimensional point, and then constructs MBRs containing these transformed points; the third step that constructs a range query using an MBR made in the second step, the number of disjoint windows obtained from the first step, and the user-specified tolerance; the fourth step that evaluates the range query constructed in the third step; the fifth step that finds the candidate set by calculating the distance between each point contained in the MBR, which is used for constructing the range query in the third step, and each point in the search result of the fourth step; the sixth step that reads a candidate subsequence from the database to the main memory after completing the construction of the candidate set by repeating from the third step to the fifth step for all MBRs; and the seventh step that checks whether the candidate subsequences are false alarms or not by calculating the distances between them and the query sequence.
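The MBR-based candidate construction (the second through fifth steps) can be sketched as follows. This is only an illustration: grouping consecutive query points into fixed-size groups is an assumed heuristic, a linear scan stands in for the index search, and the bounds check and post-processing of the later steps are omitted. All names are hypothetical.

```python
import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def make_mbrs(query_points, group_size):
    """Group consecutive (window offset, point) pairs and bound each group by an MBR."""
    mbrs = []
    for g in range(0, len(query_points), group_size):
        members = query_points[g:g + group_size]
        coords = [p for _, p in members]
        low = [min(c) for c in zip(*coords)]
        high = [max(c) for c in zip(*coords)]
        mbrs.append((low, high, members))
    return mbrs

def mindist_to_mbr(point, low, high):
    """Smallest Euclidean distance from a point to an axis-aligned rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(point, low, high)))

def candidates_via_mbrs(index, query_points, radius, group_size):
    """index: list of (point, seq_id, offset); returns a set of (seq_id, start)."""
    found = set()
    for low, high, members in make_mbrs(query_points, group_size):
        # One range query per MBR instead of one per sliding window.
        hits = [rec for rec in index if mindist_to_mbr(rec[0], low, high) <= radius]
        # Point-to-point comparison keeps the point-filtering effect.
        for point, seq_id, offset in hits:
            for i, q_point in members:
                if dist(point, q_point) <= radius:
                    found.add((seq_id, offset - i))
    return found
```

Because every query point lies inside its MBR, any index point within the radius of a query point is also within the radius of the MBR, so the candidate set equals the one obtained from per-point range queries while fewer queries are issued.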
As described above, Dual Match of the present invention divides data sequences into disjoint windows and the query sequence into sliding windows, whereas FRM, the previous method, divides data sequences into sliding windows and the query sequence into disjoint windows. By taking this approach, which is the dual of the previous method, Dual Match can greatly reduce false alarms and improve performance.
The FRM causes many false alarms by storing only MBRs containing multiple points rather than individual points representing windows to save the storage space for the index. However, Dual Match of the present invention solves this problem by directly storing individual points in the index with the same storage space used in FRM.
Moreover, the present invention exploits the point-filtering effect that reduces false alarms by storing individual points in the index and using the stored points for the point-to-point comparison.