Temporal or spatial-temporal data constitutes a large portion of the data stored in computers. A need exists in many emerging applications for similarity matches as opposed to exact matches on the data. Examples include various commercial applications, such as identifying companies with similar growth patterns, determining products with similar selling patterns, and identifying stocks having similar long- or short-term price trends; and various scientific applications, such as identifying specific weather patterns, identifying specific geological features, identifying specific environmental pollution, and identifying specific astrophysics patterns.
A similarity search against a database consisting of a collection of objects usually involves the specification of a target. The objects within a user-defined distance from the target are then retrieved. Similarity searches usually incorporate a similarity measure or a distance metric. Two patterns are considered to be "similar" if the distance between them, as computed by the distance metric, is less than a predefined threshold.
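The threshold-based retrieval described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the distance metric and equal-length numeric sequences as the stored objects; the function names `euclidean_distance` and `range_query` and the sample data are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Distance metric between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(database, target, threshold):
    """Return (key, distance) pairs for objects closer to target than threshold."""
    hits = []
    for key, seq in database.items():
        d = euclidean_distance(seq, target)
        if d < threshold:  # "similar" means distance below the threshold
            hits.append((key, d))
    return sorted(hits, key=lambda kv: kv[1])  # nearest first


# Hypothetical sample data, for illustration only.
database = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.1, 2.1, 2.9, 4.2],
    "c": [4.0, 3.0, 2.0, 1.0],
}
hits = range_query(database, [1.0, 2.0, 3.0, 4.0], threshold=0.5)
print([key for key, _ in hits])
```

This sketch scans every stored object, which is the linear cost that indexing and feature-space methods aim to avoid.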
One example of a prior art search technique is described by R. Agrawal, C. Faloutsos, and A. Swami, in an article entitled "Efficient Similarity Search in Sequence Databases," Fourth International Conference on Foundations of Data Organization and Algorithms, Chicago, October 1993. In that technique, similarity matches are based on the computation of the mean-square-error of the first few Fourier coefficients of two sequences. However, this method does not address scaling and possible phase differences between the two sequences. Moreover, the target sequence and the sequences in the database must have the same length. The length restriction is addressed in C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. SIGMOD'94, pp. 419-429, 1994, in which a similarity search is performed on all possible subsequences by generating the first few Fourier coefficients of all possible subsequences of a given length for each sequence. The two-Fourier-coefficient representation of each subsequence can be viewed as a point in a two-dimensional feature space. The locations of several points in the Fourier domain, each of which corresponds to a subsequence, can be combined and approximately represented by a rectangle, thus reducing overall storage requirements. This method, nevertheless, does not solve the scaling problem. Another problem is that insufficient information may be retained in the feature space, which can significantly increase the number of false hits.
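The Fourier-feature comparison described above can be sketched as follows. This is an illustrative approximation, not the implementation of the cited articles: it assumes a plain discrete Fourier transform normalized by 1/sqrt(n), and the names `dft_features`, `feature_mse`, and `subsequence_features` are hypothetical. The last lines show the scaling limitation noted above: a sequence and a scaled copy of itself are far apart in this feature space.

```python
import cmath


def dft_features(seq, k=2):
    """First k DFT coefficients of seq, normalized by 1/sqrt(n) (assumed)."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq))
        / n ** 0.5
        for f in range(k)
    ]


def feature_mse(fa, fb):
    """Mean-square error between two lists of complex coefficients."""
    return sum(abs(a - b) ** 2 for a, b in zip(fa, fb)) / len(fa)


def subsequence_features(seq, window, k=2):
    """Features of every sliding-window subsequence of the given length."""
    return [
        dft_features(seq[i:i + window], k)
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical sample data, for illustration only.
base = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
scaled = [3.0 * x for x in base]  # same shape, different amplitude

same = feature_mse(dft_features(base), dft_features(base))
diff = feature_mse(dft_features(base), dft_features(scaled))
windows = subsequence_features(base, window=4)
print(same, diff > 1.0, len(windows))
```

Each entry of `windows` is one point in the feature space. Grouping nearby points into bounding rectangles, as the second article describes, is omitted from this sketch.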
The aforementioned co-pending U.S. patent application by V. Castelli et al. describes a new method for constructing a database that allows similarity matches which are insensitive to possible scale and phase differences between the sequences stored in the database and the target sequence. Furthermore, many more features from the original temporal and/or spatial-temporal sequences are retained, thus reducing the possibility of false hits. In this method, each sequence to be stored in the database is segmented into non-overlapping or minimally overlapping subsequences of equal length. Each subsequence is then normalized (such as with respect to the energy or maximum amplitude of each sequence) and transformed into a series of coefficients in the feature space. A search is performed based on a hierarchical correlation in the feature space between the target sequence and the subsequences. The target sequence and the stored sequences are correlated first at the lowest level in the hierarchy. At any given level, a match is declared when the correlated result is larger than a predetermined threshold. Sequences that fail to satisfy the matching criterion are discarded. The process is continued at the next level until the highest level is reached. Because of the hierarchical search, a linear scan of the entire sequence can be avoided. Although this approach is phase and scale insensitive, it does not allow similarity searches to be performed at a semantic level.
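The segment, normalize, and hierarchically correlate steps described above can be sketched roughly as follows. The text does not specify the transform or the hierarchy, so this sketch makes several assumptions: the hierarchy is built by repeated pairwise averaging, the "lowest level" is taken to be the coarsest one, correlation is computed as a normalized inner product, and phase handling is omitted entirely. All function names and sample data are hypothetical.

```python
import math


def segment(seq, length):
    """Split seq into non-overlapping, equal-length subsequences (remainder dropped)."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


def normalize_energy(sub):
    """Scale a subsequence to unit energy; an all-zero input is returned unchanged."""
    energy = math.sqrt(sum(x * x for x in sub))
    return [x / energy for x in sub] if energy > 0 else list(sub)


def pyramid(sub):
    """Levels from coarsest to finest via pairwise averaging (assumed transform).

    The length of sub is assumed to be a power of two.
    """
    levels = [list(sub)]
    while len(levels[-1]) > 2:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels[::-1]  # index 0 is the coarsest (lowest) level


def correlation(a, b):
    """Normalized correlation of two equal-length vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def build_index(sequences, length):
    """Map (sequence name, segment position) to that segment's pyramid."""
    index = {}
    for name, seq in sequences.items():
        for pos, sub in enumerate(segment(seq, length)):
            index[(name, pos)] = pyramid(normalize_energy(sub))
    return index


def hierarchical_search(index, target, threshold):
    """Discard candidates level by level, starting at the coarsest level."""
    target_levels = pyramid(normalize_energy(target))
    candidates = list(index)
    for level, target_vec in enumerate(target_levels):
        candidates = [
            key for key in candidates
            if correlation(index[key][level], target_vec) > threshold
        ]
    return candidates


# Hypothetical sample data, for illustration only.
sequences = {
    "s1": [1, 2, 3, 4, 4, 3, 2, 1, 5, 5, 5, 5, 1, 1, 1, 1],
    "s2": [4, 3, 2, 1, 1, 2, 3, 4, 10, 20, 30, 40, 40, 30, 20, 10],
}
index = build_index(sequences, 8)
matches = hierarchical_search(index, [2, 4, 6, 8, 8, 6, 4, 2], threshold=0.95)
print(matches)
```

In this example the target matches two stored segments that share its shape at different amplitudes. One non-matching segment is discarded at the coarsest level and another at the next level, so neither is compared at full resolution.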
Thus, a need exists for a method and system for performing similarity searches that is phase and scale insensitive and that allows similarity searches to be performed at a semantic level. The present invention addresses such a need.