1. Field of the Invention
The present invention relates generally to an apparatus and method for similarity searches using hyper-rectangle based multidimensional data segmentation, and more particularly to an apparatus and method which can efficiently segment data sets representable by multidimensional data sequences (MDSs), such as video streams, and can search for similar sequences using the segmentation.
2. Description of the Prior Art
For the past several years, time-series data have been thoroughly studied in database applications such as data mining and data warehousing. Time-series data are a series of real numbers, each of which represents a value at a time point. For example, time-series data can be a sequence of real numbers such as the prices of stocks or commercial goods, weather patterns, sales indicators, biomedical measurements, and so on.
Examples of time-series data are disclosed in detail in a paper entitled "Similarity-Based queries for time series data" (May, 1997) by D. Rafiei and A. Mendelzon, published in Proceedings of ACM SIGMOD Int'l Conference on Management of Data. However, because such example values are basically one-dimensional data, most research still concentrates on indexes and searches for one-dimensional data sequences.
As the use of multimedia data has spread to many application domains, the efficient retrieval of multidimensional, voluminous and complex information, which are the intrinsic characteristics of multimedia data, is becoming increasingly important. The present invention, as described later, belongs to retrieval technology areas for data represented by sequences, such as time-series data and multimedia data, in accordance with this retrieval requirement.
In the prior art, various similarity search methods for time-series data have been proposed.
First, there is a whole sequence matching method. This method is described in detail in a paper entitled "Efficient Similarity Search in Sequence Databases" by R. Agrawal, C. Faloutsos and A. Swami, published in Proceedings of Foundations of Data Organizations and Algorithms (FODO). The method maps time sequences into the frequency domain using the Discrete Fourier Transform (DFT) in order to address the dimensionality curse problem. Each sequence whose dimensionality is reduced by the DFT is mapped to a lower-dimensional point in the frequency domain, and is indexed and stored using an R*-Tree. However, this method is limited in that a database sequence and a query sequence must be of equal length.
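The dimensionality reduction used by whole sequence matching can be illustrated with a minimal sketch. It assumes an orthonormal DFT and keeps only the first k coefficients as the feature point; the function names `dft_features` and `euclidean` are hypothetical and not taken from the cited paper. By Parseval's relation, the distance between two truncated feature points never exceeds the distance between the original sequences, which is why an index built on the features produces no false dismissals.

```python
import cmath
import math


def dft_features(seq, k):
    """Return the first k coefficients of the orthonormal DFT of seq."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq))
        / math.sqrt(n)
        for f in range(k)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length real or complex vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


# Two sequences of equal length, as the method requires.
a = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
b = [1.5, 2.5, 2.0, 4.5, 3.0, 1.0, 1.0, 0.5]

feature_distance = euclidean(dft_features(a, 2), dft_features(b, 2))
true_distance = euclidean(a, b)
# The feature-space distance is a lower bound of the true distance.
print(feature_distance <= true_distance)
```

The sketch also makes the limitation visible: both sequences must have the same length n for the coefficient-wise comparison to be meaningful.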
Second, there is a fast subsequence matching method. This method is disclosed in detail in a paper entitled "Fast Subsequence Matching in Time-Series Databases" by C. Faloutsos, M. Ranganathan and Y. Manolopoulos, published in Proceedings of ACM SIGMOD Int'l Conference on Management of Data (May, 1994). The basic idea of this method is that, using a sliding window of size w over a data sequence, it represents the w one-dimensional values included in each window by a single w-dimensional point, and transforms the one-dimensional data sequence into a lower-dimensional sequence using the DFT. The lower-dimensional data sequence is partitioned into subsequences. Each subsequence is represented by a Minimum Bounding Rectangle (MBR) and is indexed and stored using an "ST-index". A query sequence, in turn, is divided into one or more subsequences each of size w, each of which is represented by a w-dimensional point. The query processing is based on the MBRs of the data sequences stored in a database and each query point.
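The windowing and bounding steps of this method can be sketched as follows. This is a simplified illustration only: the DFT reduction step is omitted, and all windows are bounded by a single MBR, whereas the cited method partitions the trail of points adaptively. The names `sliding_windows` and `mbr` are hypothetical.

```python
def sliding_windows(seq, w):
    """Map a one-dimensional sequence to a trail of w-dimensional points."""
    return [tuple(seq[i:i + w]) for i in range(len(seq) - w + 1)]


def mbr(points):
    """Return the (low, high) corners of the minimum bounding rectangle."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high


trail = sliding_windows([1, 2, 3, 4, 5], w=3)
print(trail)       # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(mbr(trail))  # ((1, 2, 3), (3, 4, 5))
```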
However, a point in the multidimensional sequence such as video sequences is semantically different from that of one-dimensional time-series data. In the multidimensional sequence, a point itself is a vector in the multidimensional space which has various feature values.
A query on a multidimensional sequence is itself given as a multidimensional sequence, and the query sequence is also divided into multiple subsequences. In a one-dimensional sequence, each query subsequence is represented by a single point. However, in a multidimensional sequence, a subsequence cannot be represented by a single point, because each point contained in the subsequence is itself multidimensional. Consequently, this method cannot be used for the similarity search of multidimensional sequences.
Further, this method performs clustering (or segmentation) based on a Marginal COST (MCOST) defined as the average number of disk accesses (DA) divided by the number of points in the MBR. That is, when a point is to be included in a cluster or MBR during the segmentation process, the algorithm treats the volume increment of the cluster caused by the included point as an important clustering factor in determining the MCOST. However, because the algorithm considers only the volume factor, it is insufficient to cover all possible cases.
Third, there is a method using a set of safe linear transformations of a given sequence. This method is disclosed in detail in a paper entitled "Similarity-Based queries for time series data" (May, 1997) by D. Rafiei and A. Mendelzon, published in Proceedings of ACM SIGMOD Int'l Conference on Management of Data.
The set of safe linear transformations of the given sequence can be used as the basis of similarity queries for time-series data. Elements of this set formulate functions such as moving average, reversing and time warping. Such transformation functions are extended to multiple transformations, where an index is searched only once and a collection of transformations is simultaneously applied to the index, instead of searching the index multiple times and applying a single transformation each time.
However, all of the above proposed methods handle the similarity search for one-dimensional time-series data, such that the methods cannot be applied to the multidimensional data sequence. Further, these methods are problematic in that they only focus on the problem of searching a database for candidate sequences whose similarities to a query sequence do not exceed a given threshold.
Meanwhile, a similarity search method for multidimensional data sequence, as proposed later in the present invention, uses a hyper-rectangle based segmentation, and technical fields related to the hyper-rectangle based segmentation are described as follows.
The clustering problem has been studied extensively in many database applications such as customer segmentation, sales analysis, pattern recognition and similarity search. The task of clustering data points is defined as follows: "Given a set of points in a multidimensional space, partition the points into clusters such that points within each cluster have similar characteristics while points in different clusters are dissimilar. At this time, a point that is considerably dissimilar to or inconsistent with the remainder of the data is referred to as an outlier."
Conventional methods for clustering data points in a multidimensional space can include the following methods.
First, there is a method named "CLARANS" proposed in a paper entitled "Efficient and effective clustering methods for spatial data mining" by R. T. Ng and J. Han, published in Proceedings of Int'l Conference on Very Large Data Bases. The CLARANS method is based on a randomized search method and achieves its efficiency by reducing the search space using two user-supplied input parameters.
Second, there is a method named "BIRCH" proposed in a paper entitled "BIRCH: An efficient data clustering method for very large databases" by T. Zhang, R. Ramakrishnan and M. Livny, published in Proceedings of ACM SIGMOD Int'l Conference on Management of Data. The BIRCH method is a multiphase clustering method that constructs a hierarchical data structure called a CF (clustering feature)-tree by scanning a database. Further, BIRCH uses an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree. This method is the first approach that effectively handles outliers in the database area.
Third, there is a method named "DBSCAN" proposed in a paper entitled "A density-based algorithm for discovering clusters in large spatial databases with noise" by M. Ester, H. P. Kriegel, J. Sander and X. Xu, published in Int'l Conference on Knowledge Discovery in Databases and Data Mining. The DBSCAN method tries to minimize the domain knowledge required to determine input parameters and provides clusters of arbitrary shape based on the distribution of data points. The basic idea of the method is that for each point in a cluster, the neighborhood of the point within a given radius should contain at least a given number of points. Therefore, the method requires only two input parameters (i.e., the radius and the minimum number of points).
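The density condition on which DBSCAN is based can be sketched in a few lines. The sketch checks only whether the neighborhood of one point satisfies the two input parameters; it is not the complete clustering algorithm. The function name `is_core_point` is hypothetical, and the point itself is counted as a member of its own neighborhood, which is an assumption of this illustration.

```python
from math import dist


def is_core_point(p, points, radius, min_pts):
    """True if at least min_pts points (p included) lie within radius of p."""
    neighbors = sum(1 for q in points if dist(p, q) <= radius)
    return neighbors >= min_pts


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
print(is_core_point((0.0, 0.0), points, radius=1.0, min_pts=3))  # True
print(is_core_point((5.0, 5.0), points, radius=1.0, min_pts=3))  # False
```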
Fourth, there is a method named "CLIQUE" proposed in a paper entitled "Automatic subspace clustering of high dimensional data for data mining applications" by R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, published in Proceedings of ACM SIGMOD Int'l Conference on Management of Data. CLIQUE is a method for automatically identifying dense clusters in subspaces of a given high-dimensional data space. That is, the method is suitable where a cluster is not detected in the given space but can exist in its subspaces. Further, the method needs the size of the grid for partitioning the space and the global density threshold for clusters as input parameters.
Fifth, there is a method named "CURE" proposed in a paper entitled "CURE: An efficient clustering algorithm for large databases" by S. Guha, R. Rastogi and K. Shim, published in Proceedings of ACM SIGMOD Int'l Conference on Management of Data. CURE, a recent approach, identifies clusters having non-spherical shapes and wide variances in size. In this method, each cluster is represented by multiple well-scattered points. The shape of a non-spherical cluster is better represented when more than one point is used. The clustering algorithm finishes the clustering process when the number of generated clusters reaches a value given as an input parameter.
However, the conventional clustering methods require multiple input parameters, and do not consider the temporal and semantic relationships between data points. Consequently, the conventional clustering methods are problematic in that they cannot be applied to the clustering of data sequences such as video clips, in which the temporal and semantic relationships between frames are regarded as important.
Accordingly, the present invention has been made keeping in mind the above problems occurring in the prior art, and an object of the present invention is to provide an apparatus and method, which partitions a multidimensional data sequence, such as a video stream, into segments in consideration of the temporal relationship between points, and efficiently searches a database for a multidimensional data sequence similar to a given query sequence.
In accordance with one aspect of the present invention, the above object can be accomplished by the provision of an apparatus for hyper-rectangle based multidimensional data similarity searches, the multidimensional data being representable by a multidimensional data sequence, comprising MBR generation means for segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; first sequence pruning means for pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; second sequence pruning means for pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned by the first sequence pruning means in a multidimensional Euclidean space; and subsequence finding means for finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm.
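The first pruning stage can be illustrated by a minimal sketch. This passage does not give a formula for the distance Dmbr, so the sketch assumes it is the minimum Euclidean distance between two hyper-rectangles; the function names `d_mbr` and `first_pruning` and the threshold parameter `eps` are hypothetical. The normalized distance Dnorm is not sketched because its definition is not given here.

```python
from math import sqrt


def d_mbr(a, b):
    """Minimum Euclidean distance between two MBRs given as (low, high)."""
    (a_low, a_high), (b_low, b_high) = a, b
    total = 0.0
    for al, ah, bl, bh in zip(a_low, a_high, b_low, b_high):
        if ah < bl:        # a lies entirely below b in this dimension
            gap = bl - ah
        elif bh < al:      # a lies entirely above b in this dimension
            gap = al - bh
        else:              # the projections overlap
            gap = 0.0
        total += gap * gap
    return sqrt(total)


def first_pruning(query_mbrs, database, eps):
    """Keep sequences having some MBR within eps of some query MBR."""
    return [
        seq_id
        for seq_id, mbrs in database.items()
        if min(d_mbr(q, m) for q in query_mbrs for m in mbrs) <= eps
    ]


database = {
    "a": [((0.0, 0.0), (1.0, 1.0))],
    "b": [((10.0, 10.0), (11.0, 11.0))],
}
query = [((0.5, 0.5), (2.0, 2.0))]
print(first_pruning(query, database, eps=1.0))  # ['a']
```

Under this assumption, no point inside one MBR can be closer to a point inside the other MBR than the computed value, so a sequence whose MBRs all lie farther than `eps` from every query MBR can be discarded without examining its individual points.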
In accordance with another aspect of the present invention, there is provided an apparatus for hyper-rectangle based multidimensional data similarity searches, comprising MBR generation means for segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; first sequence pruning means for pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; second sequence pruning means for pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned by the first sequence pruning means in a multidimensional Euclidean space; and subsequence finding means for finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm; wherein the MBR generation means includes threshold calculation means for inputting a multidimensional sequence Si and the minimum number of points per segment minPts, and calculating bounding threshold values for a volume and an edge using a unit hyper-cube occupied by a single point in n-dimensional unit space, if points are uniformly distributed in a hyper-rectangle which is a minimum bounding rectangle containing all points in the sequence Si, segment generation means for initializing a segment set and an outlier set to empty sets and generating a current segment using a first point of the sequence Si, geometric condition determination means for determining whether a next point of the sequence Si satisfies a geometric condition using the
bounding threshold values for the volume and the edge, segment merging means for merging the next point of the sequence Si into the current segment if geometric condition is satisfied, and segment updating means for including the current segment in the segment set and re-generating a new current segment using the next point of the sequence Si, if the geometric condition is not satisfied and the number of points contained in the current segment exceeds the minimum number of points per segment minPts.
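The segmentation performed by the MBR generation means can be sketched as follows. The sketch rests on several assumptions that are not stated in this passage: the volume threshold is taken as the volume of the unit hyper-cube (the volume of the MBR enclosing the whole sequence divided by the number of points), the edge threshold as the edge length of that hyper-cube, the geometric condition as a bound on the growth of the MBR volume and of its longest-growing edge, and a run of points that does not exceed minPts is placed in the outlier set. The names `mbr_of`, `edges`, `segment_sequence`, `tau_vol` and `tau_edge` are hypothetical.

```python
from math import prod


def mbr_of(points):
    """Return the (low, high) corners of the minimum bounding rectangle."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high


def edges(box):
    """Edge lengths of an MBR given as (low, high)."""
    low, high = box
    return [h - l for l, h in zip(low, high)]


def segment_sequence(seq, min_pts):
    """Partition seq (a list of n-dimensional tuples) into segments and outliers."""
    n = len(seq[0])
    # Assumed thresholds: the unit hyper-cube one point would occupy if the
    # points were uniformly distributed in the MBR of the whole sequence.
    tau_vol = prod(edges(mbr_of(seq))) / len(seq)
    tau_edge = tau_vol ** (1.0 / n)

    segments, outliers = [], []
    current = [seq[0]]

    def close(run):
        # Assumption: a run that does not exceed min_pts becomes outliers.
        if len(run) > min_pts:
            segments.append(run)
        else:
            outliers.extend(run)

    for point in seq[1:]:
        old, new = mbr_of(current), mbr_of(current + [point])
        d_vol = prod(edges(new)) - prod(edges(old))
        d_edge = max(en - eo for en, eo in zip(edges(new), edges(old)))
        if d_vol <= tau_vol and d_edge <= tau_edge:
            current.append(point)   # geometric condition satisfied: merge
        else:
            close(current)          # store the segment, start a new one
            current = [point]
    close(current)                  # assumption: flush the final run likewise
    return segments, outliers


seq = [(0.0, 0.0), (0.05, 0.02), (0.1, 0.05),
       (0.9, 0.9), (0.95, 0.92), (1.0, 1.0)]
segments, outliers = segment_sequence(seq, min_pts=2)
print(len(segments), len(outliers))  # 2 0
```

Because points are examined strictly in sequence order, each segment is a contiguous subsequence, which is how the temporal relationship between points is preserved. Only minPts is supplied by the caller; the thresholds are derived from the data.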
In accordance with still another aspect of the present invention, there is provided an apparatus for hyper-rectangle based multidimensional data similarity searches, comprising MBR generation means for segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; first sequence pruning means for pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; second sequence pruning means for pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned by the first sequence pruning means in a multidimensional Euclidean space; and subsequence finding means for finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm; wherein the MBR generation means includes threshold calculation means for inputting a multidimensional sequence Si and the minimum number of points per segment minPts, and calculating bounding threshold values for a volume and a semantic factor using a unit hyper-cube occupied by a single point in n-dimensional unit space, if points are uniformly distributed in a hyper-rectangle which is a minimum bounding rectangle containing all points in the sequence Si, segment generation means for initializing a segment set and an outlier set to empty sets and generating a current segment using a first point of the sequence Si, geometric and semantic condition determination means for determining whether a next point of the sequence Si satisfies geometric 
and semantic conditions using the bounding threshold values for the volume and the semantic factor, segment merging means for merging the next point of the sequence Si into the current segment if the geometric and semantic conditions are satisfied, and segment updating means for including the current segment in the segment set and re-generating a new current segment using the next point of the sequence Si, if the geometric and semantic conditions are not satisfied and the number of points contained in the current segment exceeds the minimum number of points per segment minPts.
In accordance with still another aspect of the present invention, there is provided an apparatus for hyper-rectangle based multidimensional data similarity searches, comprising MBR generation means for segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; first sequence pruning means for pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; second sequence pruning means for pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned by the first sequence pruning means in a multidimensional Euclidean space; and subsequence finding means for finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm; wherein the MBR generation means includes threshold calculation means for inputting a multidimensional sequence Si and the minimum number of points per segment minPts, and calculating bounding threshold values for a volume, an edge and a semantic factor using a unit hyper-cube occupied by a single point in n-dimensional unit space, if points are uniformly distributed in a hyper-rectangle which is a minimum bounding rectangle containing all points in the sequence Si, segment generation means for initializing a segment set and an outlier set to empty sets and generating a current segment using a first point of the sequence Si, geometric and semantic condition determination means for determining whether a next point of the sequence Si satisfies 
geometric and semantic conditions using the bounding threshold values for the volume, the edge and the semantic factor, segment merging means for merging the next point of the sequence Si into the current segment if the geometric and semantic conditions are satisfied, and segment updating means for including the current segment in the segment set and re-generating a new current segment using the next point of the sequence Si, if the geometric and semantic conditions are not satisfied and the number of points contained in the current segment exceeds the minimum number of points per segment minPts.
In accordance with still another aspect of the present invention, there is provided a method for hyper-rectangle based multidimensional data similarity searches, the multidimensional data being representable by a multidimensional data sequence, comprising the steps of segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned in a multidimensional Euclidean space; and finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm.
In accordance with still another aspect of the present invention, there is provided a method for hyper-rectangle based multidimensional data similarity searches, comprising the steps of segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned in a multidimensional Euclidean space; and finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm; wherein the MBR generation step includes the steps of inputting a multidimensional sequence Si and the minimum number of points per segment minPts, and calculating bounding threshold values for a volume and an edge using a unit hyper-cube occupied by a single point in n-dimensional unit space, if points are uniformly distributed in a hyper-rectangle which is a minimum bounding rectangle containing all points in the sequence Si, initializing a segment set and an outlier set to empty sets and generating a current segment using a first point of the sequence Si, determining whether a next point of the sequence Si satisfies a geometric condition using the bounding threshold values for the volume and the edge, merging the next point of the sequence Si into the current segment if the geometric condition is satisfied, and including the current segment in the segment set and updating the segment
set by re-generating a new current segment using the next point of the sequence Si, if the geometric condition is not satisfied and the number of points contained in the current segment exceeds the minimum number of points per segment minPts.
In accordance with still another aspect of the present invention, there is provided a method for hyper-rectangle based multidimensional data similarity searches, comprising the steps of segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned in a multidimensional Euclidean space; and finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm; wherein the MBR generation step includes the steps of inputting a multidimensional sequence Si and the minimum number of points per segment minPts and calculating bounding threshold values for a volume and a semantic factor using a unit hyper-cube occupied by a single point in n-dimensional unit space, if points are uniformly distributed in a hyper-rectangle which is a minimum bounding rectangle containing all points in the sequence Si, initializing a segment set and an outlier set to empty sets and generating a current segment using a first point of the sequence Si, determining whether a next point of the sequence Si satisfies geometric and semantic conditions using the bounding threshold values for the volume and the semantic factor, merging the next point of the sequence Si into the current segment if the geometric and semantic conditions are satisfied, and including the current
segment in the segment set and updating the segment set by re-generating a new current segment using the next point of the sequence Si, if the geometric and semantic conditions are not satisfied and the number of points contained in the current segment exceeds the minimum number of points per segment minPts.
In accordance with still another aspect of the present invention, there is provided a method for hyper-rectangle based multidimensional data similarity searches, comprising the steps of segmenting a multidimensional data sequence to be partitioned into subsequences, and representing each subsequence by each Minimum Bounding Rectangle (MBR), such that sets of MBRs are generated from the multidimensional data sequence, and the MBR sets are stored in a database; pruning irrelevant data sequences using a distance Dmbr between MBRs extracted from an inputted query sequence and the MBR sets stored in the database in a multidimensional Euclidean space; pruning irrelevant data sequences using a normalized distance Dnorm between MBRs extracted from the query sequence and the MBR sets of data sequences remaining after the data sequences are pruned in a multidimensional Euclidean space; and finding subsequences similar to the given query sequence by obtaining sets of points contained in MBRs involved in a calculation of the distance Dnorm from each sequence obtained using the distance Dnorm; wherein the MBR generation step includes the steps of inputting a multidimensional sequence Si and the minimum number of points per segment minPts, and calculating bounding threshold values for a volume, an edge and a semantic factor using a unit hyper-cube occupied by a single point in n-dimensional unit space, if points are uniformly distributed in a hyper-rectangle which is a minimum bounding rectangle containing all points in the sequence Si, initializing a segment set and an outlier set to empty sets and generating a current segment using a first point of the sequence Si, determining whether a next point of the sequence Si satisfies geometric and semantic conditions using the bounding threshold values for the volume, the edge and the semantic factor, merging the next point of the sequence Si into the current segment if the geometric and semantic conditions are satisfied, and
including the current segment in the segment set and updating the segment set by re-generating a new current segment using the next point of the sequence Si, if the geometric and semantic conditions are not satisfied and the number of points contained in the current segment exceeds the minimum number of points per segment minPts.