1. Field of the Invention
The present invention relates generally to an apparatus and method for hyper-rectangle based multidimensional data segmentation and clustering, and more particularly to an apparatus and method which can efficiently perform segmentation and clustering with respect to data sets representable by multidimensional data sequences (MDS's), such as video streams.
2. Description of the Prior Art
For the past several years, time-series data have been thoroughly studied in database applications such as data mining and data warehousing. Time-series data are a series of real numbers, each of which represents a value at a time point. For example, the time-series data can be a sequence of real numbers such as the prices of stocks or commercial goods, weather patterns, sales indicators, biomedical measurements, etc.
Examples of time-series data are disclosed in detail in a thesis entitled “Similarity-Based queries for time series data” (May, 1997) by “D. Rafiei and A. Mendelzon” and published in “Proceedings of ACM SIGMOD Int'l Conference on Management of Data”. However, because such example values are basically one-dimensional data, most research still concentrates on indexes and searches for one-dimensional data sequences.
As the use of multimedia data has spread to many application domains, the efficient representation of multidimensional, voluminous and complex information, these being intrinsic characteristics of multimedia data, is becoming increasingly important. The present invention, as described later, belongs to the field of clustering technology for data represented by sequences, such as time-series data and multimedia data, and addresses this representation requirement.
Meanwhile, video information processing is one of the areas that is difficult to handle in spite of its utility, because it requires huge amounts of storage space and processing power. In order to overcome such difficulties, it is essential that video data be effectively represented, stored and retrieved.
Video data sets are collections of video clips each having a running time ranging from several seconds to several minutes. Here, each video clip can be represented by a multidimensional data sequence (MDS), and each MDS is partitioned into segments in consideration of the temporal relationship between points. In this case, similar segments are grouped again into clusters in one sequence. Accordingly, each sequence is represented by a small number of clusters.
Meanwhile, the clustering problem has been studied extensively in many database applications such as customer segmentation, sales analysis, pattern recognition and similarity search. The task of clustering data points is defined as follows: “Given a set of points in a multidimensional space, partition the points into clusters such that points within each cluster have similar characteristics while points in different clusters are dissimilar. At this time, a point that is considerably dissimilar to or inconsistent with the remainder of the data is referred to as an outlier.”
Currently, clustering methods are extensively studied. However, a clustering method for sequences should be handled in a way different from the conventional clustering methods in various respects.
First, the temporal relationship between points must be considered in sequence clustering.
Second, in conventional methods, one object as a target to be clustered is represented by one point and thus belongs to a single cluster, while in the sequence clustering method, one object is represented by multiple points, such that the points belong to multiple separate clusters.
Third, the shapes of the clusters may also be considered differently. Conventional clustering methods tend to focus on the cluster's own quantitative property, independent of how the clusters will be utilized in the future. In other words, conventional clustering methods concentrate on the problem of determining a certain number of clusters for optimizing given criteria such as the Mean Square Error (MSE). Therefore, the shapes of clusters are determined arbitrarily according to the distribution of points in the data space. In contrast, for sequence clustering, the future storage and retrieval of sequences must be considered along with the clustering itself, and these considerations must be reflected in the sequence clustering.
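As an illustration of the kind of criterion referred to above, the following is a minimal Python sketch that evaluates the MSE of a given partition. It assumes Euclidean distance and takes the MSE to be the mean squared distance from each point to the centroid of its cluster; the function name `mean_square_error` and the sample points are hypothetical and are not taken from any of the cited methods.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_square_error(points, labels):
    """Mean squared distance from each point to the centroid of its cluster."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total / len(points)


# Four illustrative points forming two tight groups far apart on the x axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
mse_by_x = mean_square_error(points, [0, 0, 1, 1])  # groups nearby points
mse_by_y = mean_square_error(points, [0, 1, 0, 1])  # groups distant points
print(mse_by_x, mse_by_y)
```

The criterion ranks the first partition above the second purely from the distribution of the points. Nothing in it reflects the order in which the points occur or how the resulting clusters would later be stored and searched.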
Conventional methods for clustering data points in a multidimensional space include the following.
First, there is a method named “CLARANS” proposed in a thesis entitled “Efficient and effective clustering methods for spatial data mining” by “R. T. Ng and J. Han” and published in “Proceedings of Int'l Conference on Very Large Data Bases”. The CLARANS method is based on a randomized search method and achieves its efficiency by reducing the search space using two user-supplied input parameters.
Second, there is a method named “BIRCH” proposed in a thesis entitled “BIRCH: An efficient data clustering method for very large databases” by “T. Zhang, R. Ramakrishnan, and M. Livny” and published in “Proceedings of ACM SIGMOD Int'l Conference on Management of Data”. The “BIRCH” method is a multiphase clustering method for constructing a hierarchical data structure called a CF (clustering feature)-tree by scanning a database. Further, BIRCH uses an arbitrary clustering algorithm so as to cluster the leaf nodes of the CF-tree. This method is the first approach that effectively handles outliers in the database area.
Third, there is a method named “DBSCAN” proposed in a thesis entitled “A density-based algorithm for discovering clusters in large spatial databases with noise” by “M. Ester, H. P. Kriegel, J. Sander, and X. Xu” and published in “Int'l Conference on Knowledge Discovery in Databases and Data Mining”. The “DBSCAN” method tries to minimize requirements of domain knowledge to determine input parameters and provides arbitrary shapes of clusters based on the distribution of data points. The basic idea of the method is that for each point in a cluster, the neighborhood of the point within a given radius should contain at least a given number of points. Therefore, the method requires only two input parameters (i.e., radius and the minimum number of points).
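The neighborhood condition described for “DBSCAN” can be sketched in Python as follows. This is only the per-point density test under stated assumptions, not the full clustering procedure: Euclidean distance is assumed, the point itself is counted as part of its own neighborhood, and the names `is_core_point`, `radius` and `min_points` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def is_core_point(points, index, radius, min_points):
    """Check whether the neighborhood of points[index] is dense enough.

    The neighborhood is every point within `radius` of the chosen point,
    the point itself included (an assumed convention).
    """
    center = points[index]
    neighbors = sum(1 for p in points if dist(center, p) <= radius)
    return neighbors >= min_points


# Three illustrative points close together and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]
core_flags = [
    is_core_point(points, i, radius=1.0, min_points=3)
    for i in range(len(points))
]
print(core_flags)
```

With these two parameters, the three nearby points satisfy the condition and the isolated point does not, so the isolated point would be a candidate for treatment as noise.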
Fourth, there is a method named “CLIQUE” proposed in a thesis entitled “Automatic subspace clustering of high dimensional data for data mining applications” by “R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan” and published in “Proceedings of ACM SIGMOD Int'l Conference on Management of Data”. “CLIQUE” is a method for automatically identifying dense clusters in subspaces of a given high-dimensional data space. That is, the method is suitable where even though a cluster is not detected in a given space, the cluster can exist in the subspaces. Further, the method needs the size of the grid for partitioning the space and the global density threshold for clusters as the input parameters.
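The role of the two input parameters of “CLIQUE” can be illustrated with the following minimal Python sketch. It only counts points per grid cell in one chosen subspace and keeps the cells whose share of points exceeds the density threshold; it does not attempt the subspace search itself. The function name `dense_units`, the assumption that all coordinates lie in the range 0 to 1, and the sample data are hypothetical.

```python
from collections import Counter


def dense_units(points, dims, intervals, density_threshold, low=0.0, high=1.0):
    """Return the grid cells of subspace `dims` that hold more than
    `density_threshold` (a fraction) of all points.

    Each selected dimension is cut into `intervals` equal-width bins.
    """
    width = (high - low) / intervals
    counts = Counter()
    for p in points:
        # Clamp so that a coordinate equal to `high` falls in the last bin.
        cell = tuple(min(int((p[d] - low) / width), intervals - 1) for d in dims)
        counts[cell] += 1
    total = len(points)
    return {cell for cell, n in counts.items() if n / total > density_threshold}


# Illustrative data: x values concentrate near 0, y values are spread out.
points = [(0.1, 0.1), (0.2, 0.4), (0.15, 0.6), (0.05, 0.9), (0.6, 0.3), (0.9, 0.8)]

dense_in_x = dense_units(points, dims=(0,), intervals=4, density_threshold=0.4)
dense_in_y = dense_units(points, dims=(1,), intervals=4, density_threshold=0.4)
dense_in_xy = dense_units(points, dims=(0, 1), intervals=4, density_threshold=0.4)
print(dense_in_x, dense_in_y, dense_in_xy)
```

In this example a dense cell appears only in the one-dimensional subspace of the first coordinate, while no cell of the full two-dimensional space is dense. This is the situation in which a cluster is not detected in the given space but exists in a subspace.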
Fifth, there is a method named “CURE” proposed in a thesis entitled “CURE: An efficient clustering algorithm for large databases” by “S. Guha, R. Rastogi, and K. Shim” and published in “Proceedings of ACM SIGMOD Int'l Conference on Management of Data”. “CURE”, a more recent method, identifies clusters having non-spherical shapes and wide variances in size. In such a method, each cluster is represented with multiple well-scattered points. The shape of a non-spherical cluster is better represented when more than one point is used. Such a clustering algorithm finishes the clustering process when the number of generated clusters reaches a given value supplied as an input parameter.
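The idea of representing a cluster by several well-scattered points can be sketched in Python as follows. The sketch assumes a farthest-point heuristic for choosing the points and Euclidean distance; it covers only the selection of representative points for one already-formed cluster, and the function name `scattered_points` and the sample cluster are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def scattered_points(cluster, count):
    """Pick up to `count` mutually distant points from one cluster.

    Assumed heuristic: start with the point farthest from the centroid,
    then repeatedly add the point farthest from all points chosen so far.
    """
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
    chosen = [max(cluster, key=lambda p: dist(p, centroid))]
    while len(chosen) < min(count, len(cluster)):
        # Distance of a candidate to the chosen set = distance to its nearest member.
        chosen.append(max(cluster, key=lambda p: min(dist(p, c) for c in chosen)))
    return chosen


# An elongated (non-spherical) illustrative cluster lying along the x axis.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
representatives = scattered_points(cluster, count=3)
print(representatives)
```

For this elongated cluster the chosen points are its two ends and its middle, which convey its extent along the axis, whereas a single central point would not.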
However, the conventional clustering methods require multiple input parameters, and do not consider the temporal and semantic relationships between data points. Consequently, the conventional clustering methods are problematic in that they cannot be applied to the clustering of data sequences such as video clips, in which the temporal and semantic relationships between frames are regarded as important.