The visual content of an image may be represented by a set of features such as texture, color, and shape. In a database, the features of an image can be represented by a set of numerical values, termed a feature vector. Feature vectors of various dimensions are used for content-based retrieval. In this patent, the term “spatial data” refers to two-dimensional (2D) and three-dimensional (3D) points, polygons, and points in some d-dimensional feature space. In this patent, we disclose a novel data clustering method for the spatial data-mining problem.
Spatial data mining is the discovery of interesting characteristics and patterns that may exist in large spatial databases. Usually the spatial relationships are implicit in nature. Because of the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographic Information Systems (GIS), image database exploration, etc., it is expensive and unrealistic for users to examine spatial data in detail. Spatial data mining aims to automate the process of understanding spatial data by representing the data in a concise manner and reorganizing spatial databases to accommodate data semantics. It can be used in many applications, such as seismology (grouping earthquakes clustered along seismic faults), minefield detection (grouping mines in a minefield), and astronomy (grouping stars in galaxies).
The aim of data clustering methods is to group the objects in spatial databases into meaningful subclasses. Due to the huge amount of spatial data, an important challenge for clustering algorithms is to achieve good time-efficiency. Also, due to the diverse nature and characteristics of the sources of the spatial data, the clusters may be of arbitrary shapes. They may be nested within one another, may have holes inside, or may possess concave shapes. A good clustering algorithm should be able to identify clusters irrespective of their shapes or relative positions. Another important issue is the handling of noise. Noise objects (outliers) are objects that are not contained in any cluster and should be discarded during the mining process. Finally, a good clustering approach should produce the same clusters regardless of the order in which the input data are presented. In other words, the results should be order insensitive with respect to input data.
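The requirements above (arbitrary shapes, noise removal, order insensitivity) can be illustrated with a minimal sketch. It is a generic grid-based grouping, not the method disclosed in this patent: points are binned into square cells, cells holding fewer than `min_count` points are treated as noise, and adjacent dense cells are merged into clusters. The function name `grid_cluster` and the parameters `cell_size` and `min_count` are hypothetical and chosen only for illustration.

```python
from collections import Counter, deque


def grid_cluster(points, cell_size=1.0, min_count=2):
    """Map each 2D point (a tuple) to a cluster id, or -1 for noise.

    Illustrative assumption: a cell is "dense" if it holds at least
    min_count points; dense cells that touch (8-neighbourhood) form
    one cluster.
    """
    def cell_of(p):
        return (int(p[0] // cell_size), int(p[1] // cell_size))

    counts = Counter(cell_of(p) for p in points)
    dense = {c for c, n in counts.items() if n >= min_count}

    # Visit seed cells in sorted order so that cluster ids depend only
    # on the set of points, never on the order they were supplied in.
    label = {}
    next_id = 0
    for seed in sorted(dense):
        if seed in label:
            continue
        label[seed] = next_id
        queue = deque([seed])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in label:
                        label[nb] = next_id
                        queue.append(nb)
        next_id += 1

    # Points in sparse cells receive -1 and can be discarded as noise.
    return {p: label.get(cell_of(p), -1) for p in points}


points = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (1.7, 0.8),
          (5.1, 5.2), (5.6, 5.5), (9.0, 0.5)]
print(grid_cluster(points))
print(grid_cluster(points) == grid_cluster(list(reversed(points))))  # True
```

In this example the first four points fall into two adjacent dense cells and share one cluster, the next two form a second cluster, and the isolated point at (9.0, 0.5) is labelled as noise. Reversing the input leaves the result unchanged because the assignment depends only on cell occupancy.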
The complexity and enormous amount of spatial data may prevent the user from knowing in advance how many clusters are present. Thus, clustering algorithms should not require the number of clusters in the spatial domain as an input. To be of maximum use to the user, clustering algorithms should classify spatial data at different levels of detail. For example, in an image database, the user may ask whether a particular image is of type agricultural or residential. Suppose the system identifies the image as belonging to the agricultural category; the user may be satisfied with this broad classification. Alternatively, the user may go on to inquire about the actual type of crop that the image shows. This requires clustering at hierarchical levels of coarseness, which we call the multi-resolution property.
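The multi-resolution property can be illustrated with a small sketch, offered under stated assumptions and not as the disclosed method. A grid of cell counts is coarsened by summing each 2x2 block, and connected groups of occupied cells are counted at each level; the number of clusters is an output, never an input. The names `coarsen` and `count_clusters` and the `threshold` parameter are hypothetical.

```python
def coarsen(grid):
    """Halve the resolution by summing each 2x2 block of cell counts."""
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    return [[grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
             + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]
             for c in range(cols)]
            for r in range(rows)]


def count_clusters(grid, threshold=1):
    """Count 4-connected groups of cells whose count is >= threshold."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    clusters = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] < threshold or (r, c) in seen:
                continue
            clusters += 1
            seen.add((r, c))
            stack = [(r, c)]
            while stack:  # flood fill over neighbouring occupied cells
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] >= threshold
                            and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return clusters


# An 8x8 grid with two nearby groups (top left) and one distant group.
fine = [[0] * 8 for _ in range(8)]
for r, c in [(0, 0), (0, 1), (1, 0), (1, 1), (0, 3), (1, 3),
             (6, 6), (6, 7), (7, 6), (7, 7)]:
    fine[r][c] = 1

print(count_clusters(fine), count_clusters(coarsen(fine)))  # 3 2
```

At the fine level the two nearby groups are reported separately; after one coarsening step they merge into a single broader cluster, while the distant group remains distinct. This mirrors the broad-category versus specific-crop example in the text.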
In the description of the present invention which follows, we cite the following references:
[AF97] D. Allard and C. Fraley. Non parametric maximum likelihood estimation of features in spatial process using voronoi tesselation. Journal of the American Statistical Association, December 1997.
[AGGR98] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 94-105, Seattle, Wash., 1998.
[BR95] S. Byers and A. E. Raftery. Nearest neighbor clutter removal for estimating features in spatial point processes. Technical Report 295, Department of Statistics, University of Washington, 1995.
[COM95] Special Issue on Content-Based Image Retrieval Systems, Editors V. N. Gudivada and V. V. Raghavan, IEEE Computer, 28(9), 1995.
[EKSX95] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of 2nd International Conference on KDD, 1996.
[EKSX98] M. Ester, H. Kriegel, J. Sander, and X. Xu. Clustering for mining in large spatial databases. KI-Journal, 1998. Special Issue on Data Mining, ScienTec Publishing.
[Gor81] A. D. Gordon. Classification Methods for the Exploratory Analysis of Multivariate Data. Chapman and Hall, 1981.
[HJS94] Michael L. Hilton, Bjorn D. Jawerth, and Ayan Sengupta. Compressing Still and Moving Images with Wavelets. Multimedia Systems, 2(5):218-227, December 1994.
[Hor88] Berthold Klaus Paul Horn. Robot Vision. The MIT Press, fourth edition, 1988.
[JFS95] Charles E. Jacobs, Adam Finkelstein, and David H. Salesin. Fast multiresolution image querying. In SIGGRAPH 95, Los Angeles, Calif., August 1995.
[JM95] R. Jain and S. N. J. Murthy. Similarity Measures for Image Databases. In Proceedings of the SPIE Conference on Storage and Retrieval of Image and Video Databases III, pages 58-67, 1995.
[Knu98] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley, third edition, 1998.
[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[Mal89a] S. Mallat. Multiresolution approximation and wavelet orthonormal bases of L²(R). Transactions of American Mathematical Society, 315:69-87, September 1989.
[Mal89b] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:674-693, July 1989.
[NH94] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th VLDB Conference, pages 144-155, Santiago, Chile, 1994.
[NS80] D. Nassimi and S. Sahni. Finding connected components and connected ones on a mesh-connected parallel computer. SIAM Journal on Computing, 9:744-757, 1980.
[Ope77] S. Openshaw. A geographical solution to scale and aggregation problems in region-building, partitioning and spatial modelling. Transactions of the Institute of British Geographers, 2:459-472, 1977.
[OT81] S. Openshaw and P. Taylor. Quantitative Geography: A British View, chapter The Modifiable Areal Unit Problem, pages 60-69. London: Routledge, 1981.
[PFG97] E. J. Pauwels, P. Fiddelaers, and L. Van Gool. DOG-based unsupervised clustering for CBIR. In Proceedings of the 2nd International Conference on Visual Information Systems, pages 13-20, San Diego, Calif., December 1997.
[SC94] J. R. Smith and S. Chang. Transform Features For Texture Classification and Discrimination in Large Image Databases. In Proceedings of the IEEE International Conference on Image Processing, pages 407-411, 1994.
[Sch92] Robert Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, Inc., 1992.
[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In Proceedings of the 24th VLDB Conference, pages 428-439, New York City, August 1998.
[SN96] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press, 1996.
[SV82] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Journal of Algorithms, 3:57-67, 1982.
[SZ97] G. Sheikholeslami and A. Zhang. An Approach to Clustering Large Visual Databases Using Wavelet Transform. In Proceedings of the SPIE Conference on Visual Data Exploration and Analysis IV, pages 322-333, San Jose, February 1997.
[SZB97] G. Sheikholeslami, A. Zhang, and L. Brian. Geographical Data Classification and Retrieval. In Proceedings of the 5th ACM International Workshop on Geographical Information Systems, pages 58-61, Las Vegas, Nev., November 1997.
[URB97] Greet Uytterhoeven, Dirk Roose, and Adhemar Bultheel. Wavelet transforms using lifting scheme. Technical Report ITA-Wavelets Report WP 1.1, Katholieke Universiteit Leuven, Department of Computer Science, Belgium, April 1997.
[Vai93] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall Signal Processing Series. Prentice Hall, Englewood Cliffs, N.J., 1993.
[WYM97] Wei Wang, Jiong Yang, and Richard Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proceedings of the 23rd VLDB Conference, pages 186-195, Athens, Greece, 1997.
[XMKS98] X. Xu, M. Ester, H. Kriegel, and J. Sander. A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings of the 14th International Conference on Data Engineering, pages 324-331, Orlando, Fla., February 1998.
[YCSZ98] D. Yu, S. Chatterjee, G. Sheikholeslami, and A. Zhang. Efficiently detecting arbitrary shaped clusters in very large datasets with high dimensions. Technical Report 98-8, State University of New York at Buffalo, Department of Computer Science and Engineering, November 1998.
[ZM97] Mohamed Zait and Hammou Messatfa. A comparative study of clustering methods. Future Generation Computer Systems, 13:149-159, November 1997.
[ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada, 1996.
Thus, a long-felt need has existed for a wavelet-based method of managing spatial data in very large databases.