Information processing systems are increasingly expected to handle large amounts of data, such as news data, client information, patent information, and stock market data. Users of such databases find it increasingly difficult to search for desired information quickly, effectively, and with sufficient accuracy. Therefore, timely, accurate, and inexpensive detection of documents from large databases may provide very valuable information for many types of businesses. In addition, users sometimes wish to obtain further information related to the retrieved data, such as cluster information in the database and the interrelationships among such clusters.
Typical methods for detecting clusters rely upon a measure of similarity between data elements; the methods based on similarity search that have been proposed so far are summarized below.
Similarity search (also known as proximity search) is a search in which items of a database are sought according to how well they match a given query element. Similarity (or rather, dissimilarity) is typically modeled using some real- or integer-valued distance ‘metric’ dist; that is, a function satisfying the following conditions:
(1) dist(p, q)≧0 for all p, q (non-negativity);
(2) dist(p, q)=dist(q, p) for all p, q (symmetry);
(3) dist(p, q)=0 if and only if p=q;
(4) dist(p, q)+dist(q, r)≧dist(p, r) for all p, q, r (triangle inequality).
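Conditions (1) through (4) can be checked mechanically on a finite sample of objects. The following is a minimal sketch, assuming the Euclidean distance as an illustrative choice of dist; the function names euclidean, squared_euclidean and is_metric_on are hypothetical and do not come from the cited works. A sample can only refute the conditions, not prove them for the whole space.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples (illustrative choice of dist)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    """Squared Euclidean distance; it violates condition (4) in general."""
    return euclidean(p, q) ** 2


def is_metric_on(points, dist, tol=1e-12):
    """Return True if dist satisfies conditions (1)-(4) on all combinations of the given points."""
    for p, q in product(points, repeat=2):
        d = dist(p, q)
        if d < 0:                          # (1) non-negativity
            return False
        if abs(d - dist(q, p)) > tol:      # (2) symmetry
            return False
        if (d <= tol) != (p == q):         # (3) zero exactly when p = q
            return False
    for p, q, r in product(points, repeat=3):
        if dist(p, q) + dist(q, r) < dist(p, r) - tol:   # (4) triangle inequality
            return False
    return True


points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
line = [(0.0,), (1.0,), (2.0,)]
print(is_metric_on(points, euclidean))
print(is_metric_on(line, squared_euclidean))
```

The second call illustrates why condition (4) matters: the squared distance from 0 to 2 is 4, which exceeds the sum 1 + 1 of the squared distances through the intermediate point 1.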
Any set of objects for which such a distance function exists is called a metric space. A data structure that allows a reduction in the number of distance evaluations at query time is known as an index. Many methods for similarity queries have been proposed. Similarity queries on metric spaces are of two general types, as stated below:
(A) k-nearest-neighbor query: given a query element q and a positive integer k, report the k closest database elements to q.
(B) range query: given a query element q and a distance r, report every database item p such that dist(p, q)≦r.
For large databases, it is too expensive to perform similarity queries by explicitly computing the distances from the query element to every database element. Precomputing and storing all distances among database elements is also too expensive, as this would require time and space proportional to the square of the number of database elements (that is, quadratic time and space). A more practical goal is to construct a search structure that can handle queries in sub-linear time using sub-quadratic storage and preprocessing time.
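The two query types (A) and (B) can be illustrated with the exhaustive baseline that an index is meant to improve upon. The following is a minimal sketch, assuming points stored as tuples and the Euclidean distance as dist; the names knn_query and range_query are hypothetical. Each query evaluates dist once per database element, which is exactly the linear cost described above.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples (illustrative choice of dist)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_query(database, q, k, dist=euclidean):
    """(A) Report the k database elements closest to q by scanning every element."""
    return heapq.nsmallest(k, database, key=lambda p: dist(p, q))


def range_query(database, q, r, dist=euclidean):
    """(B) Report every database element p with dist(p, q) <= r by scanning every element."""
    return [p for p in database if dist(p, q) <= r]


database = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (2.0, 2.0)]
query = (0.9, 0.1)

print(knn_query(database, query, k=2))
print(range_query(database, query, r=1.0))
```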
A. Review of Vector Space Models
Current information retrieval methods often use vector space modeling to represent the documents of databases. In such vector space models, each document in the database under consideration is associated with a vector, each coordinate of which represents a keyword or attribute of the document; details of vector space models are provided elsewhere (Gerard Salton, The SMART Retrieval System—Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, N.J., USA, 1971).
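As a small illustration of the vector space model, the sketch below associates each document with a vector whose coordinates are keyword counts. Raw term counts, whitespace tokenization, and the cosine measure used to compare vectors are all assumptions made for the example; the text above does not prescribe a weighting scheme or a similarity measure, and the function names are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of all distinct keywords; one vector coordinate per keyword."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_vector(document, vocabulary):
    """Vector of raw keyword counts for one document (weighting scheme is an assumption)."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["stock market data", "patent data search", "stock market news"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
print(cosine_similarity(vectors[0], vectors[2]))
```

In this toy collection the first and third documents share two of their three keywords, so their vectors are closer in angle than those of the first and second documents, which share only one.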
B. Brief Survey of Similarity Search Structures
A great variety of structures have been proposed over the past thirty years for handling similarity queries. The majority of these are spatial indices, which require that each object be modeled as a vector of d real-valued attributes. Others are ‘metric’ indices, which make no assumptions on the nature of the database elements other than the existence of a distance metric, and are therefore more widely applicable than spatial search structures. For recent surveys of search structures for multi-dimensional vector spaces and metric spaces, see Gaede et al. (Volker Gaede and Oliver Gunther, Multidimensional Access Methods, ACM Computing Surveys, 30, 2, 1998, pp. 170-231.), and Chavez et al. (Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates and Jose L. Marroquin, Searching in metric spaces, ACM Computing Surveys 33, 3, 2001, pp. 273-321.).
The practicality of similarity search, whether it be on metric data or vector data, is limited by an effect often referred to as the ‘curse of dimensionality’. Recent evidence suggests that for the general problem of computing nearest-neighbor or range queries on high-dimensional data sets, exact techniques are unlikely to improve substantially over a sequential search of the entire database, unless the underlying distribution of the data set has special properties, such as a low fractal dimension or a low intrinsic dimension.
For more information regarding data dimension and the curse of dimensionality, see (for example) Chavez et al. (op. cit.), Pagel et al. (Bernd-Uwe Pagel, Flip Korn and Christos Faloutsos, Deflating the dimensionality curse using multiple fractal dimensions, Proc. 16th International Conference on Data Engineering (ICDE 2000), San Diego, USA, IEEE CS Press, 2000, pp. 589-598.), Pestov (Vladimir Pestov, On the geometry of similarity search: dimensionality curse and concentration of measure, Information Processing Letters, 73, 2000, pp. 47-51.), and Weber et al. (Roger Weber, Hans-J. Schek and Stephen Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, Proc. 24th VLDB Conference, New York, USA, 1998, pp. 194-205).
C. Brief Survey of Approximate Similarity Searching
In an attempt to circumvent the curse of dimensionality, researchers have considered sacrificing some of the accuracy of similarity queries in the hope of obtaining a speed-up in computation. Details of these techniques are provided elsewhere, for example, by Indyk et al. (P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proc. 30th ACM Symposium on Theory of Computing, Dallas, 1998, pp. 604-613.), and Ferhatosmanoglu et al. (Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal and Amr El Abbadi, Approximate nearest neighbor searching in multimedia databases, Proc. 17th International Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE CS Press, 2001, pp. 503-514.); for metric spaces, by Ciaccia et al. (Paolo Ciaccia and Marco Patella, PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces, Proc. 16th International Conference on Data Engineering (ICDE 2000), San Diego, USA, 2000, pp. 244-255; Paolo Ciaccia, Marco Patella and Pavel Zezula, M-tree: an efficient access method for similarity search in metric spaces, Proc. 23rd VLDB Conference, Athens, Greece, 1997, pp. 426-435.) and Zezula et al. (Pavel Zezula, Pasquale Savino, Giuseppe Amato and Fausto Rabitti, Approximate similarity retrieval with M-trees, The VLDB Journal, 7, 1998, pp. 275-293.). However, these methods all suffer from deficiencies that limit their usefulness in practice. Some make unrealistic assumptions concerning the distribution of the data; others cannot effectively manage the trade-off between accuracy and speed.
D. Spatial Approximation Sample Hierarchy (SASH)
An approximate similarity search structure for large multi-dimensional data sets that allows significantly better control over the accuracy-speed tradeoff is the spatial approximation sample hierarchy (SASH), described in Houle (Michael E. Houle, SASH: a spatial approximation sample hierarchy for similarity search, IBM Tokyo Research Laboratory Research Report RT-0446, 18 pages, Feb. 18, 2002) and Houle, Kobayashi and Aono (Japanese Patent Application No. 2002-037842). The SASH requires a similarity function satisfying the conditions of a distance metric, but otherwise makes no assumptions regarding the nature of the data. Each data element is given a unique location within the structure, and each connection between two elements indicates that they are closely related. Each level of the hierarchy consists of a random sample of the elements, the sample size at each level roughly double that of the level immediately above it. The structure is organized in such a way that the elements located closest to a given element v are those that are most similar to v. In particular, the node corresponding to v is connected to a set of its near neighbors from the level above, and also to a set of items from the level below that choose v as a near neighbor.
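The level structure described above can be sketched as follows. This is only an illustration under stated assumptions, not the construction given in the cited SASH report: elements are shuffled and split so that each level holds roughly twice as many elements as the level above it, and each element is then linked to its nearest elements in the level above by an exhaustive scan. The exhaustive scan, the parameter num_parents, and the function names assign_levels and link_to_level_above are all hypothetical simplifications introduced here.

```python
import random


def assign_levels(elements, seed=0):
    """Randomly partition elements into levels, ordered from the top (smallest) level down."""
    rng = random.Random(seed)
    remaining = list(elements)
    rng.shuffle(remaining)
    levels = []
    while remaining:
        half = len(remaining) // 2          # elements reserved for the levels above
        levels.append(remaining[half:])     # this level takes roughly half of what is left
        remaining = remaining[:half]
    levels.reverse()
    return levels


def link_to_level_above(levels, dist, num_parents=2):
    """Connect each element to its nearest elements one level up (brute force, for illustration)."""
    parents, children = {}, {}
    for upper, lower in zip(levels, levels[1:]):
        for v in lower:
            nearest = sorted(upper, key=lambda u: dist(u, v))[:num_parents]
            parents[v] = nearest
            for u in nearest:               # v has chosen u as a near neighbor
                children.setdefault(u, []).append(v)
    return parents, children


points = [(float(i),) for i in range(15)]
levels = assign_levels(points)
parents, children = link_to_level_above(levels, lambda p, q: abs(p[0] - q[0]))

print([len(level) for level in levels])
```

Each element appears in exactly one level, matching the statement that every data element has a unique location in the structure; the children map records, for each element, the items from the level below that chose it as a near neighbor.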
E. Review of Clustering Techniques
The term clustering refers to any grouping of unlabeled data according to similarity criteria. Traditional clustering methods can generally be classified as being either partitional or hierarchical. Hierarchical techniques produce a tree structure indicating inclusion relationships among groups of data (clusters), with the root of the tree corresponding to the entire data set. Partitional techniques typically rely on the global minimization of classification error in distributing data points among a fixed number of disjoint clusters. In their recent survey, Jain, Murty and Flynn (A. K. Jain, M. N. Murty and P. J. Flynn, Data clustering: a review, ACM Computing Surveys 31, 3, 1999, pp. 264-323.) argue that partitional clustering schemes tend to be less expensive than hierarchical ones, but are also considerably less flexible. Despite being simple, fast (linear observed time complexity), and easy to implement, even the well-known partitional algorithm K-means and its variants generally do not perform well on large data sets. Partitional algorithms favor the generation of isotropic (rounded) clusters, but are not well-suited for finding irregularly-shaped ones.
F. Hierarchical Agglomerative Clustering
In hierarchical agglomerative clustering, each data point is initially considered to constitute a separate cluster. Pairs of clusters are then successively merged until all data points lie in a single cluster. The larger cluster produced at each step contains the elements of both merged subclusters; it is this inclusion relationship that gives rise to the cluster hierarchy. The choice of which pair to merge at each step is made so as to minimize some inter-cluster distance criterion.
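The merging procedure can be sketched in a few lines. The example below is a naive illustration, assuming the single-link criterion (the smallest distance between any two members) as the inter-cluster distance; the text above leaves the criterion open, and the function names single_link and agglomerate are hypothetical. The recorded sequence of merges is the cluster hierarchy.

```python
def single_link(a, b, dist):
    """Inter-cluster distance: smallest distance between a member of a and a member of b."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, dist, linkage=single_link):
    """Merge the closest pair of clusters repeatedly; return the list of merges performed."""
    clusters = [(p,) for p in points]       # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        merged = clusters[i] + clusters[j]  # the new cluster includes both subclusters
        merges.append((clusters[i], clusters[j], merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


points = [0.0, 1.0, 5.0, 6.5, 20.0]
merges = agglomerate(points, lambda p, q: abs(p - q))

for left, right, merged in merges:
    print(left, "+", right, "->", merged)
```

This naive version recomputes every inter-cluster distance at every step, so it is practical only for very small inputs; it is meant to show the inclusion relationship, not an efficient implementation.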
G. Shared-Neighbor Methods
One of the criticisms of simple distance-based agglomerative clustering methods is that they are biased towards forming clusters in regions of higher density. Well-associated groups of data in regions of low density risk not being discovered at all, if too many pairwise distances fall below the merge threshold. More sophisticated (and expensive) distance measures for agglomerative clustering have been proposed that take into account the neighborhoods of the data elements. Jarvis et al. (R. A. Jarvis and E. A. Patrick, Clustering using a similarity measure based on shared nearest neighbors, IEEE Transactions on Computers C-22, 11, November 1973, pp. 1025-1034.) defined a merge criterion in terms of an arbitrary similarity measure dist and fixed integer parameters k>r>0, in which two data elements find themselves in the same cluster if they share at least a certain number of nearest neighbors. The decision as to whether to merge clusters thus does not depend on the local density of the data set, but rather on whether there exists a pair of elements, one drawn from each cluster, that share a neighborhood in a substantial way.
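A shared-neighbor merge criterion of this kind can be sketched as follows. This is a simplified illustration rather than a faithful reproduction of the published method: it assumes that r is the minimum number of shared neighbors required, that each element's neighborhood consists of its k nearest neighbors excluding itself, and that no further condition is imposed on the pair. The function names k_nearest and shared_neighbor_clusters are hypothetical.

```python
def k_nearest(points, i, k, dist):
    """Indices of the k elements nearest to points[i], excluding i itself (an assumption)."""
    others = [j for j in range(len(points)) if j != i]
    return set(sorted(others, key=lambda j: dist(points[i], points[j]))[:k])


def shared_neighbor_clusters(points, k, r, dist):
    """Group elements so that any two sharing at least r of their k nearest neighbors are merged."""
    n = len(points)
    neighbors = [k_nearest(points, i, k, dist) for i in range(n)]
    parent = list(range(n))                 # union-find forest over element indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if len(neighbors[i] & neighbors[j]) >= r:
                parent[find(i)] = find(j)   # merge the clusters containing i and j

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())


dense = [0.0, 1.0, 2.0, 3.0]
sparse = [100.0, 110.0, 120.0, 130.0]
clusters = shared_neighbor_clusters(dense + sparse, k=3, r=2, dist=lambda p, q: abs(p - q))

print(clusters)
```

The sparse group is spaced ten times more widely than the dense one, yet both are recovered with the same parameters, because the criterion counts shared neighbors instead of comparing distances against a threshold.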
Jarvis and Patrick's method (op. cit.) is agglomerative, and resembles the single-link method in that it tends to produce irregular clusters via chains of association. More recent variants have been proposed in an attempt to vary the qualities of the clusters produced: for example, by Guha et al. (S. Guha, R. Rastogi and K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25, 5, 2000, pp. 345-366.); by Ertoz et al. (Levent Ertoz, Michael Steinbach and Vipin Kumar, Finding topics in collections of documents: a shared nearest neighbor approach, University of Minnesota Army HPC Research Center Preprint 2001-040, 8 pages, 2001.); by Ertoz et al. (Levent Ertoz, Michael Steinbach and Vipin Kumar, A new shared nearest neighbor clustering algorithm and its applications, Proc. Workshop on Clustering High Dimensional Data and its Applications (in conjunction with 2nd SIAM International Conference on Data Mining), Arlington, Va., USA, 2002, pp. 105-115.); by Daylight Chemical Information Systems Inc. (http://www.daylight.com/); and by Barnard Chemical Information Ltd. (http://www.bci.gb.com/). Nonetheless, all variants still exhibit the main characteristics of agglomerative algorithms, in that they allow the formation of large irregularly-shaped clusters with chains of association bridging poorly-associated elements.
H. Review of Methods for Dimension Reduction
Latent semantic indexing (LSI) is a vector space model-based algorithm for reducing the dimension of the document ranking problem; see Deerwester et al. (Scott Deerwester, Susan T. Dumais, George W. Furnas, Richard Harshman, Thomas K. Landauer, Karen E. Lochbaum, Lynn A. Streeter, Computer information retrieval using latent semantic analysis, U.S. Pat. No. 4,839,853, filed Sep. 15, 1988, issued Jun. 13, 1989; Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41, 6, 1990, pp. 391-407.). LSI reduces the retrieval and ranking problem to one of significantly lower dimension so that retrieval from very large databases can be performed more efficiently. Another dimension-reduction method, called COV and due to Kobayashi et al. (Mei Kobayashi, Loic Malassis, Hikaru Samukawa, Retrieval and ranking of documents from a database, IBM Japan, docket No. JP9-2000-0075, filed Jun. 12, 2000; Loic Malassis, Mei Kobayashi, Statistical methods for search engines, IBM Tokyo Research Laboratory Research Report RT-413, 33 pages, May 2, 2001.), uses the covariance matrix of the document vectors to determine an appropriate reduced-dimensional space into which to project the document vectors. LSI and COV are comparable methods for information retrieval; for some databases and some queries, LSI leads to slightly better results than COV, while for others, COV leads to slightly better results.
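The idea of projecting document vectors into a space derived from their covariance matrix can be illustrated with a one-dimensional toy example. The sketch below is not the COV method as specified in the cited references: it assumes that the reduced space is spanned by the leading eigenvector of the covariance matrix, that this eigenvector is found by power iteration, and that vectors are centered on their mean before projection. All function names are hypothetical.

```python
import math


def covariance_matrix(vectors):
    """Return the mean vector and the covariance matrix of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    return mean, cov


def leading_direction(matrix, iterations=200):
    """Approximate the eigenvector of the largest eigenvalue by power iteration."""
    d = len(matrix)
    x = [1.0 / (i + 1) for i in range(d)]   # arbitrary nonzero starting vector
    for _ in range(iterations):
        y = [sum(matrix[a][b] * x[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(t * t for t in y))
        if norm == 0.0:                     # zero matrix: no direction to find
            break
        x = [t / norm for t in y]
    return x


def project(vectors, mean, direction):
    """Coordinate of each centered vector along the given direction."""
    return [sum((v[j] - mean[j]) * direction[j] for j in range(len(mean))) for v in vectors]


# Four toy document vectors over three attributes; all variation lies along (1, 1, 0).
documents = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0], [4.0, 4.0, 0.0]]
mean, cov = covariance_matrix(documents)
direction = leading_direction(cov)
coordinates = project(documents, mean, direction)

print(direction)
print(coordinates)
```

Here three attributes are reduced to a single coordinate per document without losing the ordering of the documents; a practical reduction would keep several directions and would typically rely on a numerical library for the eigendecomposition.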