While there are many prior proposals for indexing high-dimensional data (for a survey, see H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, 2006), they have all been shown to suffer from the so-called dimensionality curse (see R. Bellman: Adaptive Control Processes: A Guided Tour. Princeton Univ. Press (1961)), which means that when the dimensionality of the data goes beyond a certain limit, all search and indexing methods aiming at exact answers to the problem have been shown to perform slower than a sequential scan of the data signature collection. As a result, none of these approaches has been shown to be applicable to large data signature sets.
One paradigm for attacking the dimensionality curse is to project high-dimensional data signatures onto random lines, an approach introduced by Kleinberg (see Two Algorithms for Nearest-Neighbor Search in High Dimensions, Jon M. Kleinberg, 1997) and subsequently used in many other high-dimensional indexing techniques. Such projections have two main benefits. First, in some cases, they can alleviate data distribution problems. Second, they allow for a clever dimensionality reduction, by projecting onto fewer lines than there are dimensions in the data.
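The projection step described above can be illustrated with a minimal sketch. It assumes that a random line is represented by a unit vector obtained by normalizing a Gaussian sample, and that a signature is a plain list of floats; the names `random_unit_vector` and `project`, as well as the sizes used, are hypothetical and are not taken from any of the cited works.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector (an assumption)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(signature, lines):
    """Return the scalar projection of the signature onto each line."""
    return [sum(s * l for s, l in zip(signature, line)) for line in lines]

rng = random.Random(0)
dim, num_lines = 128, 8          # fewer lines than dimensions
lines = [random_unit_vector(dim, rng) for _ in range(num_lines)]
signature = [rng.random() for _ in range(dim)]

# The 128-dimensional signature is reduced to 8 projected values.
reduced = project(signature, lines)
```

Because `num_lines` is smaller than `dim`, the projected representation is shorter than the original signature, which is the dimensionality reduction referred to above.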
Fagin et al. presented, in their paper “Efficient similarity search and classification via rank aggregation” (Proceedings of the ACM SIGMOD, San Diego, Calif., 2003), an algorithm called (O)MEDRANK, which projects the data signatures onto a single random line per index and stores the identifiers organized in a B+-tree on a data store. This algorithm is described in the US 20040249831 patent application.
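The following sketch illustrates only the storage idea of one random line per index with identifiers ordered by projected value. It is not the (O)MEDRANK algorithm itself: the rank aggregation step is omitted, and a sorted in-memory list stands in for the B+-tree. The class name `LineIndex` and its methods are hypothetical.

```python
import bisect
import math
import random

class LineIndex:
    """One random line; entries are kept sorted by projected value.

    Assumption: a sorted Python list replaces the B+-tree on the data store.
    """

    def __init__(self, dim, rng):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        self.line = [x / norm for x in v]
        self.entries = []  # sorted list of (projection, identifier)

    def _project(self, signature):
        return sum(s * l for s, l in zip(signature, self.line))

    def insert(self, identifier, signature):
        bisect.insort(self.entries, (self._project(signature), identifier))

    def nearest_on_line(self, query, k):
        """Return up to k identifiers closest to the query along the line."""
        q = self._project(query)
        hi = bisect.bisect_left(self.entries, (q,))
        lo = hi - 1
        result = []
        # Two cursors walk outward from the query position.
        while len(result) < k and (lo >= 0 or hi < len(self.entries)):
            take_lo = hi >= len(self.entries) or (
                lo >= 0 and q - self.entries[lo][0] <= self.entries[hi][0] - q
            )
            if take_lo:
                result.append(self.entries[lo][1])
                lo -= 1
            else:
                result.append(self.entries[hi][1])
                hi += 1
        return result

index = LineIndex(dim=1, rng=random.Random(1))
for identifier, value in enumerate([0.0, 1.0, 2.0, 3.0, 10.0]):
    index.insert(identifier, [value])
closest = index.nearest_on_line([2.1], k=2)
```

Closeness along a single line is only a proxy for closeness in the original space, which is why such schemes combine the results of several independent line indexes.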
Since the OMEDRANK algorithm needs B+-trees for its query retrieval, Lejsek et al., in their paper “A case-study of scoring schemes for the PvS-index” (Proceedings of CVDB, Baltimore, Md., 2005), proposed an enhanced version of the OMEDRANK algorithm called the PvS-index, which redundantly saves these B+-trees to disk for fast lookup. The PvS-index suffers, however, from its static nature: it does not support updates once nodes need to be split. Further drawbacks are the limited number of random lines (one line per hierarchy), the inefficient use of disk storage caused by maintaining multiple B+-trees, and its tight coupling to the OMEDRANK algorithm and the Euclidean distance.
Another strategy in high-dimensional indexing follows the idea of Locality Sensitive Hashing (LSH), published by Indyk et al. in “Similarity search in high dimensions via hashing”, in the Proceedings of VLDB, Edinburgh, 1999 and “Locality-sensitive hashing using stable distributions”, MIT Press, 2006. LSH is not based on a sorted tree structure, but on hashing the data signatures into buckets. The hash function is constructed by projecting each data signature onto a small set of random lines with fixed cardinality. Each of the projections is categorized into buckets and each of these buckets is assigned an identifier. By concatenating the identifiers of all the projections, a hash value is constructed, and all data signatures resulting in the same hash value are stored together on the data store.
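The hash construction described above can be sketched as follows. The sketch assumes that each projection is categorized into fixed-width intervals along its line and that the interval number serves as the bucket identifier; the interval width, the class name `LSHTable` and its methods are hypothetical choices for illustration, and an in-memory dictionary stands in for the data store.

```python
import math
import random
from collections import defaultdict

class LSHTable:
    """Hash table keyed by concatenated per-line bucket identifiers."""

    def __init__(self, dim, num_lines, bucket_width, rng):
        # A small, fixed set of random lines (Gaussian directions, an assumption).
        self.lines = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_lines)
        ]
        self.bucket_width = bucket_width
        self.buckets = defaultdict(list)  # stand-in for the data store

    def hash_value(self, signature):
        """Concatenate the bucket identifier of every projection into one key."""
        ids = []
        for line in self.lines:
            projection = sum(s * l for s, l in zip(signature, line))
            ids.append(math.floor(projection / self.bucket_width))
        return tuple(ids)

    def insert(self, identifier, signature):
        self.buckets[self.hash_value(signature)].append(identifier)

    def candidates(self, query):
        """Return the identifiers stored under the query's hash value."""
        return list(self.buckets.get(self.hash_value(query), []))

rng = random.Random(42)
table = LSHTable(dim=16, num_lines=4, bucket_width=4.0, rng=rng)
signature = [rng.random() for _ in range(16)]
table.insert(7, signature)
found = table.candidates(signature)
```

Signatures that fall into the same interval on every line share a key and are therefore stored together; a query only inspects the single bucket selected by its own hash value.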
Joly et al. presented, in “Content-Based Copy Detection using Distortion-Based Probabilistic Similarity Search” (IEEE Transactions on Multimedia, 2007), a video-retrieval system based on Hilbert space-filling curves for fast high-dimensional retrieval. This method has, however, been tuned especially for this specific and rather low-dimensional application, and it still needs a sequential scan at the end of the query processing.