The present invention generally relates to information query methodologies and, more particularly, to methods and apparatus for performing fast query approximation using adaptive query vector projection.
The power and appeal of today's database systems and digital libraries lie in their support for complex material such as images, video, audio and time-series data. Multimedia databases are gaining new functionalities that allow searching by the audio, visual and textual features of the content. For example, content-based image query methods use feature vectors to search for images, where the proximity of the feature vectors determines the image similarity. In general, content-based query methods do not scale well to large databases, since the high dimensionality of the vectors makes them incompatible with traditional multi-dimensional indexing methods.
There are many approaches for indexing multi-dimensional data in databases. Multi-dimensional index structures such as R-trees, SS-trees and SR-trees are well suited for vectors of a few dimensions but become extremely inefficient for more than ten dimensions. Since content-based query methods often utilize vectors with greater than 256 dimensions, other approaches such as dimensionality reduction and pre-filtering are needed.
Dimensionality reduction techniques such as singular value decomposition (SVD) and the discrete cosine transform (DCT) can be used to compact high-dimensional vectors into a few dimensions. SVD generates the optimal linear transformation that compacts the most energy into the fewest dimensions. However, SVD must be trained on the target data, which causes problems for dynamic databases. Furthermore, there is no guarantee that the query vectors will be compacted well by the SVD transform derived from training on the target data, and SVD provides no worst-case performance bound. Certain pathological queries that do not match the training data can be poorly represented using the SVD approach. The DCT, on the other hand, is independent of the data and needs no training. However, since it provides only a fixed set of basis functions, the DCT may not work well for all data sets. Other transformations such as the discrete Fourier transform (DFT) and the Haar transform provide different fixed sets of basis functions, but generally have the same limitations as the DCT.
In addition to speeding up querying by using fewer dimensions, dimensionality reduction techniques allow the data to be indexed by multi-dimensional index structures. The integration of dimensionality reduction and multi-dimensional indexing has been investigated in several algorithms and systems, such as FastMap, RCSVD, SVDD, and QBIC. The FastMap algorithm compacts time-series data into two-dimensional streams that are indexed by an R-tree. RCSVD integrates recursive clustering with SVD in order to better compact multi-dimensional vectors, and indexes the compacted vectors using an R-tree.
The IBM (of Armonk, N.Y.) Query by Image Content (QBIC) system uses dimensionality reduction in a staged query system in order to index color images using high-dimensional color histograms. In the first stage, the images are indexed using only three dimensions, which are derived from the mean image color in the three color channels. This stage pre-filters the next stage, in which the actual histogram distances are computed for the smaller set of surviving items.
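By way of illustration only, the staged pre-filtering idea described above can be sketched as follows. This is a minimal sketch under stated assumptions and is not the QBIC implementation: the function names, the 4-bins-per-channel histogram, the Euclidean distance used in both stages, and the `keep` and `top` parameters are all hypothetical choices made for the example. A real system would precompute and index the mean colors rather than recompute them per query.

```python
import math

def mean_color(pixels):
    """Average (R, G, B) of an image given as a list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def histogram(pixels, bins_per_channel=4):
    """Normalized joint color histogram; channel values assumed in 0..255."""
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    for r, g, bl in pixels:
        idx = (r * b // 256) * b * b + (g * b // 256) * b + (bl * b // 256)
        hist[idx] += 1.0 / len(pixels)
    return hist

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def staged_query(query_pixels, database, keep=2, top=1):
    """database maps a name to a pixel list; returns the `top` best names."""
    q_mean = mean_color(query_pixels)
    q_hist = histogram(query_pixels)
    # Stage 1: cheap three-dimensional filter on mean color (assumed metric).
    survivors = sorted(
        database, key=lambda name: euclidean(q_mean, mean_color(database[name]))
    )[:keep]
    # Stage 2: full histogram distance, computed only for the survivors.
    ranked = sorted(
        survivors, key=lambda name: euclidean(q_hist, histogram(database[name]))
    )
    return ranked[:top]

# Hypothetical toy data: three flat-colored four-pixel "images".
db = {
    "red": [(250, 10, 10)] * 4,
    "dark_red": [(120, 20, 20)] * 4,
    "blue": [(10, 10, 250)] * 4,
}
query = [(240, 15, 15)] * 4
print(staged_query(query, db))
```

In this sketch the first stage discards the blue image using three numbers per image, so the 64-bin histogram comparison runs only on the two remaining candidates.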
The present invention provides a new query-centric approach that addresses the shortcomings of the existing dimensionality reduction methods. A key feature of the methodology of the invention is that the system stores a redundant projection library of transformation building blocks for the target vectors and then rapidly searches for the optimal transformation at query-time. The projection library provides an extremely large number of different ways for compacting the vectors. For example, for 128-dimensional vectors, the projection library of 448 projection elements provides O(10^16) unique complete transformations and O(10^33) unique incomplete projections. The invention involves the use of a fast strategy for searching the projection library at query-time to select the best set of projection elements for computing the query.
Rather than training on the target data to select one particular transformation, such as in SVD, the present invention selects the best transformation, on-the-fly, by analyzing the query vector. In short, the method selects the elements that correspond to the most significant space and frequency dimensions of each specific query vector. This eliminates the potential for mismatch between the query vector and the transformation, as happens with SVD, DCT, DFT, and Haar transform. This results in greater compaction of the query vector, improved efficiency in processing each query, and speed-up in query response time.
In order to achieve the query efficiency, the system stores redundant data in the projection library. This may be done by compressing and efficiently storing the projection elements. The approach has many advantages. First, for example, the projections of the query vector can be either complete or incomplete. If complete, the system computes the query with perfect precision. If incomplete, the system approximates the query. Since the system can select incomplete projections to compact the energy of the query vectors into any number of dimensions, the method allows a flexible query-time tradeoff between query precision and query response time.
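By way of illustration only, the distinction between complete and incomplete projections can be sketched as follows. This is a minimal sketch under stated assumptions and is not the projection library of the invention: it substitutes a single orthonormal Haar basis for the redundant library, selects elements by the magnitude of the query coefficients, and uses Euclidean distance. The function names and sample vectors are hypothetical. Because the basis is orthonormal, the distance computed over all coefficients equals the original distance, and the distance computed over a subset can only underestimate it.

```python
import math

def haar(v):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    n = len(v)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    out = [float(x) for x in v]
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + dif  # averages first, then details, for this level
        n = half
    return out

def select_elements(query_coeffs, k):
    """Indices of the k largest-magnitude query coefficients (assumed rule)."""
    order = sorted(
        range(len(query_coeffs)), key=lambda i: abs(query_coeffs[i]), reverse=True
    )
    return order[:k]

def approximate_distance(query, target, k):
    """Distance over the k selected elements; k == len(query) is complete."""
    qc = haar(query)
    tc = haar(target)  # a real system would store these coefficients in advance
    idx = select_elements(qc, k)
    return math.sqrt(sum((qc[i] - tc[i]) ** 2 for i in idx))

# Hypothetical sample vectors.
query = [4, 4, 4, 4, 1, 1, 1, 1]
target = [5, 5, 4, 6, 1, 1, 2, 0]
for k in (2, 4, 8):
    print(k, round(approximate_distance(query, target, k), 3))
```

Increasing `k` in this sketch moves the computed distance monotonically toward the exact value, which is one way to picture the query-time tradeoff between precision and response time.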
The present invention is suitable for scalable content-based query systems since it works by adaptively selecting and utilizing the most significant space and frequency dimensions of each specific query vector in order to efficiently process each query.
More specifically, the present invention is directed towards apparatus and methods for querying by similarity in databases of high-dimensional vectors. The invention comprises projecting the query vector into a redundant library and selecting the best set of projection elements for processing the query. The invention also comprises a mathematical decomposition of the vectors into the library that guarantees that any complete set of projection elements represents the vectors perfectly and can be used to compute the vector proximities with perfect precision. Furthermore, the invention also provides for small and non-redundant sets of projection elements to form approximations of the vectors and the vector proximities. The invention provides for selecting non-redundant sets of projection elements to efficiently compute the similarity queries while admitting small losses in precision.
The invention provides for determining the query efficiency at query time by selecting the transformation of the query vector using a large number of alternative projections, on the following bases: (i) efficiency: the size of the selected element set, or equivalently, the dimensionality of the transformed query vector; (ii) query vector precision: the approximation of the reconstructed projected query vector to the actual query vector; and (iii) query results precision: the bound on the error in the computed similarity between the query and the target vectors.
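By way of illustration only, the three bases listed above can be evaluated for a candidate element set as sketched below. This is a minimal sketch under stated assumptions and does not reproduce the measures of the invention: it assumes that the query and target are already expressed as coefficients in one common orthonormal basis, that the stored energy of each target vector is available, that query vector precision is measured as the fraction of query energy retained, and that similarity is Euclidean distance. The function name and the returned keys are hypothetical. Under these assumptions the exact distance lies between the distance over the selected elements and an upper bound obtained from the triangle inequality on the unselected remainder.

```python
import math

def evaluate_selection(q, t, idx):
    """q, t: coefficient lists in one orthonormal basis; idx: selected positions."""
    sel = set(idx)
    q_energy = sum(c * c for c in q)
    t_energy = sum(c * c for c in t)
    q_kept = sum(q[i] ** 2 for i in sel)
    t_kept = sum(t[i] ** 2 for i in sel)
    # Distance computed only over the selected elements (a lower bound).
    d_sel = math.sqrt(sum((q[i] - t[i]) ** 2 for i in sel))
    # Norms of the unselected remainders, derived from the stored energies.
    q_rest = math.sqrt(max(q_energy - q_kept, 0.0))
    t_rest = math.sqrt(max(t_energy - t_kept, 0.0))
    # Triangle inequality on the remainder gives an upper bound.
    d_upper = math.sqrt(d_sel ** 2 + (q_rest + t_rest) ** 2)
    return {
        "size": len(sel),                                            # basis (i)
        "query_precision": q_kept / q_energy if q_energy else 1.0,   # basis (ii)
        "distance_lower": d_sel,                                     # basis (iii)
        "distance_upper": d_upper,
    }

# Hypothetical coefficients: the query is captured fully by two elements.
q = [3.0, 4.0, 0.0, 0.0]
t = [3.0, 0.0, 0.0, 4.0]
print(evaluate_selection(q, t, [0, 1]))
```

A selection with a smaller `size` is cheaper to evaluate, while a tighter gap between `distance_lower` and `distance_upper` indicates higher precision in the query results.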
The invention provides a framework for greatly improving query efficiency and allowing the user to trade off query response time against query precision.