1. Field of the Invention
The present invention relates to methods and systems for indexing objects in high dimensional data spaces to respond to user queries.
2. Description of the Related Art
Nearest neighbor searching on high dimensional data spaces is essentially a method of searching for objects in a data space that are similar to a user-selected object, with the user-selected object defining a query. For example, using the present assignee's QBIC system, a user can select a digital image and use the image as a query to a database for images that are similar to the user-selected digital image. In response to the query, the “k” closest images are returned, where “k” is an integer defined by the user or search engine designer. These “k” images are referred to as the “k” nearest neighbors to the image that was used as the query, and for indexing and search purposes they are typically considered to be multidimensional data points “p” that are close to a multidimensional data point “q” representing the query. Other non-limiting examples of applications that use nearest neighbor searching include video databases, data mining, pattern classification, and machine learning.
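By way of non-limiting illustration, the “k” nearest neighbor query described above can be sketched as a linear scan over all data points. The function name `k_nearest_neighbors` and the use of Euclidean distance are assumptions made for this example only; the text above does not prescribe a particular distance measure.

```python
import heapq
import math


def k_nearest_neighbors(points, q, k):
    """Return the k data points "p" closest to the query point "q".

    Assumes Euclidean distance and examines every point (no index).
    """
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


# Hypothetical two-dimensional data, for illustration only.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
neighbors = k_nearest_neighbors(points, q=(0.9, 0.9), k=2)
```

Because every point is examined, the cost of such a scan grows with both the number of objects and their dimensionality.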
Multidimensional indexing methods (“MIMs”) have been introduced for indexing multidimensional objects by partitioning the data space, clustering data according to the partitioning, and using the partitions to prune the search space to promote fast query execution. It will readily be appreciated that in the context of large databases that hold a large number of objects, the time to execute a query like the one discussed above would be excessive in the absence of MIMs. As recognized by the present invention, while effective for low dimensionalities, MIMs are not effective and indeed tend toward being counterproductive for objects having high dimensionalities, e.g., of ten, twenty or more. Image objects, for example, can have hundreds of dimensions, and text documents can have thousands of dimensions.
Weber et al. disclose a filtering method (the “VA file” method) intended to be an improvement over conventional MIMs in “A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces”, Proc. of the 24th Int'l Conf. on VLDB, 1998. In the VA file method, compact approximations of data objects (also referred to as “vectors”) are generated, and by first scanning the compact approximations, a large number of the larger actual vectors can be filtered out such that only a small number of vectors need be examined. In this way, query execution time is reduced.
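The two-phase filtering idea described above can be sketched as follows. This is a simplified illustration and not the algorithm as published by Weber et al.: it assumes coordinates in the range [0, 1), a uniform grid with a hypothetical `BITS` bits per dimension, and Euclidean distance, and all function names are invented for this example.

```python
import heapq
import math

BITS = 2            # assumed number of bits per dimension
CELLS = 1 << BITS   # number of grid cells per dimension over [0, 1)


def approximate(vec):
    """Compact approximation: the grid cell index of each coordinate."""
    return tuple(min(int(x * CELLS), CELLS - 1) for x in vec)


def lower_bound(approx, q):
    """Smallest possible distance from q to any vector in the given cell."""
    total = 0.0
    for cell, qj in zip(approx, q):
        lo, hi = cell / CELLS, (cell + 1) / CELLS
        gap = max(lo - qj, 0.0, qj - hi)
        total += gap * gap
    return math.sqrt(total)


def va_search(vectors, approximations, q, k):
    """Return (k nearest vectors, number of actual vectors examined)."""
    # Phase 1: scan only the compact approximations.
    bounds = [lower_bound(a, q) for a in approximations]
    order = sorted(range(len(vectors)), key=bounds.__getitem__)

    # Phase 2: visit actual vectors in order of increasing lower bound and
    # stop once no remaining vector can beat the current k-th best distance.
    best = []       # max-heap of (-distance, index)
    examined = 0
    for i in order:
        if len(best) == k and bounds[i] > -best[0][0]:
            break
        examined += 1
        d = math.dist(vectors[i], q)
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    nearest = [vectors[i] for _, i in sorted(best, reverse=True)]
    return nearest, examined


# Hypothetical data, for illustration only.
vectors = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (0.8, 0.1), (0.5, 0.5)]
approximations = [approximate(v) for v in vectors]
nearest, examined = va_search(vectors, approximations, q=(0.12, 0.12), k=1)
```

In this small example only the two vectors that share the query's grid cell are compared in full; the other three are filtered out using their approximations alone.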
The present invention has recognized, however, that the VA file method has at least two drawbacks. The first is that as the dimensionality of the data objects increases, the number of bits used in the approximations must also increase significantly to maintain adequate filtering. This means that the performance of the VA file method, like the performance of the above-mentioned MIMs, degrades significantly when applied to high dimensional data spaces (e.g., dimensionalities over 100). The second drawback of the VA file method is that its filtering capability decreases in the case of clustered data such as multimedia data. The present invention, having recognized the above-noted deficiencies in the prior art, has provided the improvements disclosed below.