It is often desirable to detect related instances in a database that correspond to related information. Instances of information are represented and stored in a database as a set of data values (feature vectors) in a multidimensional space. Each dimension of the multidimensional space is a feature that characterizes the objects represented by the data values. For example, consider the database of a credit card company containing customer information. Each customer is an object corresponding to an instance in the database, namely a customer profile, or data value, in the multidimensional feature space of the database. Each data value is an n-tuple of the instance's features, such as age, sex, and salary. The dimensions of the multidimensional feature space are the features that characterize the customer, namely age, sex, and salary, as well as other information.
The nearest-neighbor problem is the problem of performing a similarity search to find related ("similar") instances of information in the database. Data values in the database are deemed related if they lie a short distance from each other in the multidimensional feature space of the database. Specifically, the nearest-neighbor problem is that of finding the data value, or the set of k data values, in the database that is closest (most "similar"), in the sense of a distance metric, to a given target value.
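The problem statement above can be illustrated with a minimal brute-force sketch. It assumes numeric features and a Euclidean distance metric; the function names `euclidean` and `k_nearest` and the sample customer tuples are hypothetical and chosen only for illustration. A linear scan of this kind is not one of the indexing methods cited in this section.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(data, target, k=1, distance=euclidean):
    """Return the k data values closest to target, nearest first.

    This is a linear scan over every data value, so it examines the
    whole database on each query.
    """
    return heapq.nsmallest(k, data, key=lambda value: distance(value, target))


# Hypothetical customer profiles: (age, salary in thousands).
customers = [(25, 30.0), (47, 82.0), (31, 35.0), (52, 90.0)]
nearest = k_nearest(customers, (30, 33.0), k=2)
print(nearest)
```

In practice the features would usually be scaled to comparable ranges first, since otherwise the feature with the largest numeric range dominates the distance.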
The nearest-neighbor problem is an important problem with applications in data mining, collaborative filtering, and multimedia image database searching. See, for example, Roussopoulos N., Kelley S., and Vincent F., "Nearest neighbor Queries", Proceedings of the ACM-SIGMOD International Conference on Management of Data, pp. 71-79, 1995. Data mining involves extracting structured information from large amounts of raw data. For example, a database belonging to a credit card company may contain demographic information including age, sex, and salary. By finding entries in the database with similar demographics, it may be possible to deduce the risk associated with a particular debt. Collaborative filtering involves deducing the preferences of a given user by examining information about the preferences of other similar users. For example, suppose that a music Compact Disc (CD) distributor asks its customers to rank the CDs they like. If the distributor sells many different CDs, then customers are unlikely to spend the time to rank all of the CDs available. For the purposes of marketing and sales promotions, it may be desirable to predict the rank a particular customer would give to a particular CD that the customer has not previously ranked. Such a prediction may be produced, for example, by averaging the rankings given to that particular CD by other customers whose characteristics are similar to those of the particular customer.
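The CD ranking prediction described above can be sketched as follows. This is only an illustrative sketch, assuming that customer characteristics are numeric tuples compared by Euclidean distance and that the prediction is a plain average over the k most similar customers who ranked the CD. The name `predict_rank`, the dictionary layout, and the sample data are all hypothetical.

```python
import math


def predict_rank(profiles, ranks, customer, cd, k=2):
    """Predict the rank `customer` would give `cd`.

    profiles: dict mapping customer id -> numeric feature tuple
    ranks:    dict mapping customer id -> dict of {cd id: rank}
    Returns the average rank given to `cd` by the k customers most
    similar to `customer`, or None if nobody else ranked the CD.
    """
    target = profiles[customer]
    # Only other customers who actually ranked this CD can contribute.
    candidates = [c for c in profiles
                  if c != customer and cd in ranks.get(c, {})]
    if not candidates:
        return None
    candidates.sort(key=lambda c: math.dist(profiles[c], target))
    neighbors = candidates[:k]
    return sum(ranks[c][cd] for c in neighbors) / len(neighbors)


# Hypothetical profiles: (age, salary in thousands).
profiles = {"ann": (25, 30.0), "bob": (27, 32.0),
            "cat": (60, 95.0), "dan": (24, 29.0)}
ranks = {"bob": {"cd1": 4}, "cat": {"cd1": 1}, "dan": {"cd1": 5}}
print(predict_rank(profiles, ranks, "ann", "cd1", k=2))
```

Here the two customers nearest to "ann" are "dan" and "bob", so the dissimilar customer "cat" does not influence the prediction.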
Another example of an application of the nearest-neighbor problem is a text similarity search. For a given target web page, it may be desirable to find all other pages that are similar to it. Suppose that web pages are described in terms of keywords. In this case, a feature corresponds to a keyword. Finding the nearest neighbor can be achieved by identifying keywords in the target and then finding neighboring web pages based on the identified keywords.
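A keyword-based page search of this kind might look like the sketch below. The text does not say how keyword sets are compared, so the Jaccard coefficient (size of the intersection divided by size of the union) is an assumption made here for illustration. The names `jaccard` and `similar_pages` and the sample pages are likewise hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pages(pages, target_keywords, k=1):
    """Return the names of the k pages most similar to the target.

    pages: dict mapping page name -> set of keywords
    """
    scored = sorted(pages.items(),
                    key=lambda item: jaccard(item[1], target_keywords),
                    reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical pages, each described by a set of keywords.
pages = {
    "a.html": {"jazz", "cd", "review"},
    "b.html": {"mortgage", "rates"},
    "c.html": {"jazz", "concert"},
}
target = {"jazz", "cd", "store"}
print(similar_pages(pages, target, k=2))
```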
Similarity searching may also be an important tool for image database applications. It may be useful to allow a user to search an image database for similar images. Features of an image may represent information related to color histograms, texture, luminance, chrominance, or other characteristics of an image. Images may be deemed similar based on the likeness of their image features.
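As one possible illustration of comparing images by color histogram, the sketch below uses histogram intersection on normalized histograms. The choice of this measure is an assumption, not something stated in the text. The three-bin histograms and the names `normalize` and `histogram_intersection` are hypothetical.

```python
def normalize(hist):
    """Scale bin counts so they sum to 1 (all zeros if the histogram is empty)."""
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the overlap of two normalized histograms."""
    return sum(min(a, b) for a, b in zip(normalize(h1), normalize(h2)))


# Hypothetical 3-bin color histograms (pixel counts for red, green, blue).
query = [8, 1, 1]
sunset = [7, 2, 1]
forest = [1, 8, 1]

# The image whose histogram overlaps more with the query is deemed more similar.
print(histogram_intersection(query, sunset) > histogram_intersection(query, forest))
```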
Various methods for finding nearest neighbors have been proposed. See, for example, Roussopoulos N., Kelley S., and Vincent F., "Nearest neighbor Queries", Proceedings of the ACM-SIGMOD International Conference on Management of Data, pp. 71-79, 1995; Berchtold S., Keim D., and Kriegel H. P., "The X-Tree: An Index Structure for High Dimensional Data", Proceedings of the 22nd International Conference on Very Large Databases, pp. 28-39, 1996; and Berchtold S., Ertl B., Keim D. A., Kriegel H. P., and Seidel T., "Fast Nearest Neighbor Search in High-dimensional space", Proceedings of the International Conference on Data Engineering, pp. 209-218, February, 1998. These previously proposed methods, however, do not offer the flexibility of performing similarity searches based on a user-specified subset of the attributes that characterize the objects represented by the data values in the database. Further, these previously proposed methods do not offer the flexibility of identifying the entry in a database that is most similar to a multiplicity of target entries. Moreover, some of the previously proposed methods do not allow a user to search for more than one entry in the database that is similar to a given target entry.