1. Field
Embodiments of the present invention relate to identifying semantic nearest neighbors in a feature space.
2. Related Art
With the growth of the Internet, data such as images, documents, music, and videos have become abundant. As the size of the data continues to grow, the density of similar objects in the data space also increases. These objects are likely to have similar semantics. As a result, inferences based on an object's nearest neighbors can be more reliable than before.
Traditional methods for searching nearest neighbors in sub-linear time, such as the KD-tree, work well on data with limited feature dimensionality, but their search time becomes linear as dimensionality grows. Recently, Locality Sensitive Hashing (LSH) has been successfully applied to datasets with high-dimensional features. LSH uses random projections to map objects from feature space to bits, and treats those bits as keys for multiple hash tables. As a result, similar samples collide in at least one hash bucket with high probability. This randomized LSH algorithm has a tight asymptotic bound, and provides the foundation for a number of algorithmic extensions.
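The random-projection scheme described above can be illustrated with a minimal sketch. It assumes sign-of-projection bits (random hyperplanes), which is only one of several LSH families. The class name `RandomProjectionLSH` and the parameters `n_bits`, `n_tables`, and `seed` are hypothetical names chosen for illustration; they do not come from any particular library or from the original LSH work.

```python
import numpy as np
from collections import defaultdict


class RandomProjectionLSH:
    """Illustrative LSH index: sign bits of random projections as hash keys."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        # One independent set of random hyperplanes per hash table,
        # stored as an array of shape (n_tables, n_bits, dim).
        self.planes = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, x):
        # Each projection contributes one bit: which side of the hyperplane x is on.
        bits = (self.planes @ x) > 0  # shape (n_tables, n_bits)
        return [tuple(row.tolist()) for row in bits]

    def add(self, item_id, x):
        # Store the item in one bucket of every table.
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(item_id)

    def candidates(self, x):
        # Union of the buckets x falls into; a collision in any table suffices.
        found = set()
        for table, key in zip(self.tables, self._keys(x)):
            found.update(table.get(key, []))
        return found


rng = np.random.default_rng(1)
data = rng.standard_normal((100, 16))
index = RandomProjectionLSH(dim=16)
for i, v in enumerate(data):
    index.add(i, v)

# A slightly perturbed copy of item 0 is likely, but not guaranteed,
# to share a bucket with item 0 in at least one table.
query = data[0] + 0.01 * rng.standard_normal(16)
print(sorted(index.candidates(query)))
```

Using several tables trades memory for recall: a near neighbor missed by one set of hyperplanes may still collide under another. The candidate set would normally be re-ranked by exact distance, a step omitted here.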
Parameter-sensitive hashing is one such extension. It chooses a set of weak binary classifiers to generate bits for hash keys. The classifiers are selected according to the criterion that nearby objects in a dataset are more likely to have the same class label than more distant objects. A major drawback of this type of approach is that it requires evaluation on object pairs, the number of which is quadratic in the number of objects. Hence, its scalability to larger datasets is limited.
Restricted Boltzmann machines (RBMs) have also been used to learn hash functions, and the learned hash codes have been shown to preserve semantic similarity in Hamming space. However, training an RBM is a computationally intensive process, which makes it very costly to retrain the hash function when the data evolves.
Spectral hashing takes a completely different approach to generating hash codes. Spectral hashing first rotates the feature space to statistically orthogonal axes using principal component analysis (PCA). Then, a special basis function is applied to carve each axis independently to generate hash bits. As a result, the bits in a hash code are independent, which leads to a compact representation with short code length. Experiments show that spectral hashing outperforms RBM. However, spectral hashing is developed on the assumption that objects are spread in a Euclidean space with a particular distribution, either uniform or Gaussian. This is seldom true of real-world datasets.
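The two steps described above (a PCA rotation followed by a per-axis basis function) can be sketched roughly as follows. This is a simplified illustration and not the full spectral hashing algorithm: it assumes one bit per principal axis and uses only the lowest-frequency sinusoid on each axis, whereas the published method selects among several frequencies across axes. The function names `fit_pca_sinusoid_hash` and `hash_codes` are hypothetical.

```python
import numpy as np


def fit_pca_sinusoid_hash(X, n_bits):
    """Learn a PCA rotation and per-axis ranges (assumes n_bits <= min(X.shape))."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_bits].T  # shape (dim, n_bits)
    Z = Xc @ W
    # Range of the training data along each principal axis.
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    return mean, W, lo, hi


def hash_codes(X, model):
    """Map rows of X to n_bits-long binary codes."""
    mean, W, lo, hi = model
    Z = (X - mean) @ W
    span = np.maximum(hi - lo, 1e-12)  # guard against a degenerate axis
    # Lowest-frequency sinusoid on [lo, hi]; its sign splits the axis at the midpoint.
    phase = np.pi / 2 + np.pi * (Z - lo) / span
    return (np.sin(phase) > 0).astype(np.uint8)


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
model = fit_pca_sinusoid_hash(X, n_bits=4)
codes = hash_codes(X, model)

# Hamming distance between the codes of the first two samples.
print(np.count_nonzero(codes[0] != codes[1]))
```

Because each axis range is split by a fixed function, the bits are balanced only if the data is spread evenly over that range, which reflects the distributional assumption noted above.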