Augmented reality is a significant application that leverages recent advances in computing devices, and more particularly mobile devices. The mobile devices can include clients such as mobile telephones (cell phones), personal digital assistants (PDAs), tablet computers, and the like. Such devices have limited memory, processing, communication, and power resources. Hence, augmented reality applications present a special challenge in mobile environments.
The devices can acquire images or videos using either a camera or a network. The images can be of real-world scenes, or synthetic data, such as computer graphic images or animation videos. Then, the devices can augment the experience for a user by overlaying useful information on the images or videos. The useful information can be in the form of metadata.
For example, the metadata can be information about a historical landmark, nutrition information about a food item, or information about a product identified by a (linear or matrix) bar code in an image.
To enable such applications, it is necessary to exploit recent advances in image recognition, while recognizing the limitations on the device resources. Thus, in a typical augmented reality application, the mobile device must efficiently transmit the salient features of a query image to a database at a server that stores a large number of images or videos. The database server should quickly determine whether the query image matches an entry in the database, and return suitable metadata to the mobile device.
Many image-based augmented reality applications use the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and GIST; see, e.g., U.S. 20110194737.
SIFT and SURF acquire local details in an image, and therefore have been used to match local features or patches. They can also be used for image matching and retrieval by combining hypotheses from several patches using, for example, the popular “Bag-of-Features” approach. GIST acquires global properties of the image and has been used for image matching. A GIST vector is an abstract representation of a scene that can activate memory representations of scene categories, e.g., buildings, landscapes, landmarks, etc.
SIFT has the best performance in the presence of common image deformations, such as translation, rotation, and a limited amount of scaling. Nominally, the SIFT feature vector for a single salient point in an image is a real-valued, unit-norm 128-dimensional vector. Transmitting such vectors from the client to a database server for the purpose of image matching demands a prohibitively large bit rate, especially if features from several salient points are needed for reliable matching.
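The scale of the bit-rate problem can be illustrated with simple arithmetic. The following is a minimal sketch, not a measurement: the function name `sift_payload_bits`, the 32-bit floating-point representation per component, and the count of 500 salient points are all assumptions chosen for illustration, and the overhead of keypoint locations is ignored.

```python
def sift_payload_bits(num_keypoints, dims=128, bits_per_component=32):
    """Bits needed to send raw, uncompressed descriptors.

    Assumes each of the `dims` components is sent as a float of
    `bits_per_component` bits; keypoint coordinates are not counted.
    """
    return num_keypoints * dims * bits_per_component


# One descriptor: 128 components x 32 bits.
bits_one = sift_payload_bits(num_keypoints=1)

# Hypothetical query image with 500 salient points.
bits_query = sift_payload_bits(num_keypoints=500)

print(bits_one, "bits per descriptor")
print(bits_query, "bits =", bits_query // 8, "bytes per query image")
```

Under these assumptions, a single query costs 4096 bits per descriptor and roughly a quarter of a megabyte in total, which is what motivates the compact descriptors discussed next.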
A number of training-based methods are known for compressing image descriptors. Boosting Similarity Sensitive Coding (BoostSSC) and Restricted Boltzmann Machines (RBM) are known for learning compact GIST codes for content-based image retrieval. Semantic hashing has been transformed into a spectral hashing problem, in which it is only necessary to calculate eigenfunctions of the GIST features, providing better retrieval performance than BoostSSC and RBM.
Besides these relatively recently developed machine learning methods, some conventional training-based techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) have also been used to generate compact image descriptors. In particular, PCA has been used to produce small image descriptors by applying techniques such as product quantization and distributed source coding. Alternatively, small image descriptors can be obtained by applying LDA to SIFT-like descriptors followed by binary quantization.
While training-based methods perform accurately in conventional image retrieval, they can become cumbersome in augmented reality applications, where the database continuously evolves as new landmarks, products, etc. are added, resulting in new image statistics and necessitating repeated training.
As a source coding-based alternative to training-based dimensionality reduction, a low-bit-rate descriptor, the Compressed Histogram of Gradients (CHoG), is known specifically for augmented reality applications. In that method, gradient distributions are explicitly compressed, resulting in low-rate scale-invariant descriptors.
Other techniques are known for efficient remote image matching based on Locality Sensitive Hashing (LSH), which is computationally simpler than CHoG but less bandwidth-efficient, and does not need training. Random projections of scale-invariant features are determined, followed by one-bit quantization. The resulting descriptors are used to establish visual correspondences between images acquired in a wireless camera network. The same technique can be applied to content-based image retrieval, and a bound is obtained for the minimum number of bits needed for a specified accuracy of nearest neighbor search. However, those methods do not consider a tradeoff between dimensionality reduction and quantization levels.
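The random-projection approach with one-bit quantization can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: the helper names (`random_projections`, `one_bit_code`, `hamming_distance`, `unit`), the Gaussian projection matrix, the 256-bit code length, and the synthetic 128-dimensional vectors standing in for scale-invariant features are all hypothetical choices.

```python
import math
import random


def random_projections(num_bits, dims, seed=0):
    """Draw a num_bits x dims matrix of i.i.d. Gaussian entries (assumed design)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_bits)]


def one_bit_code(vector, projections):
    """Project the vector onto each row and keep only the sign (one bit each)."""
    return [1 if sum(p * v for p, v in zip(row, vector)) >= 0.0 else 0
            for row in projections]


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit codes differ."""
    return sum(x != y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


# Synthetic stand-ins for descriptors: a query, a slightly perturbed copy,
# and an unrelated vector.
rng = random.Random(1)
query = unit([rng.gauss(0.0, 1.0) for _ in range(128)])
near = unit([x + rng.gauss(0.0, 0.01) for x in query])
far = unit([rng.gauss(0.0, 1.0) for _ in range(128)])

proj = random_projections(num_bits=256, dims=128)
q_code, n_code, f_code = (one_bit_code(v, proj) for v in (query, near, far))

print("query vs near:", hamming_distance(q_code, n_code))
print("query vs far: ", hamming_distance(q_code, f_code))
```

Because the probability that a sign bit differs grows with the angle between two vectors, the Hamming distance between codes serves as a proxy for descriptor similarity, and each descriptor costs only 256 bits in this sketch. The number of projections and the number of bits kept per projection are fixed here, which is the tradeoff the text notes is left unexplored.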