The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
With the advent of portable devices such as smart phones, tablets, phablets, etc. and applications related to augmented reality, there is an increasing need for fast and accurate recognition of objects based on image data that does not require a lot of memory space. Various efforts have been directed toward improving the scope, accuracy, compactness, efficiency or speed of image recognition technologies. For example, “Searching In One Billion Vectors: Re-Rank With Source Coding”, by Herve Jegou, Romain Tavenard et al. (International Conference on Acoustics, Speech and Signal Processing, Prague, Czech Republic (2011)) proposes an alternative to the standard post-verification scheme that could require less memory, and is potentially more cost effective. Unfortunately, the proposed approach suffers from various disadvantages, including, for example, a slower response time.
As another example, International Patent Application No. 2013/056315 to Vidal et al. describes a method for classifying objects from training images by extracting features, clustering the features into groups of features (visual words), storing visual words with color and texture information, generating a vocabulary tree to store clusters of visual words with common characteristics, and using the trained classification process to classify objects in images.
Similarly, U.S. Pat. No. 7,680,341 to Perronnin describes a method of classifying an image that includes the steps of extracting model fitting data from an image respective to a generative model embodying a merger of a general visual vocabulary and an image class-specific visual vocabulary; “Video Google: A Text Retrieval Approach to Object Matching in Videos”, by Josef Sivic and Andrew Zisserman describes methods that include a step of building a visual vocabulary from sub-parts of a movie by vector quantization of descriptors into clusters using K-means clustering; “Object Categorization by Learned Universal Visual Dictionary”, by J. Winn et al., describes clustering using a K-means approach and estimating cluster centres to define a visual dictionary; “Probabilistic Appearance Based Navigation and Loop Closing”, by Mark Cummins and Paul Newman discusses an observation of a scene based on a “bag of words”; U.S. Patent Application Publication No. 2013/0202213 to Adamek et al. describes an offline process wherein a large number of descriptor examples are clustered into a vocabulary of visual words, which defines a quantization of a descriptor space; and CN 102063472 to Lingyu Duan et al. describes a method in which a client side (1) obtains an image and relevance information, (2) sends the relevance information to a server that searches a vision word dictionary in a vision dictionary library inside the server, and (3) obtains a vision word of the image.
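By way of illustration only, the general idea shared by several of the above references, clustering descriptors with K-means into a vocabulary of visual words and then describing an image as a "bag of words", can be sketched as follows. This is a minimal sketch and not the method of any cited reference; the function names (`build_vocabulary`, `nearest_word`, `bag_of_words`), the fixed iteration count, and the random seeding of cluster centres are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def build_vocabulary(descriptors, k, iterations=10, seed=0):
    """Plain K-means: returns k cluster centres used as visual words.

    The seed and iteration count are illustrative assumptions.
    """
    rng = random.Random(seed)
    centres = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        # Assignment step: group each descriptor with its nearest centre.
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(d, centres)].append(d)
        # Update step: move each centre to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centre
                centres[i] = [sum(col) / len(group) for col in zip(*group)]
    return centres


def bag_of_words(descriptors, vocabulary):
    """Histogram counting how many descriptors quantize to each visual word."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_word(d, vocabulary)] += 1
    return histogram
```

For example, calling `build_vocabulary` on a list of two-dimensional toy descriptors with `k=2` and then passing the same descriptors to `bag_of_words` yields a two-bin histogram of word counts. Real descriptors would have far more dimensions, and practical vocabularies far more words.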
A University of Oxford publication titled “Scalable Object Retrieval in Very Large Image Collections” by James Philbin, published in 2010, discloses building a vocabulary of 500,000 to 1 million words using an approximate K-means algorithm. Philbin discusses that increased vocabulary size does not improve the accuracy of image recognition. Rather, the recognition accuracy depends on the database images that are used to build the vocabulary.
These and all other extrinsic materials and publications identified herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
Unfortunately, known efforts apparently fail to optimize the number of descriptors that could be represented by, or associated with, a compact dictionary, and have apparently failed to appreciate that descriptor size impedes client-server communication. For example, SIFT descriptors could be up to 128 bytes in size. A smart phone that captures an image might generate several hundred or several thousand descriptors, resulting in hundreds of kilobytes of data that need to be transmitted from one device to another. Unfortunately, many smart phones communicate over bandwidth-, cost- and latency-sensitive wireless channels. Sending such relatively large amounts of data over a wireless channel negatively impacts the user experience during a recognition activity and decreases the responsiveness of the devices.
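The scale of the problem can be checked with simple arithmetic. The following minimal sketch uses the 128-byte descriptor size stated above; the function names and the uplink rate used in the usage note are hypothetical values chosen only for illustration.

```python
SIFT_DESCRIPTOR_BYTES = 128  # descriptor size stated in the text


def payload_kilobytes(num_descriptors,
                      bytes_per_descriptor=SIFT_DESCRIPTOR_BYTES):
    """Raw descriptor payload in kilobytes (1 KB = 1024 bytes)."""
    return num_descriptors * bytes_per_descriptor / 1024


def transfer_seconds(num_descriptors, uplink_kbps,
                     bytes_per_descriptor=SIFT_DESCRIPTOR_BYTES):
    """Idealized transmission time, ignoring protocol overhead and latency.

    uplink_kbps is a hypothetical link rate in kilobits per second.
    """
    bits = num_descriptors * bytes_per_descriptor * 8
    return bits / (uplink_kbps * 1000)
```

Under these assumptions, `payload_kilobytes(1000)` gives 125.0 KB for a single image with one thousand descriptors, and `transfer_seconds(1000, uplink_kbps=1000)` gives roughly one second on an assumed 1 Mbit/s uplink before any protocol overhead or round-trip latency is counted.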
Thus, there is still a need for improved systems and methods of image recognition and content information retrieval.