Visual search systems are known and use captured images as “queries” against a database of reference images in order to retrieve information related to the content of the captured image. For example, after a user takes a photo of the facade of a museum, the smartphone used to capture the image processes the image to generate feature descriptors that effectively describe the image for purposes of the query. In such a situation, which is a rapidly growing area of visual search research and development, the smartphone thereafter communicates the generated feature descriptors to a remote system containing a database, and the database is searched or queried using the feature descriptors to identify the captured image in the database. The remote system may thereafter communicate the results of this query to the smartphone for presentation to the user, such as the location, opening times, and ticket costs where the captured image is of a museum that the user is interested in visiting.
A typical visual search pipeline includes an interest-point detector, a feature descriptor generator, a matching stage, and a geometry consistency checker. The most successful visual search techniques make use of invariant features, where the term invariant refers to the ability of the detection algorithm to detect image points and describe their surrounding regions in a manner that is tolerant to affine transformations such as rotation, translation, and scaling. In the current state of the art there exist many invariant feature extraction and description algorithms. The Scale Invariant Feature Transform (SIFT) algorithm is often used as a reference algorithm because it is typically the algorithm that provides the best recognition rate (i.e., matching of feature descriptors associated with a captured image with the proper corresponding image in an image database). The computational costs of the SIFT algorithm, however, are quite high (e.g., a throughput of less than 2 frames per second (FPS) at VGA resolution on a modern desktop computer), so there are other algorithms, like the Speeded Up Robust Features (SURF) algorithm, that sacrifice some of the precision obtained with the SIFT algorithm for improved speed.
In the typical visual search pipeline, the matching stage simply consists of a many-to-many comparison between the feature descriptors in the scene and the ones stored in the database, using a predefined metric function that depends on the description algorithm (e.g., a ratio of L2 distances for SIFT feature descriptors). Finally, the geometry consistency checker, such as a Random Sample Consensus (RANSAC) checker, processes the entire set of matched feature descriptors to retrieve a valid transformation model between the query and reference images, with the goal of removing false matches (i.e., outliers) and increasing the recognition quality of the algorithm. Typical visual search applications utilize pair-wise matching, which simply compares two images, and a more complex analysis named retrieval, in which a query image is looked up inside a reference image data set that is potentially a very large image database like Google Images™ or Flickr™.
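The matching stage and the geometry consistency check described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the 0.8 ratio threshold, the 2-pixel tolerance, and the toy descriptors are all hypothetical, and the transformation model is reduced to a pure 2D translation so that a single match is enough to form a hypothesis. A practical checker would typically estimate a richer model, such as a homography.

```python
import math
import random


def l2(a, b):
    """Euclidean (L2) distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(query_desc, ref_desc, max_ratio=0.8):
    """Return (query_index, ref_index) pairs that pass the ratio test.

    A query descriptor is matched to its nearest reference descriptor only
    if that distance is clearly smaller than the distance to the second
    nearest one. max_ratio is an assumed, illustrative threshold.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        ranked = sorted((l2(q, r), ri) for ri, r in enumerate(ref_desc))
        if len(ranked) < 2:
            continue
        (d1, best), (d2, _) = ranked[0], ranked[1]
        if d1 < max_ratio * d2:
            matches.append((qi, best))
    return matches


def ransac_translation(query_pts, ref_pts, matches, tol=2.0, iters=50, seed=0):
    """Keep the matches consistent with one 2D translation (a toy model).

    Each iteration samples one match, hypothesizes the translation it
    implies, and collects the matches that agree within tol pixels.
    """
    if not matches:
        return []
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        qi, ri = rng.choice(matches)
        dx = ref_pts[ri][0] - query_pts[qi][0]
        dy = ref_pts[ri][1] - query_pts[qi][1]
        inliers = [
            (a, b) for a, b in matches
            if math.hypot(query_pts[a][0] + dx - ref_pts[b][0],
                          query_pts[a][1] + dy - ref_pts[b][1]) <= tol
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers


# Toy data: 4-dimensional descriptors stand in for real feature descriptors.
query_desc = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10],
              [5, 5, 0, 0]]  # the last descriptor is deliberately ambiguous
ref_desc = [[10, 1, 0, 0], [0, 10, 1, 0], [0, 0, 10, 1], [1, 0, 0, 10]]
query_pts = [(0, 0), (10, 0), (0, 10), (10, 10), (3, 3)]
ref_pts = [(5, 3), (15, 3), (5, 13), (40, 40)]  # last point breaks the geometry

matches = ratio_test_matches(query_desc, ref_desc)
inliers = ransac_translation(query_pts, ref_pts, matches)
```

In this toy data the ambiguous fifth query descriptor is rejected by the ratio test, and the fourth match, although its descriptors agree, is discarded by the geometric check because its keypoint positions do not follow the translation shared by the other three matches.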
Each of the phases described above incurs high computational costs due to the amount of data involved or the complexity of the calculations. The number of bytes (i.e., the length) used in the feature descriptor is also an important factor, both for the potential transmission overhead in a client-server environment, such as where a mobile device like a smartphone (i.e., the client) is the image capture device, and for the amount of storage space required to store all the desired content of the image database (i.e., the server). Improvements in the matching stage have been proposed, such as by the creator of the SIFT algorithm, who proposed the use of KD-Trees to approximate the lookup of the nearest feature vector inside a database. Other improvements have been applied to the geometry consistency checker using algorithms that are less computationally intensive than RANSAC, such as the DISTRAT algorithm. Finally, the compression and transmission of an image's features in the form of feature descriptors is the core topic of the Moving Picture Experts Group (MPEG) standardization activity named Compact Descriptors for Visual Search (CDVS). For example, the Compressed Histogram of Gradients (CHoG) algorithm is a feature descriptor algorithm or approach designed to produce a compact representation by applying sampling and discretization to SIFT-like feature descriptors.
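The KD-Tree lookup mentioned above can be illustrated with a minimal pure-Python sketch. It is not the published procedure: the dictionary-based tree, the function names, and the `max_checks` node budget are assumptions made here for illustration. With `max_checks=None` the search is exact; a finite budget stops the search early and returns the best candidate found so far, which is one simple way to trade accuracy for speed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(items, depth=0):
    """Build a k-d tree from (id, vector) pairs as nested dicts."""
    if not items:
        return None
    axis = depth % len(items[0][1])  # cycle through the dimensions
    items = sorted(items, key=lambda item: item[1][axis])
    mid = len(items) // 2
    return {
        "item": items[mid],
        "axis": axis,
        "left": build_kdtree(items[:mid], depth + 1),
        "right": build_kdtree(items[mid + 1:], depth + 1),
    }


def nearest(root, query, max_checks=None):
    """Return (squared_distance, id) of the nearest stored vector found.

    max_checks (hypothetical parameter) caps the number of visited nodes;
    None means no cap, which makes the search exact.
    """
    best = [float("inf"), None]
    checks = [0]

    def visit(node):
        if node is None:
            return
        if max_checks is not None and checks[0] >= max_checks:
            return
        checks[0] += 1
        ident, vec = node["item"]
        d = sq_dist(vec, query)
        if d < best[0]:
            best[0], best[1] = d, ident
        diff = query[node["axis"]] - vec[node["axis"]]
        if diff < 0:
            near, far = node["left"], node["right"]
        else:
            near, far = node["right"], node["left"]
        visit(near)
        # The far side can only hold a closer vector if the splitting
        # plane is nearer than the best distance found so far.
        if diff * diff < best[0]:
            visit(far)

    visit(root)
    return best[0], best[1]


# Toy database of 2D vectors; real descriptors have many more dimensions.
db = [(0, (2, 3)), (1, (5, 4)), (2, (9, 6)), (3, (4, 7)), (4, (8, 1)), (5, (7, 2))]
tree = build_kdtree(db)
exact = nearest(tree, (9, 2))
budgeted = nearest(tree, (9, 2), max_checks=1)
```

Here `exact` identifies the stored vector `(8, 1)`, while `budgeted` inspects only the root node and returns `(7, 2)`, which is close but not the true nearest neighbor. The pruning test becomes much less effective as the number of dimensions grows, which is one reason approximate variants are of interest for long descriptors.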