1. Field of the Invention
The present application is directed generally to image search and, more specifically, to improving image search in large-scale databases.
2. Description of the Related Art
There have been several attempts to address the problem of searching for a specific planar object within a large-scale image database. Given a query image that contains a particular object with a planar surface, the objective is to find, from a large image corpus, a set of representative images in which that object appears. According to the complexity of the geometric transformations between a query image and its target images, the problem can be categorized into two classes: Rotation-Scale-Translation (RST)-transformed image search, and affine/homography-transformed image search. Two concrete applications based on the former technique are EMM identification and partial-duplicate image detection. In EMM identification, query images are phone-captured copies of a source image with slight 3D viewpoint changes; therefore, the major geometric changes between two matched images can be approximated by RST transformations. Similarly, in partial-duplicate image detection, image variations are mainly due to 2D digital-editing techniques applied to a source image; hence, an RST transformation can account well for the geometric changes between matched images in this case.
However, in many cases, images undergo transformations that are more general and complex than RST, such as affine or even homography transformations. For example, a user may capture a picture of a movie poster on the street from a certain viewing angle. Based on this captured query image, the user may want to search for high-quality versions of the image or for related online film reviews. Applications in this scenario rely on the latter technique, “affine/homography-transformed image search”, for better search precision.
Bag-Of-Words Representation
Bag-of-words representations, together with the inverted file indexing technique, have demonstrated impressive performance in terms of scalability and accuracy. However, bag-of-words representations discard all the spatial information of visual words, which greatly limits their descriptive ability; as a result, search precision is usually very low. Many approaches have been proposed to compensate for the loss of spatial information and thereby improve search accuracy. Other approaches utilize full geometric verification methods, which achieve robustness and high search precision at significant computational expense. A more efficient approach is to augment a basic visual-word representation with the spatial relationships between neighboring features, but existing approaches based on this idea incur high memory cost.
State-of-the-art methods for image search tend to rely on bag-of-words representations and scalable textual indexing schemes. However, a bag-of-words representation disregards all the spatial layout information of visual words, which greatly limits its descriptive ability and leads to low search precision. To compensate for the loss of spatial information, some methods utilize a spatial pyramid matching scheme, which partitions an image into increasingly fine sub-regions and only matches visual words inside corresponding sub-regions. Even though such a scheme is fast and simple to implement, its hard gridding is sensitive to misalignment of sub-regions caused by large geometric transformations.
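By way of illustration only, the following Python sketch shows how a bag-of-words representation combined with an inverted file index might be organized, and why it carries no spatial information. The function names, the toy visual-word identifiers, and the histogram-intersection score are assumptions made for this example and are not taken from any particular prior-art system.

```python
from collections import Counter, defaultdict

def build_inverted_index(images):
    """images: dict mapping image_id -> list of visual-word ids (toy data)."""
    index = defaultdict(dict)
    for image_id, words in images.items():
        for word, count in Counter(words).items():
            index[word][image_id] = count  # posting: word -> {image: frequency}
    return index

def search(index, query_words):
    """Score indexed images by histogram intersection with the query."""
    scores = Counter()
    for word, q_count in Counter(query_words).items():
        for image_id, count in index.get(word, {}).items():
            scores[image_id] += min(q_count, count)
    return scores.most_common()

# Only word ids are stored; feature positions are never recorded.
database = {
    "poster": [3, 3, 7, 9, 12],
    "street": [1, 3, 4, 4, 8],
    "rearranged_poster": [12, 9, 7, 3, 3],  # same words, different layout
}
index = build_inverted_index(database)
ranking = search(index, [3, 7, 9, 12])
```

Because only word frequencies are indexed, "poster" and "rearranged_poster" receive the same score in this sketch, which is the loss of spatial information described above.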
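The sensitivity of hard gridding to misalignment can be sketched as follows. This is a simplified, single-level illustration rather than a full spatial pyramid with per-level weights; the feature tuples, coordinates normalized to [0, 1), and function names are assumptions made for the example.

```python
from collections import Counter

def cell_histograms(features, level):
    """features: (x, y, word) tuples with x, y in [0, 1).
    Returns counts keyed by (cell_x, cell_y, word) on a 2^level grid."""
    n = 2 ** level
    return Counter((int(x * n), int(y * n), word) for x, y, word in features)

def grid_match(features_a, features_b, level):
    """Count visual words that match inside corresponding grid cells."""
    hist_a = cell_histograms(features_a, level)
    hist_b = cell_histograms(features_b, level)
    return sum(min(count, hist_b[key]) for key, count in hist_a.items())

original = [(0.10, 0.10, 5), (0.40, 0.20, 7), (0.30, 0.70, 9), (0.60, 0.80, 5)]
# The same features after a horizontal shift of 0.2.
shifted = [(0.30, 0.10, 5), (0.60, 0.20, 7), (0.50, 0.70, 9), (0.80, 0.80, 5)]

whole_image = grid_match(original, shifted, level=0)  # one cell: no layout used
two_by_two = grid_match(original, shifted, level=1)   # hard 2x2 grid
```

In this toy case every word matches when the whole image is a single cell, but the modest shift moves two features across a cell boundary, so the 2x2 grid loses half of the true matches.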
Full Geometric Verification
Full geometric verification methods, which utilize robust fitting methods such as RANSAC or Least Median of Squares (LMEDS), can cope with general transformations and hence are usually employed to remove false matches. Typically, a hypothesized transformation model between two images is estimated from matched features. All of the features are then verified against the estimated model, and those that are inconsistent with the model are removed as outliers. However, a full model fitting procedure is computationally expensive. In addition, due to the large percentage of outliers arising from quantization errors and background clutter, full fitting methods such as RANSAC or LMEDS usually perform poorly.
To address the problems of full geometric verification, some methods employ an outlier-filtering strategy based on an efficient but weaker geometric verification before applying a full model fitting procedure. Other methods try to augment the bag-of-words representation with spatial relationships between neighboring visual words. For example, one conventional method bundles visual words into groups using detected maximally stable extremal regions (MSER), and enforces a set of geometric constraints within each group. However, the performance of such a method largely depends on the robustness of the bundling scheme, i.e., the repeatability of the MSER detector, which may easily fail on textual document images where few uniform regions can be detected. Other methods utilize a spatial coding scheme, which takes each feature point as a center and encodes the relative positions between this feature and its neighboring features. Unfortunately, such methods require too much memory for storing the spatial maps of all features, and therefore tend to be rather impractical.
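The hypothesize-and-verify loop described above can be sketched as follows. To keep the example short, the sketch fits an RST (similarity) model rather than an affine or homography model, and represents 2D points as complex numbers so that the model is q = a*p + b. The tolerance, iteration count, and synthetic matches are assumptions made for illustration.

```python
import random

def fit_rst(p1, q1, p2, q2):
    """Fit q = a*p + b from two matches; complex a encodes rotation and
    scale, complex b encodes translation."""
    a = (q1 - q2) / (p1 - p2)
    return a, q1 - a * p1

def ransac_rst(matches, iterations=200, tolerance=2.0, seed=0):
    """matches: list of (p, q) complex pairs. Returns (model, inlier indices)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (p1, q1), (p2, q2) = rng.sample(matches, 2)  # minimal sample
        if p1 == p2:
            continue  # degenerate sample, cannot fit a model
        a, b = fit_rst(p1, q1, p2, q2)
        # Verify every match against the hypothesized model.
        inliers = [i for i, (p, q) in enumerate(matches)
                   if abs(a * p + b - q) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Synthetic data: 8 true matches under a known model, then 4 false matches.
a_true, b_true = complex(0, 2), complex(10, -5)  # rotate 90 deg, scale 2, shift
inlier_points = [complex(0, 0), complex(4, 0), complex(0, 3), complex(5, 5),
                 complex(2, 7), complex(8, 1), complex(6, 4), complex(3, 9)]
matches = [(p, a_true * p + b_true) for p in inlier_points]
matches += [(complex(1, 1), complex(100, 100)), (complex(7, 7), complex(-80, 40)),
            (complex(9, 2), complex(60, -90)), (complex(2, 4), complex(-50, -70))]
model, inliers = ransac_rst(matches)
```

Each iteration verifies every match against the hypothesized model, so the cost grows with both the number of matches and the number of iterations; the number of iterations needed also rises quickly as the outlier ratio increases.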
The Hough Transform
Another strategy is to use the Hough transform to deal with outliers, i.e., false matches. The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.
The classical Hough transform was concerned with the identification of lines in an image, but the Hough transform was later extended to identify the positions of arbitrary shapes, most commonly circles or ellipses.
The Generalized Hough Transform (GHT) is a modification of the Hough transform that uses the principle of template matching. This modification enables the Hough transform to be used not only for the detection of an object described by an analytic equation (e.g., a line or a circle), but also for the detection of an arbitrary object described by its model.
The problem of finding an object (described by a model) in an image can be solved by finding the model's position in the image. With the Generalized Hough Transform, the problem of finding the model's position is transformed into the problem of finding the parameters of the transformation that maps the model into the image. Once the values of these parameters are known, the position of the model in the image can be determined.
The original implementation of the GHT uses edge information to define a mapping from the orientation of an edge point to a reference point of the shape. In the case of a binary image, where pixels can be either black or white, every black pixel of the image can be a black pixel of the desired pattern, thus creating a locus of reference points in the Hough space. Every pixel of the image votes for its corresponding reference points. The location of the cell with the maximum number of votes in the Hough space indicates the pattern parameters of the image.
The main drawbacks of the GHT are its substantial computational and storage requirements, which become acute when object orientation and scale have to be considered. Orientation information of the edge has been utilized to decrease the cost of the computation. Other GHT techniques have also been suggested, such as the SC-GHT (using slope and curvature as local properties).
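For illustration, the voting procedure of the classical Hough transform for lines can be sketched as follows, assuming the common normal parameterization rho = x*cos(theta) + y*sin(theta). The bin sizes and the toy point set are assumptions chosen for the example.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_step=0.25):
    """Each point votes for every (theta, rho) bin of a line passing through it."""
    accumulator = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            accumulator[(t, round(rho / rho_step))] += 1
    return accumulator

# Ten collinear points on y = x + 2, plus three points off the line.
points = [(i, i + 2) for i in range(10)] + [(3, 9), (8, 0), (1, 7)]
accumulator = hough_lines(points)
(best_theta_deg, best_rho_bin), votes = accumulator.most_common(1)[0]
```

The accumulator maximum corresponds to theta = 135 degrees and rho of approximately 1.41, which is the normal form of the line y = x + 2; the three off-line points vote elsewhere and do not affect the peak.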
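The binary-image voting described above can be sketched as follows. This is a deliberately simplified, translation-only illustration: it does not index votes by edge orientation as the original GHT does, and it ignores rotation and scale. The template shape, the choice of reference point, and the function name are assumptions made for the example.

```python
from collections import Counter

def ght_translation(template_pixels, image_pixels):
    """Vote for the image location of the template's reference point."""
    ref_x, ref_y = template_pixels[0]  # assumed reference point: first pixel
    offsets = [(x - ref_x, y - ref_y) for x, y in template_pixels]
    accumulator = Counter()
    for px, py in image_pixels:
        # This black pixel could be any pixel of the pattern, so it votes
        # for every reference-point location consistent with that.
        for dx, dy in offsets:
            accumulator[(px - dx, py - dy)] += 1
    return accumulator

# L-shaped template of five black pixels.
template = [(0, 0), (1, 0), (2, 0), (0, 1), (0, 2)]
# Image: the template translated by (5, 3), plus two unrelated black pixels.
image = [(x + 5, y + 3) for x, y in template] + [(1, 1), (9, 8)]

accumulator = ght_translation(template, image)
location, votes = accumulator.most_common(1)[0]
```

Here the accumulator peak lies at (5, 3) with one vote per template pixel. Extending the accumulator with rotation and scale dimensions multiplies both the number of votes cast and the storage required.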
Hough transforms have been used to remove obvious outliers (false matches of features) and to identify clusters of inliers (true matches of features) that imply a consistent transformation interpretation. Some methods utilize four parameters, namely the two-dimensional location, scale, and orientation of a query feature relative to its matched indexed feature, to vote for one of the coarsely quantized bins in Hough space. Clusters of features that fall into bins with more than three votes are then used to estimate affine projection parameters based on a least-squares solution. Other methods simply utilize the differences of scale and orientation between two matched visual words and filter out matches that do not vote for the main bins.
However, present Hough transform methods operate on individual corresponding feature pairs without considering any constraints from surrounding features, which are an important clue for outlier filtering. Moreover, due to the rough parameter estimations in these methods, a large, pre-determined quantization bin size is used for Hough space voting, which yields limited inlier-outlier separation and places a greater computational burden on the subsequent full geometric verification. Even though such Hough transform methods may be fast, all of them utilize very roughly estimated parameters, coarse and pre-determined Hough spaces, and a simple voting strategy. Therefore, they tend not to perform as well as RANSAC in separating outliers from inliers, especially for complex transformations, e.g., affine or homography transformations.
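The simpler of the two strategies above, voting on scale and orientation differences, can be sketched as follows. The bin sizes (one octave of scale, 30 degrees of rotation), the match tuple layout, and the decision to keep only the single most-voted bin are assumptions made for illustration.

```python
import math
from collections import Counter

def filter_matches(matches, angle_bin_deg=30.0):
    """matches: (query_scale, query_angle_deg, db_scale, db_angle_deg) tuples.
    Returns indices of matches that vote for the most populated bin."""
    bins = []
    for q_scale, q_angle, d_scale, d_angle in matches:
        scale_bin = round(math.log2(d_scale / q_scale))  # one bin per octave
        angle_bin = int(((d_angle - q_angle) % 360.0) // angle_bin_deg)
        bins.append((scale_bin, angle_bin))
    main_bin, _ = Counter(bins).most_common(1)[0]
    return [i for i, b in enumerate(bins) if b == main_bin]

matches = [
    (1.0, 10.0, 2.1, 55.0),    # roughly 2x scale, 45 degree rotation
    (2.0, 100.0, 3.9, 146.0),  # consistent with the first match
    (1.5, 200.0, 3.0, 243.0),  # consistent
    (0.8, 350.0, 1.7, 36.0),   # consistent (angle wraps past 360)
    (1.0, 0.0, 1.0, 180.0),    # false match: no scale change, 180 degrees
    (2.0, 30.0, 0.5, 40.0),    # false match: 0.25x scale
]
kept = filter_matches(matches)
```

Each match votes independently of its neighbors, and the bins must be coarse enough to absorb estimation error in scale and orientation; a true match that falls just across a bin boundary would be discarded in this sketch.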