Image and object retrieval has been an active research topic for decades because of its many applications, for example, web image search, mobile visual search (e.g., product image search on a mobile device), and personal photo management. Many conventional retrieval techniques adopt the bag-of-words (BOW) model. In the BOW model, a visual vocabulary is first built by clustering a large collection of local features, such as scale-invariant feature transform (SIFT) features. In the retrieval stage, each feature extracted from the query image is assigned to its closest visual word in the vocabulary. The query image is accordingly represented by a global histogram of visual words and is matched against database images using tf-idf weighting and inverted files.
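By way of illustration only, the conventional pipeline described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names (build_vocabulary, quantize, build_index, search) are hypothetical, plain k-means stands in for whatever clustering a given system uses, cosine similarity over tf-idf vectors is assumed as the matching score, and local descriptors (e.g., SIFT) are assumed to have been extracted already and supplied as NumPy arrays.

```python
import numpy as np
from collections import defaultdict

def quantize(descriptors, centers):
    """Assign each descriptor to the index of its closest visual word."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_vocabulary(descriptors, k, iters=10, seed=0):
    """Build a k-word visual vocabulary with plain k-means (an assumed choice)."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[picks].astype(float)
    for _ in range(iters):
        words = quantize(descriptors, centers)
        for j in range(k):
            members = descriptors[words == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def build_index(db_descriptors, centers):
    """Return idf weights and an inverted file: word -> [(image id, weight)]."""
    k, n = len(centers), len(db_descriptors)
    tf = np.zeros((n, k))
    for i, desc in enumerate(db_descriptors):
        tf[i] = np.bincount(quantize(desc, centers), minlength=k)
    df = (tf > 0).sum(axis=0)              # number of images containing each word
    idf = np.log(n / np.maximum(df, 1))
    vecs = tf * idf
    vecs /= np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    inverted = defaultdict(list)
    for i in range(n):
        for w in np.nonzero(vecs[i])[0]:
            inverted[int(w)].append((i, vecs[i, w]))
    return idf, inverted

def search(query_desc, centers, idf, inverted, n_images):
    """Score database images by visiting only the query's visual words."""
    q = np.bincount(quantize(query_desc, centers), minlength=len(centers)) * idf
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = np.zeros(n_images)
    for w in np.nonzero(q)[0]:
        for i, weight in inverted.get(int(w), []):
            scores[i] += q[w] * weight
    return np.argsort(-scores), scores
```

In this sketch the inverted file lets the search touch only those database images that share at least one visual word with the query, which is why the BOW model scales to large collections.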
A fundamental problem with object retrieval techniques that use the BOW model is the model's lack of spatial information. Various techniques have been proposed to incorporate spatial constraints into the BOW model to improve retrieval accuracy. However, these techniques tend either to be too strict or to encode only weak constraints, so they solve the problem only partially and only in limited cases. While the BOW model generally works well, benefiting from its effective feature representation and its indexing schemes with inverted files, it still suffers from problems including, but not limited to, the loss of information (especially spatial information) when images are represented as histograms of quantized features, and the limited discriminative power of the features, either because of the degradation caused by feature quantization or because of their intrinsic inability to tolerate large variations in object appearance.
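The loss of spatial information can be seen in a small, purely illustrative example. The feature coordinates, visual word indices, and the helper name bow_histogram below are hypothetical and chosen only to make the point: two images containing the same visual words in different spatial arrangements produce identical histograms.

```python
import numpy as np

def bow_histogram(word_ids, vocab_size):
    """Count visual word occurrences; feature positions are discarded."""
    return np.bincount(word_ids, minlength=vocab_size)

# Each row is one quantized feature: (x, y, visual word index).
image_a = np.array([[10, 10, 0],
                    [50, 10, 1],
                    [10, 50, 2],
                    [50, 50, 3]])

# Same visual words, but every word now sits at a different position.
image_b = image_a.copy()
image_b[:, :2] = image_a[::-1, :2]

hist_a = bow_histogram(image_a[:, 2], vocab_size=4)
hist_b = bow_histogram(image_b[:, 2], vocab_size=4)

same_histogram = np.array_equal(hist_a, hist_b)  # histograms cannot tell them apart
same_layout = np.array_equal(image_a, image_b)   # yet the layouts differ
```

Here same_histogram is true while same_layout is false, so a matcher that compares only the histograms treats the two arrangements as the same image.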
As a result, the BOW model does not work well for certain applications (e.g., mobile product image search) in which the objects (e.g., products) in the database images are mostly well aligned and captured in studio environments with controlled lighting. In such database images, the background is often clean and texture details are clear. See FIG. 4A for an example of a database image. The BOW model also does not work well when the query images are taken under lighting conditions different from those of the database images and/or against a cluttered background. In such situations, large viewpoint variations may exist between the query and database images. Moreover, motion blur and out-of-focus blur are common in query images captured by mobile phones and may further degrade the performance of a BOW model. See FIG. 4B for example query images taken by a mobile device. The BOW model may also struggle with objects that are non-planar (e.g., shoes) and/or less textured (e.g., clothing). For such objects, standard RANSAC-based verification can fail. The BOW model may additionally not perform well for objects that are visually similar to each other, such as shoes. For such objects, only a small portion of the visual features can distinguish between them, so a fine-grained discrimination strategy is needed for correct identification.
Moreover, when a BOW model is used to perform certain object retrieval tasks, the results may be negatively affected by features extracted from the background of the query images. Even when the location of the object is specified in the query image, the features around the occluding boundaries of the object may still differ substantially from those extracted against a clean background. The query object may be segmented from the query image by manual labeling. However, simple labeling (e.g., specifying the object by a bounding rectangle) can yield inaccurate segmentation results, and more precise labeling can be overly burdensome for users.