In an object retrieval system, a user may select a query image and mark a region of interest in the query image around the object of interest (referred to as the query object) to indicate the search intent. Features may be extracted from the region of interest and quantized into visual words. The visual word representation of the region of interest may then be used to identify relevant images.
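As an illustration only, the extraction and quantization steps described above might be sketched as follows. This is a minimal sketch, not the method of any particular system: the function names (`in_box`, `quantize`, `roi_visual_words`), the toy two-dimensional descriptors, and the three-entry vocabulary are all hypothetical, and nearest-centroid assignment is assumed as the quantization rule.

```python
from collections import Counter
from math import dist


def in_box(point, box):
    """Return True if (x, y) lies inside box = (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def quantize(descriptor, vocabulary):
    """Map a descriptor to the index of its nearest vocabulary centroid."""
    return min(range(len(vocabulary)),
               key=lambda i: dist(descriptor, vocabulary[i]))


def roi_visual_words(features, box, vocabulary):
    """Count the visual words of the features located inside the box."""
    words = [quantize(desc, vocabulary)
             for pos, desc in features if in_box(pos, box)]
    return Counter(words)


# Hypothetical toy data: each feature is (location, descriptor).
# Real local descriptors have many more dimensions than two.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [
    ((10, 10), (0.1, -0.1)),  # inside the box, nearest to word 0
    ((12, 15), (0.9, 1.2)),   # inside the box, nearest to word 1
    ((14, 11), (1.1, 0.8)),   # inside the box, nearest to word 1
    ((40, 40), (5.0, 4.9)),   # outside the box, ignored
]
box = (5, 5, 20, 20)

histogram = roi_visual_words(features, box, vocabulary)
print(dict(histogram))
```

The resulting histogram of visual words is the representation that may be compared against indexed images to estimate relevance.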
However, current object retrieval methods may fail to return satisfactory results under certain circumstances. For example, if the region of interest specified by the user is inaccurate, or if the object captured in the query image is too small to provide discriminative details, object retrieval may return erroneous matches or too few matches with similar objects. In other words, object retrieval based on visual words may not achieve reliable search results where the visual words extracted from the region of interest do not reliably reveal the search intent of the user.
A user typically specifies a region of interest using a bounding box, i.e., a rectangle that delimits a portion of the query image. However, the bounding box may be only a rough approximation of the region of interest representing the query object. For example, the bounding box may not accurately represent the region of interest because the bounding box is rectangular while the region of interest may have a complex shape. In that case, the visual words extracted from the bounding box may include information that is unrelated to the search intent. In addition, where the region of interest is too small, or where the query object lacks discriminative details, the number of visual words derived from the bounding box may be insufficient for a reliable relevance estimation, with the consequence that irrelevant images may be returned.
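The mismatch between a rectangular bounding box and a non-rectangular region of interest can be made concrete with a small, purely hypothetical example. The sketch below assumes a disc-shaped query object, its tightest enclosing bounding box, and one feature at every integer pixel location; none of these choices come from the source, and the helper names are illustrative.

```python
def inside_box(point, box):
    """Return True if (x, y) lies inside box = (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def inside_disc(point, center, radius):
    """Return True if (x, y) lies on the disc-shaped (assumed) query object."""
    x, y = point
    cx, cy = center
    return (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2


# Assumed geometry: a disc-shaped object and its tightest bounding box.
center, radius = (10, 10), 5
box = (5, 5, 15, 15)

# Assumed feature layout: one feature per integer location in a 21x21 image.
points = [(x, y) for x in range(21) for y in range(21)]

in_box_points = [p for p in points if inside_box(p, box)]
on_object = [p for p in in_box_points if inside_disc(p, center, radius)]

# Share of features in the box that do not belong to the query object.
background_fraction = 1 - len(on_object) / len(in_box_points)
print(len(in_box_points), len(on_object), round(background_fraction, 2))
```

Even with the tightest possible box, a noticeable share of the features inside it falls on background rather than on the object, which is the kind of unrelated information described above. Shrinking the box or the object in this sketch also reduces the number of available features, mirroring the case where too few visual words remain for a reliable relevance estimation.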