The exemplary embodiment relates to object identification in images. It finds particular application in connection with retail product retrieval in images.
For many applications it is desirable to be able to recognize specific objects in a photographic image or video. For example, given an image and a predefined set of objects or categories, the aim is to output all instances of the specific object or category of objects. However, object detection is a challenging task, due to the variety of imaging conditions (e.g., viewpoints, environments, and lighting conditions). Retail products in images, such as shampoos, laundry detergents, cereal packets, and the like, are particularly difficult to distinguish based on texture and shape alone, since similar but slightly different products are often differentiated by differences in color. For example, a laundry detergent with one scent may include a predominant purple colored area while a similar detergent with a different scent may include a predominant green colored area. The color differences allow consumers to identify similar products while being able to distinguish between different varieties. However, these differences have not been easy to exploit in product identification in images.
Some studies have been undertaken which use color names or color attributes for object retrieval or image classification. For example, object retrieval has been performed by region matching, and by describing a region of an image using one of ten color names, where a color name is assigned to a region according to its mean RGB value. See, Liu, et al., “Region-based image retrieval with high-level semantic color names,” Proc. 11th Int'l Multimedia Modelling Conf. (MMM 2005), pp. 180-187 (2005). In a similar approach, regions of an image are assigned a dominant color, by assigning each pixel in a region to one of eight color names and assigning the region the name of the most common color. See Vaquero, et al., “Attribute-based people search in surveillance environments,” IEEE WACV, pp. 1-8 (2009). Color names have been used for object detection by extracting, from an image window, a color template as a grid containing color name distributions in each cell. These templates were used to train window classification models. See Khan, et al., “Color attributes for object detection,” CVPR, pp. 3306-3313 (2012). In another approach, learned color attributes, rather than color names, are used to describe local patches for image classification. See Khan, et al., “Discriminative color descriptors,” CVPR, pp. 2866-2873 (2013).
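The region-labeling approaches cited above can be sketched as follows. This is a minimal illustration of assigning a color name to a region from its mean RGB value; the eight-color palette used here is an assumption for illustration, not the palette used in the cited works.

```python
# Assign a color name to an image region from its mean RGB value, in the
# spirit of the region-labeling approaches discussed above. The palette is
# an illustrative assumption, not the one used in the cited references.

PALETTE = {
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (255, 0, 0),
    "green": (0, 128, 0), "blue": (0, 0, 255), "yellow": (255, 255, 0),
    "purple": (128, 0, 128), "orange": (255, 165, 0),
}

def mean_rgb(pixels):
    """Mean RGB value of a region given as a list of (r, g, b) tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def color_name(pixels):
    """Label the region with the palette name nearest (Euclidean) its mean RGB."""
    m = mean_rgb(pixels)
    return min(PALETTE,
               key=lambda name: sum((m[c] - PALETTE[name][c]) ** 2
                                    for c in range(3)))

# A predominantly purple region (e.g., a detergent label) is named "purple".
region = [(120, 10, 130), (140, 5, 120), (125, 0, 125)]
```

Note that such a labeling collapses each region to a single symbolic color, discarding texture and shape information entirely.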
However, such color-based representations are, by themselves, poorly suited to image-based queries, because the concepts to be recognized (the products) are of a very high level. To compute similarity between images, the images to be compared are often embedded in a feature space by extraction of local descriptors, such as SIFT descriptors. See Lowe, D. G., “Object recognition from local scale-invariant features,” ICCV, vol. 2, pp. 1150-1157 (1999). Fisher vectors are often used to embed these local descriptors into a higher-dimensional space. The embedded descriptors are then aggregated and normalized to generate an overall representation of the image. This provides a sophisticated and discriminative embedding for images. However, because of the differences between color name-based and texture-based image representations, the two approaches are not readily combinable for querying.
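The Fisher-vector aggregation described above can be sketched as follows, under simplifying assumptions: a diagonal-covariance Gaussian mixture model with given parameters, gradients taken with respect to the Gaussian means only, and the common signed-square-root and L2 normalizations. This is an illustrative sketch of the general technique, not the exemplary embodiment's exact pipeline.

```python
# Sketch of Fisher-vector encoding of local descriptors under a
# diagonal-covariance GMM (gradients w.r.t. the means only).
import math

def posteriors(x, weights, means, sigmas):
    """Soft assignment gamma(k) of descriptor x to each Gaussian component."""
    logp = []
    for w, mu, sg in zip(weights, means, sigmas):
        ll = math.log(w)
        for xd, md, sd in zip(x, mu, sg):
            ll += -0.5 * math.log(2 * math.pi * sd * sd) \
                  - (xd - md) ** 2 / (2 * sd * sd)
        logp.append(ll)
    m = max(logp)
    exps = [math.exp(l - m) for l in logp]  # stabilized softmax
    s = sum(exps)
    return [e / s for e in exps]

def fisher_vector(descriptors, weights, means, sigmas):
    """Aggregate local descriptors into a normalized gradient vector."""
    K, D, T = len(weights), len(means[0]), len(descriptors)
    G = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        g = posteriors(x, weights, means, sigmas)
        for k in range(K):
            for d in range(D):
                G[k][d] += g[k] * (x[d] - means[k][d]) / sigmas[k][d]
    fv = [G[k][d] / (T * math.sqrt(weights[k]))
          for k in range(K) for d in range(D)]
    fv = [math.copysign(math.sqrt(abs(v)), v) for v in fv]  # power norm
    norm = math.sqrt(sum(v * v for v in fv)) or 1.0
    return [v / norm for v in fv]                           # L2 norm
```

The resulting vector has dimension K x D (number of Gaussians times descriptor dimension), which is the high-dimensional embedding referred to above.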
The exemplary embodiment allows both color name information and image texture information to be used jointly for object retrieval or object classification by embedding representations in a joint high-dimensional and discriminative feature space.
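One simple way to realize such a joint feature space, given as a sketch only and not as the exemplary embodiment's claimed method, is late fusion: independently L2-normalize a color-based representation and a texture-based representation and concatenate them, so that a single similarity measure operates over both.

```python
# Illustrative sketch only: place color-based and texture-based
# representations in one joint feature space by normalizing each
# independently and concatenating. The exemplary embodiment's actual
# embedding may differ; this shows the joint-space idea.
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (no-op on the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def joint_embedding(color_vec, texture_vec):
    """Concatenate independently normalized color and texture embeddings."""
    return l2_normalize(l2_normalize(color_vec) + l2_normalize(texture_vec))
```

With both halves unit-normalized before concatenation, neither modality dominates the joint similarity simply by having a larger dynamic range.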