Computerized image processing comprises the automated extraction of meaningful information from images, typically from digital images using digital image processing techniques. Some instances of such image processing involve automated identification of salient objects in photographs or other pictures.
Saliency estimation is typically directed to identifying those parts of the scene in an image that are most important and/or informative. Such saliency detection or estimation can be applied to a number of vision problems, including but not limited to content-based image retrieval, image compression, image segmentation, and object recognition. For example, an online marketplace or an image-based search engine may process a large number of images of respective objects. Such images typically have a foreground object or salient feature that is the intended subject of the captured image. Automated recognition of the respective subjects of multiple digital images uploaded to such a marketplace can be complicated or frustrated by unreliable or inaccurate saliency detection. Saliency estimation can thus be a significant preprocessing step for background removal or object/product detection and recognition in large e-commerce applications.
Many different methods for saliency map estimation have been proposed. Most existing approaches can be categorized into unsupervised approaches (typically bottom-up) and supervised approaches (typically top-down, although more recent methods combine top-down and bottom-up elements). Unsupervised approaches identify salient features with reference only to the subject image, while supervised approaches employ material (such as a database of template images) external to the subject image.
While supervised approaches are able to integrate multiple features and in general achieve better performance than unsupervised methods, the necessary data collection and training process is expensive. Also, compared to traditional specialized object detectors (e.g., pedestrian detectors), where objects in the same class have a relatively large degree of visual consistency, salient objects can have vastly different visual appearances in applications where the salient object varies widely in type and nature, as is the case, for example, with product/item images in an online marketplace. Furthermore, the process of generating pixel-wise ground truth annotations is itself expensive and labor intensive, and may sometimes even be infeasible given the scale of modern massive long-tailed visual repositories. This is typically the case in large e-commerce scenarios.
The headings provided herein are merely for convenience and do not necessarily affect the scope or meaning of the terms used.