The exemplary embodiment relates to visual attention prediction and, in particular, to the prediction of a topographic visual saliency map for a given input image.
Visual attention has often been used in computer vision as a pre-processing step to focus subsequent processing on regions of interest in images. This has proved particularly useful as vision models and datasets increase in size. Saliency map prediction finds application in tasks such as automatic image cropping (Stentiford, F., “Attention based auto image cropping,” 5th Intl Conf. on Computer Vision Systems, pp. 1-9, 2007), content-aware image resizing (Achanta, R., et al., “Saliency detection for content-aware image resizing,” 16th IEEE Int'l Conf. on Image Processing (ICIP), pp. 1001-1004, 2009), image thumbnailing (Marchesotti, L., et al., “A framework for visual saliency detection with applications to image thumbnailing,” ICCV, pp. 2232-2239, 2009), object recognition (Gilani, S., et al., “PET: an eye-tracking dataset for animal-centric Pascal object classes,” 2015 IEEE Intl Conf. on Multimedia and Expo (ICME), pp. 1-6, 2015), and fine-grained, scene, and human action classification (Sharma, G., et al., “Discriminative spatial saliency for image classification,” CVPR, pp. 3506-3513, 2012).
Some traditional saliency detection methods have focused on designing models that explicitly model biological systems (Itti, L., et al., “A model of saliency-based visual attention for rapid scene analysis,” TPAMI, (11):1254-1259, 1998). Others have used data-driven approaches to learn patch-level classifiers that assign a local image patch a “saliency score,” using eye-fixation data to derive training labels (Kienzle, W., et al., “A Nonparametric Approach to Bottom-Up Visual Saliency,” NIPS, pp. 405-414, 2007; Judd, T., et al., “Learning to predict where humans look,” CVPR, pp. 2106-2113, 2009, hereinafter, “Judd 2009”). Hierarchical models have also been used to extract saliency maps, with model weights being learned in a supervised manner. Recently, neural network architectures developed for semantic annotation tasks such as categorization and object localization have been adapted for use as attentional models (Kümmerer, M., et al., “Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet,” ICLR Workshop, arXiv:1411.1045, pp. 1-12, 2015; and Pan, J., et al., “End-to-end Convolutional Network for Saliency Prediction,” Technical report, arXiv:1507.01422, pp. 1-6, 2015). This approach has benefitted from the availability of large visual attention datasets (Jiang, M., et al., “SALICON: Saliency in Context,” CVPR, pp. 1072-1080, 2015, hereinafter, “Jiang 2015”; and Xu, P., et al., “TurkerGaze: Crowdsourcing Saliency with Webcam based Eye Tracking,” Technical report, arXiv:1504.06755v1, pp. 1-9, 2015). These deep methods, however, have used loss functions that are better suited to semantic tasks, such as classification or regression losses.