Saliency estimation has become a valuable tool in image processing, wherein the image regions that attract the attention of a human observer are defined by a mask, which is referred to herein as a saliency map. But the automatic, computational identification of image elements that are likely to catch the attention of a human observer is a complex, cross-disciplinary problem. In order to obtain realistic, high-level models, insights from various fields such as neuroscience, biology, and computer vision need to be combined. Recent research, however, has shown that computational models simulating low-level, stimuli-driven attention are successful and represent useful tools in many application scenarios, including image segmentation, resizing, and object detection. However, existing approaches exhibit considerable variation in methodology, and it is often difficult to attribute improvements in result quality to specific algorithmic properties.
Perceptual research indicates that contrast is the most influential factor in low-level visual saliency. However, the definition of contrast in previous works is based on various types of image features, including color variation of individual pixels, edges and gradients, spatial frequencies, structure and distribution of image patches, histograms, multi-scale descriptors, or combinations thereof. The significance of each individual feature often remains unclear, and recent evaluations show that even quite similar approaches sometimes exhibit considerably varying performance.
Methods that model bottom-up, low-level saliency can be roughly classified into biologically inspired methods and computationally oriented approaches. Biological methods are generally based on an architecture whereby the low-level stage processes features such as color, orientation of edges, or direction of movement. One implementation of this model uses a difference of Gaussians approach to evaluate those features. However, the resulting saliency maps tend to be blurry and often overemphasize small, purely local features, which renders this approach less useful for applications such as segmentation, detection, and the like.
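By way of illustration only, a difference of Gaussians response of the kind mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of any particular prior art method: the function names, the grayscale list-of-lists image layout, the two sigma values, the truncation of the kernel at three sigma, and the clamped border handling are all hypothetical choices made for the example.

```python
import math

def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel truncated at 3 sigma (assumed)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(row, kernel):
    """Convolve one row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(x + k - radius, 0), n - 1)
            acc += w * row[idx]
        out.append(acc)
    return out

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2-D list of floats (rows, then columns)."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(r, kernel) for r in image]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dog_response(image, sigma_center=1.0, sigma_surround=4.0):
    """Absolute difference between a fine (center) and a coarse (surround) blur."""
    center = gaussian_blur(image, sigma_center)
    surround = gaussian_blur(image, sigma_surround)
    return [[abs(c - s) for c, s in zip(cr, sr)]
            for cr, sr in zip(center, surround)]

# Example: a single bright pixel on a dark background.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
response = dog_response(image)
```

In this example the isolated bright pixel receives the strongest response, and that response is spread over its neighborhood, which is consistent with the blurriness and the emphasis on small local features noted above.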
Computational methods (which may be inspired by biological principles), in contrast, have a strong relationship to typical applications in computer vision and graphics. For example, frequency space methods determine saliency based on the amplitude or phase spectrum of the Fourier transform of an image. Saliency maps resulting from computational processing preserve the high-level structure of an image but exhibit undesirable blurriness and tend to highlight object boundaries rather than the entire image area.
Colorspace techniques can be divided into approaches that use a local analysis and those that use a global analysis of (color-) contrast. Local methods estimate the saliency of a particular image region based on immediate image neighborhoods, for example, based on dissimilarities at the pixel level, using multi-scale Difference of Gaussians or histogram analysis. While such approaches are able to produce less blurry saliency maps, they are agnostic to global relations and structures, and they may also be more sensitive to high frequency content like image edges and noise. Global methods consider contrast relationships over the complete image. For example, different variants of patch-based methods estimate the dissimilarities between image patches. While these algorithms are more consistent in terms of global image structures, they suffer from the combinatorial complexity involved, and thus are applicable only to relatively low resolution images, or they need to operate in spaces of reduced dimensionality, resulting in loss of small, potentially salient detail.
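The combinatorial complexity of global patch-based methods can be illustrated with the following minimal sketch, which is offered as an assumption-laden example rather than as any published algorithm. The function names, the non-overlapping patch layout, the grayscale list-of-lists image, and the use of Euclidean distance between flattened patches are hypothetical choices; the returned comparison count is included only to make the quadratic growth visible.

```python
import math

def extract_patches(image, size):
    """Split a 2-D grayscale image into non-overlapping size x size patches."""
    h, w = len(image), len(image[0])
    patches = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append([image[y + dy][x + dx]
                            for dy in range(size) for dx in range(size)])
    return patches

def global_patch_contrast(image, size=2):
    """Score each patch by its mean distance to every other patch.

    Returns (scores, comparisons); comparisons grows as n * (n - 1)
    for n patches.
    """
    patches = extract_patches(image, size)
    n = len(patches)
    scores = []
    comparisons = 0
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i != j:
                total += math.dist(patches[i], patches[j])
                comparisons += 1
        scores.append(total / (n - 1) if n > 1 else 0.0)
    return scores, comparisons

# Example: a 4 x 4 image whose top-left 2 x 2 patch differs from the rest.
image = [[0.0] * 4 for _ in range(4)]
for y in range(2):
    for x in range(2):
        image[y][x] = 100.0
scores, comparisons = global_patch_contrast(image, size=2)
```

Here four patches already require twelve pairwise comparisons. Because the count grows quadratically with the number of patches, such an exhaustive comparison becomes costly at higher resolutions, as noted above.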
Another method that also works on a per-pixel basis achieves globally more consistent results by computing color dissimilarities to the mean image color. Such a technique utilizes Gaussian blur in order to decrease the influence of noise and high frequency patterns. However, this method does not account for any spatial relationships inside the image, and thus may highlight background regions as being salient.
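A per-pixel measure of this kind may be sketched as follows, purely as an illustration under stated assumptions. The function names are hypothetical, the image is assumed to be a list of rows of three-component color tuples, a 3 x 3 binomial kernel stands in for the Gaussian blur, and Euclidean distance is assumed as the color dissimilarity.

```python
import math

def smooth(channel):
    """Blur a 2-D channel with a 3x3 binomial kernel (a coarse Gaussian stand-in)."""
    h, w = len(channel), len(channel[0])
    weights = (1, 2, 1)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp at the borders
                    xx = min(max(x + dx, 0), w - 1)
                    acc += weights[dy + 1] * weights[dx + 1] * channel[yy][xx]
            out[y][x] = acc / 16.0
    return out

def mean_color_saliency(image):
    """Distance of each blurred pixel color to the mean color of the whole image."""
    h, w = len(image), len(image[0])
    n = float(h * w)
    mean = [sum(px[c] for row in image for px in row) / n for c in range(3)]
    blurred = [smooth([[px[c] for px in row] for row in image]) for c in range(3)]
    return [[math.sqrt(sum((blurred[c][y][x] - mean[c]) ** 2 for c in range(3)))
             for x in range(w)] for y in range(h)]

# Example: a uniform background with a 2 x 2 patch of a different color.
background, patch = (50.0, 0.0, 0.0), (50.0, 80.0, 60.0)
image = [[background] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (2, 3):
        image[y][x] = patch
saliency = mean_color_saliency(image)
```

In this example the patch scores higher than the background, but every background pixel also receives a nonzero score that depends only on its color and not on its position, which illustrates the lack of spatial reasoning noted above.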
Another technique combines multi-scale contrast, local contrast based on surrounding context, and color spatial distribution to learn a conditional random field (CRF) for binary saliency estimation. However, the significance of features in the CRF remains unclear. One global contrast-based approach that provides good performance generates three-dimensional (3-D) histograms and computes dissimilarities between histogram bins. However, this method has difficulty in handling images with cluttered and textured backgrounds.
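The general idea of computing dissimilarities between bins of a 3-D color histogram can be sketched as follows. This is a simplified illustration under stated assumptions and is not the method of the approach referred to above: the function name, the number of bins per channel, the assumed channel range of 0 to 255, and the use of the distance between bin indices weighted by bin frequency are all hypothetical choices.

```python
import math
from collections import Counter

def histogram_contrast_saliency(image, bins=4, max_value=255):
    """image: 2-D list of (c1, c2, c3) tuples with channels in [0, max_value].

    Each pixel receives the saliency of its histogram bin, computed as the
    frequency-weighted distance from that bin to every occupied bin.
    """
    step = (max_value + 1) / bins

    def bin_of(px):
        # Quantize each channel into one of `bins` intervals.
        return tuple(min(int(v // step), bins - 1) for v in px)

    labels = [[bin_of(px) for px in row] for row in image]
    counts = Counter(b for row in labels for b in row)
    total = float(sum(counts.values()))

    bin_saliency = {}
    for b in counts:
        bin_saliency[b] = sum((n / total) * math.dist(b, other)
                              for other, n in counts.items())
    return [[bin_saliency[b] for b in row] for row in labels]

# Example: a dark 4 x 4 image with one pixel of a rare color.
image = [[(10, 10, 10)] * 4 for _ in range(4)]
image[1][1] = (250, 10, 10)
saliency = histogram_contrast_saliency(image)
```

Because the comparison is made between occupied bins rather than between pixels, the cost depends on the number of distinct quantized colors. A cluttered or textured background populates many bins, which is one plausible reason such a measure may struggle in those cases.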
In view of the problems encountered when utilizing prior art approaches, the inventors recognized that it would be advantageous to develop a visual saliency estimation process characterized by the use of a reduced set of image measures to efficiently and quickly process image data to produce pixel-accurate saliency masks.