Predicting where people look when presented with a visual stimulus is difficult. Nevertheless, predicting (or determining) where people look is a necessary precursor to a variety of algorithms, such as automatic cropping, image enhancement, distractor attenuation, and measuring the effectiveness of advertising.
An observer's response to a visual stimulus can be decomposed into two parts: a low-level, stimulus-induced response that is task-independent, and a task-specific, cognitively controlled response employing higher-order features. The low-level aspect is referred to as saliency, while the high-level aspect is called visual attention.
Recent models for predicting saliency can broadly be divided into two classes: physiological and machine learning. The class distinction notwithstanding, the general framework of the known saliency prediction techniques is to determine a model and apply the model to an input image in order to predict the saliency of the input image. The model is typically image independent.
The physiological class of saliency prediction algorithms generally starts with a biologically plausible architecture where low-level features are combined in order to identify elements of an image that are likely to be salient. In some known methods of predicting saliency of an image, the image is analysed with respect to features including colour (e.g., red-green and blue-yellow hue lines), luminance, and texture orientation (e.g., using Gabor filters). The feature analysis is performed over three scales, to take into account foveated vision. The outcome of the feature extraction is then hierarchised according to a principle of excitation-inhibition. Once a particular part of the human visual system is excited, that particular part then becomes inhibited for a short period of time. The inhibition allows the human visual system to concentrate on different stimuli. While such a biologically plausible architecture is indeed modelled on the physiology of the human visual system, implementation of the architecture assumes that the distance measure used in the computation of saliency is unique and applicable to all stimuli.
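As an illustration only, the general shape of such a physiological pipeline can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular known method: the opponent-channel formulas, the Gaussian centre-surround difference, the three blur widths, and the function names (`centre_surround_saliency`, `fixations`) are all hypothetical choices made for the example, and the texture orientation (Gabor) channel is omitted for brevity.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D normalised Gaussian truncated at three standard deviations.
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(channel, sigma):
    # Separable Gaussian blur with edge padding; output has the input shape.
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    padded = np.pad(channel, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def opponent_channels(rgb):
    # Assumed simple definitions of luminance, red-green and blue-yellow.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return [(r + g + b) / 3.0, r - g, b - (r + g) / 2.0]

def centre_surround_saliency(rgb, sigmas=(1.0, 2.0, 4.0)):
    # Sum of |centre - surround| responses over channels and three scales.
    rgb = np.asarray(rgb, dtype=float)
    saliency = np.zeros(rgb.shape[:2])
    for channel in opponent_channels(rgb):
        for sigma in sigmas:
            centre = blur(channel, sigma)
            surround = blur(channel, 4.0 * sigma)  # assumed surround ratio
            saliency += np.abs(centre - surround)
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency

def fixations(saliency, count=3, inhibit_radius=5):
    # Excitation-inhibition: pick the strongest location, then suppress
    # a disc around it so that the next pick falls elsewhere.
    s = saliency.copy()
    yy, xx = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    points = []
    for _ in range(count):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        points.append((int(y), int(x)))
        s[(yy - y) ** 2 + (xx - x) ** 2 <= inhibit_radius ** 2] = 0.0
    return points

# Usage: a grey image containing one small red square.
image = np.full((32, 32, 3), 0.5)
image[10:14, 20:24] = (1.0, 0.0, 0.0)
saliency_map = centre_surround_saliency(image)
predicted = fixations(saliency_map)
```

Note that the same absolute difference is applied to every channel and every image, which is the kind of fixed distance measure the paragraph above identifies as a limitation.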
Variations on the above described physiological class of saliency determination algorithms have been proposed and can deliver accurate results when a single object is present over a background. Such algorithms can also deliver accurate results when the object is sufficiently distinct from its background in terms of Lab colour space values. However, such methods do not work well on more complex images.
A more recent method of deterministic saliency prediction recognises that the dimensionality of saliency prediction reaches beyond the simple features often employed. The method takes into account the influence on perceived saliency of known visual effects such as induction, also known as simultaneous colour contrast. In particular, the method effectively modifies a colour distance function and takes into account the influence of image content on saliency. The method is limited, however, by the extremely large number of “optical illusion” phenomena associated with human perception.
Another method of predicting saliency is to employ machine learning techniques to select and weight image features. The combination of image features is not chosen a priori, but rather inferred from a ground truth of human observations. In one such method, high-level and low-level features (e.g., face detection, colour, luminance, horizon line) are selected and, using eye tracking data over twelve thousand (12,000) images, the features and their optimal combination are classified with a linear support vector machine (SVM). In a slightly different method, the reliability of various (unspecified) image features over user-input data is measured with conditional random fields to produce a probabilistic saliency map.
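The idea of learning a feature weighting from observations can be illustrated with a small linear SVM trained by subgradient descent on the hinge loss. This is only a sketch under stated assumptions: the three feature columns, the synthetic "fixated" labels, the hyperparameters, and the function name `train_linear_svm` are hypothetical, and no real eye tracking data is used.

```python
import numpy as np

def train_linear_svm(features, labels, reg=0.01, epochs=200, lr=0.1):
    # Batch subgradient descent on the L2-regularised hinge loss.
    # labels must be +1 (fixated) or -1 (not fixated).
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = labels * (features @ w + b)
        active = margins < 1.0  # samples that violate the margin
        grad_w = reg * w - (labels[active, None] * features[active]).sum(axis=0) / n
        grad_b = -labels[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic stand-in for per-location features (standardised values).
# Hypothetical columns: colour contrast, luminance contrast, face score.
rng = np.random.default_rng(0)
features = rng.standard_normal((400, 3))
assumed_weights = np.array([2.0, 1.0, 3.0])  # made up for the example
labels = np.where(features @ assumed_weights > 0.0, 1.0, -1.0)

w, b = train_linear_svm(features, labels)
scores = features @ w + b  # higher score = predicted more salient
accuracy = float(np.mean(np.sign(scores) == labels))
```

The learned weight vector `w` plays the role of the inferred feature combination: it is fitted once to the training observations and then applied unchanged to every new image.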
While machine learning methods take into account the aspect of human observation, the calculations are performed on a large scale and aim for a single “silver bullet” formula or feature combination that best defines saliency. However, because of the large number of dimensions of the problem, such methods are unreliable over a variety of images.