Humans and other primates move their eyes to select visual information from a visual scene. These eye movements direct the fovea, which is the high-resolution part of the retina, onto relevant parts of the visual scene, thereby enabling processing of relevant visual information and interpretation of complex scenes in real time.
Commonalities between the fixation patterns of different individuals viewing synthetic or natural scenes allow computational models to be formed for predicting where people look in a visual scene (see references [1], [2], [3], and [4], incorporated herein by reference in their entireties). There are several models inspired by putative neural mechanisms for making such predictions (see reference [5], incorporated herein by reference in its entirety). One sensory-driven model of visual attention focuses on low-level features of a visual scene to evaluate salient areas of the visual scene. By way of example and not of limitation, low-level features generally refer to color, intensity, orientation, texture, motion, and depth present in the visual scene.
With origins in Feature Integration Theory by Treisman and Gelade in 1980 (see reference [45], incorporated herein by reference in its entirety) and a proposal by Koch and Ullman (see reference [38], incorporated herein by reference in its entirety) for a map in a primate visual system that encodes the extent to which any location in a given subject's field of view is conspicuous or salient, a series of increasingly refined models have been designed to predict where subjects will fixate their visual attention in synthetic or natural scenes (see references [1], [3], [4], [6], and [7], incorporated herein by reference in their entireties) based on bottom-up, task-independent factors (see references [31] and [47], incorporated herein by reference in their entireties).
As used here, a “task” refers to a visual task that a subject may have in mind while viewing a visual scene. For instance, the subject may be given an explicit task or may give himself/herself an arbitrary, self-imposed task to “search for a red ball” in a visual scene. By viewing the visual scene with the task in mind (in this example, searching for red balls in the visual scene), the subject's viewing of the visual scene will be dependent on the particular task.
Bottom-up, task-independent factors are those factors found in the visual scene that attract a subject's attention strongly and rapidly, independent of any task the subject is performing. These bottom-up factors depend on instantaneous sensory input and usually relate to properties of an input image or video.
Bottom-up factors are in contrast to top-down factors, which take into account internal states of the subject such as goals of the subject at the time of viewing and personal experiences of the subject. An example of a goal would be that provided in the previous example of searching for a red ball, where the subject views the visual scene with a particular emphasis (or goal) on finding a red ball. An example of a personal experience would be if, upon seeing one piece of furniture in the visual scene, the subject's personal experience tells the subject to search for other pieces of furniture because the visual scene is likely to be displaying an interior of a building.
As another example, a top-down factor can involve storing an image of a face of a specific person that a subject is searching for and correlating the image of the face across an entire visual scene. Correlating refers to matching a feature (the face from the image in this example) at different locations of the visual scene. A high correlation would correspond with a location where the feature is likely to be located.
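By way of illustration only, the correlation described above can be sketched with a normalized cross-correlation between a stored template and every location of a grayscale scene. This is a minimal sketch under stated assumptions, not a method taken from the cited references: the function names `correlate_template` and `best_match` are hypothetical, both inputs are assumed to be 2-D arrays, and the template is assumed to be no larger than the scene.

```python
import numpy as np


def correlate_template(scene, template):
    """Return a map of normalized correlation scores in [-1, 1].

    Hypothetical helper: entry (y, x) scores how well `template`
    matches the scene patch whose top-left corner is at (y, x).
    """
    scene = np.asarray(scene, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    sh, sw = scene.shape

    # Remove the mean so that overall brightness does not drive the score.
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())

    scores = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = scene[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            # A flat patch or flat template carries no pattern; score it as 0.
            scores[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return scores


def best_match(scores):
    """Return the (row, col) of the highest correlation score."""
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return int(row), int(col)
```

In this sketch, a score near 1 marks a location where the stored feature is likely to be present, which corresponds to the high correlation referred to above. The explicit double loop is chosen for readability rather than speed.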
In these prediction models (see references [1], [6], and [7]), information on low-level features of any visual scene (such as color, intensity, and orientation) is combined to produce feature maps through center-surround filtering across numerous spatial scales (see references [1] and [6]). These feature maps across numerous spatial scales are normalized and combined to create an overall saliency map of the visual scene, which predicts human fixations significantly above chance (see reference [7]).
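By way of example and not of limitation, the center-surround, normalize, and combine steps described above can be sketched for a single feature (intensity). The sketch below is a simplification under stated assumptions and is not the implementation of the cited references: it approximates center-surround filtering as the absolute difference between a narrow and a wide Gaussian blur, it uses simple min-max rescaling as the normalization, and the function names (`gaussian_blur`, `normalize`, `intensity_saliency`) and the scale pairs are hypothetical.

```python
import numpy as np


def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding (output has the input's shape)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="edge")
    # Blur along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def normalize(feature_map):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    span = feature_map.max() - feature_map.min()
    if span > 0:
        return (feature_map - feature_map.min()) / span
    return np.zeros_like(feature_map)


def intensity_saliency(image, scales=((1.0, 4.0), (2.0, 8.0))):
    """Combine center-surround intensity maps over several scales.

    Each entry of `scales` is an assumed (center_sigma, surround_sigma) pair.
    """
    image = np.asarray(image, dtype=float)
    maps = []
    for center_sigma, surround_sigma in scales:
        # Center-surround response: fine detail minus its coarser surround.
        response = np.abs(gaussian_blur(image, center_sigma)
                          - gaussian_blur(image, surround_sigma))
        maps.append(normalize(response))
    # Average the per-scale maps, then rescale the result to [0, 1].
    return normalize(np.mean(maps, axis=0))
```

A fuller model along the lines described above would compute such maps for color and orientation as well and combine all of them into the overall saliency map. The intensity-only version is kept here so that each step remains visible.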