Humans can analyze a scene very quickly and easily, effortlessly noticing objects, even those that the viewer has never seen before. A viewer may be looking for something in particular; this affects how attention is paid within the scene. Natural scenes that a person is likely to encounter on a day-to-day basis are often very complex, made more so by lighting conditions. People use their own built-in attention without a second thought. Computationally, however, paying attention to a scene and extracting locations or regions of high saliency poses a great challenge. A vision system must be able to determine which locations in a scene draw the most attention, and then segment the attended object so that it can be identified or interpreted.
A number of researchers have shown interest in systems that compute the saliency of a scene. For example, feature-based attention works at the pixel level and computes attention based on the saliency of a specific location within the scene. The attention work of Itti and Koch (2000) is probably the best-known algorithm that employs such an approach; it computes attention by constructing a saliency map from a set of biologically inspired features extracted from the image. See L. Itti and C. Koch, A saliency-based search mechanism for overt and covert shifts of visual attention, Vision Research, 40: 1489-1506, 2000.
The work of Itti and Koch (2000) has been modified to incorporate top-down biasing of attention in the work of Navalpakkam and Itti (2005 and 2006). See V. Navalpakkam and L. Itti, Modeling the Influence of Task on Attention, Vision Research, 45: 205-231, 2005; and V. Navalpakkam and L. Itti, An integrated model of top-down and bottom-up attention for optimal object detection, In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1-7, 2006. The algorithm described by Navalpakkam and Itti (2005 and 2006) decomposes the image into a set of Gaussian pyramids corresponding to color, intensity, and orientation at a series of scales, which are combined across scales and merged into the saliency map. The system attends to the point that corresponds to the maximum value in the saliency map, applies inhibition of return to that location, and shifts to the next most salient point. This process continues until the program has attended to a maximum number of locations, or the user terminates the program. The most significant problem with this method is its inefficiency; it must process the entire image before returning a saliency map or salient locations. Other feature-based saliency methods use a similar approach, but may differ in the types of features or the number of levels of Gaussian pyramids used in the algorithm.
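The attend-and-inhibit loop described above can be illustrated with a minimal sketch. It assumes a saliency map has already been computed and is supplied as a 2-D array; the function name `attend`, the parameters `max_locations` and `inhibit_radius`, and the disk-shaped suppression region are hypothetical choices made for illustration and are not taken from the cited publications.

```python
import numpy as np

def attend(saliency, max_locations=3, inhibit_radius=2):
    """Return (row, col) fixations in order of decreasing saliency.

    Hypothetical helper: assumes `saliency` is a precomputed 2-D map.
    """
    s = np.asarray(saliency, dtype=float).copy()
    rows, cols = np.indices(s.shape)
    fixations = []
    for _ in range(max_locations):
        # Attend to the location holding the current maximum value.
        r, c = np.unravel_index(np.argmax(s), s.shape)
        if not np.isfinite(s[r, c]):
            break  # every location has already been inhibited
        fixations.append((int(r), int(c)))
        # Inhibition of return (assumed here to be a hard disk-shaped mask):
        # suppress the neighborhood so attention shifts elsewhere.
        disk = (rows - r) ** 2 + (cols - c) ** 2 <= inhibit_radius ** 2
        s[disk] = -np.inf
    return fixations

# Usage: two nearby peaks and one distant peak.
demo = np.zeros((10, 10))
demo[2, 3] = 5.0
demo[2, 4] = 4.0   # inside the inhibited disk around (2, 3)
demo[7, 7] = 3.0
print(attend(demo, max_locations=2))
```

Note that the loop cannot begin until the complete saliency map exists, which is the inefficiency identified above.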
Attempts to parallelize the above saliency map computation for the image have been restricted to computing different features on different processors and then combining them at the end. Therefore, with or without parallelization, the entire image needs to be processed before a saliency map is available. Thus, if an application needs salient regions quickly, the above methods are unsuitable. A simplistic approach of computing saliency on parts of the image and just tiling them together will not work because the resulting maps are local saliency maps that do not reflect the global saliency map.
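The tiling problem can be seen in a small numerical sketch. The saliency measure used here (absolute deviation from the mean, rescaled to the range 0 to 1) is a toy stand-in chosen only for illustration and is not the feature-based computation of the cited methods; the names `toy_saliency` and `tiled_saliency` are likewise hypothetical.

```python
import numpy as np

def toy_saliency(img):
    """Toy stand-in for saliency: |pixel - mean|, rescaled to [0, 1]."""
    d = np.abs(img - img.mean())
    peak = d.max()
    return d / peak if peak > 0 else d

def tiled_saliency(img, tile):
    """Compute the toy saliency independently per tile, then stitch."""
    out = np.zeros(img.shape, dtype=float)
    for r in range(0, img.shape[0], tile):
        for c in range(0, img.shape[1], tile):
            out[r:r + tile, c:c + tile] = toy_saliency(img[r:r + tile, c:c + tile])
    return out

img = np.zeros((8, 8))
img[1, 1] = 10.0  # strong feature in the top-left tile
img[5, 5] = 1.0   # faint feature in the bottom-right tile

global_map = toy_saliency(img)
tiled_map = tiled_saliency(img, tile=4)

# In the tiled map the faint feature ties with the strong one, because each
# tile is normalized against its own contents only.
print(global_map[1, 1], global_map[5, 5])
print(tiled_map[1, 1], tiled_map[5, 5])
```

In the global map the faint feature scores far below the strong one, whereas in the tiled map both reach the maximum, so the stitched result misranks the regions.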
In the publications by Draper and Lionelle (2003) and Orabona et al. (2005), the researchers have described the creation of object-based saliency (or attention) algorithms. See B. Draper and A. Lionelle, Evaluation of Selective Attention under Similarity Transforms, In Workshop on Performance and Attention in Computer Vision, Graz, Austria, April 2003; and F. Orabona, G. Metta, and G. Sandini, Object-based Visual Attention: A Model for a Behaving Robot, In 3rd International Workshop on Attention and Performance in Computational Vision (in CVPR 2005), San Diego, Calif., June 2005. Such systems are computationally expensive and must process the entire image before a saliency map can be generated.
An alternative to processing the entire image would be to develop parallel versions of the algorithms described above. For example, one way to parallelize the algorithms would be to compute different features on different processors and then combine the features. Such a process would have the same limitation as the feature-based methods, since the entire image must still be processed, and would not yield a truly parallel or recursive saliency method.
Attempts to parallelize the saliency map computation for the image have been restricted to computing different features on different processors and then combining them at the end. With or without parallelization, the entire image needs to be analyzed, and only then is a saliency map available. Thus, previous methods of finding salient or interesting regions have two main shortcomings: (1) they need to process the entire image before the saliency map can be output, and (2) they are very slow for large images.
Therefore, a continuing need exists for a system that provides a fast method for finding interesting regions in large-sized imagery and video without the need to process the entire image before obtaining results.