The exemplary embodiment relates to digital image processing. It finds particular application in connection with detection of salient regions and image thumbnailing in natural images based on visual similarity.
Image thumbnailing consists of the identification of one or more regions of interest in an input image: for example, salient parts are aggregated in foreground regions, whereas redundant and non-informative pixels become part of the background. The range of applications for thumbnailing is broad, including traditional problems like image compression, image visualization, and adaptive image display on small devices, but also more recent applications like variable data printing, assisted content creation, automatic blogging, and the like.
Image thumbnailing is strongly related to the detection of salient regions. Saliency detection is seen as a simulation or modeling of the human visual attention mechanism. In the field of image processing, it is understood that some parts of an image receive more attention from human observers than others. Saliency refers to the “importance” or “attractiveness” of the visual information in an image. A salient region may describe any relevant part of an image that is a main focus of a typical viewer's attention. Visual saliency models have been used for feature detection and to estimate regions of interest. Many of these methods are based on biological vision models, which aim to estimate which parts of images attract visual attention. Implementations of these methods in computer systems generally fall into one of two main categories: those that give a number of relevant point locations, known as interest (or key-point) detectors, and those that give a more continuous map of relevance, such as saliency maps. Saliency maps can provide richer information about the relevance of features throughout an image. While interest point detectors are generally simplistic corner (Harris) or blob (Laplace) detectors, saliency maps can carry higher level information. Such methods have been designed to model visual attention and have been evaluated by their congruence with fixation data obtained from experiments with eye gaze trackers.
Recently, saliency maps have been used for object recognition, image categorization, automated image cropping, adaptive image display, and the like. For example, saliency maps have been used to control the sampling density for feature extraction. Alternatively, saliency maps can be used as foreground detection methods to provide regions of interest (ROI) for classification. It has been shown that extracting image features in the locality of ROIs can give better results than sampling features uniformly through the image. A disadvantage is that such methods may miss important context information from the background.
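By way of illustration only, the use of a saliency map to control the sampling density for feature extraction can be sketched as follows. The function name `sample_points_by_saliency`, its parameters, and the choice of sampling with replacement are assumptions made for this sketch and are not taken from any of the cited methods.

```python
import numpy as np

def sample_points_by_saliency(saliency, n_points, seed=0):
    """Draw (row, col) locations with probability proportional to saliency.

    Hypothetical helper: `saliency` is a 2-D array of non-negative relevance
    values; more salient pixels are sampled more densely.
    """
    saliency = np.asarray(saliency, dtype=float)
    if saliency.ndim != 2:
        raise ValueError("saliency must be a 2-D array")

    # Negative values carry no meaning as sampling weights, so clip them.
    weights = np.clip(saliency, 0.0, None).ravel()
    total = weights.sum()
    if total > 0:
        probs = weights / total
    else:
        # Assumption: with an all-zero map, fall back to uniform sampling.
        probs = np.full(weights.size, 1.0 / weights.size)

    rng = np.random.default_rng(seed)
    flat = rng.choice(weights.size, size=n_points, replace=True, p=probs)
    rows, cols = np.unravel_index(flat, saliency.shape)
    return np.stack([rows, cols], axis=1)
```

Features would then be extracted at the returned locations. Mixing in a small uniform component is one possible way to retain some of the background context that pure ROI-based sampling may miss.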
A distinction can be made between a type of saliency detection which aims to detect the most interesting object in an image, irrespective of context (context independent saliency detection), and a concept type of saliency detection in which a specific type of object is searched for in the image.
The typical context independent case is often solved by bottom-up methods which seek to detect the most interesting part of the image, without targeting any specific object or concept. Concept type saliency detection is often referred to as top-down saliency detection.
Visual saliency and attention have been modelled with three categories of approaches inspired by the human visual system. Bottom-up, stimulus-driven methods are based on intrinsic low-level features such as contrast, color, orientation, and the like. Top-down methods take into account higher order information (context, structure) about the image in the analysis. Hybrid approaches aim to leverage the benefits of the other two categories.
Bottom-up strategies are by far the most common and they are advantageous if the low level features represent the salient parts of the image well (e.g., isolated objects, uncluttered background). Top-down methods help when other factors dominate (e.g., the presence of a human face), but they are lacking in generality. Hybrid approaches, in general, are designed in a two stage fashion where top-down strategies filter out noisy regions in bottom-up saliency maps.
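The two stage design of hybrid approaches can be illustrated with a minimal sketch in which the output of a top-down detector is used to suppress noisy regions of a bottom-up saliency map. The function `filter_saliency_with_detections`, the box format, and the `outside_weight` parameter are hypothetical and are introduced here only for illustration; the cited hybrid methods may combine the two stages differently.

```python
import numpy as np

def filter_saliency_with_detections(saliency, boxes, outside_weight=0.2):
    """Down-weight bottom-up saliency outside top-down detection boxes.

    `boxes` is an iterable of (top, left, bottom, right) pixel ranges,
    half-open as in NumPy slicing, e.g. from a face detector (assumed input).
    """
    saliency = np.asarray(saliency, dtype=float)
    boxes = list(boxes)
    if not boxes:
        # Assumption: with no detections, keep the bottom-up map unchanged.
        return saliency.copy()

    # Stage 2: full weight inside detections, reduced weight elsewhere.
    weight = np.full(saliency.shape, outside_weight, dtype=float)
    for top, left, bottom, right in boxes:
        weight[top:bottom, left:right] = 1.0
    return saliency * weight
```

A soft weight rather than a hard mask is used in this sketch so that salient regions missed by the detector are attenuated rather than discarded.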
One example of a bottom-up method is described in L. Itti, C. Koch, E. Niebur, et al., “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259 (1998). In this approach, multi-scale topographic features characterizing color, intensity and texture are extracted and combined with “center-surround” operations to obtain saliency maps. Another method is described in Xiaodi Hou and Liqing Zhang, “Saliency Detection: A Spectral Residual Approach,” IEEE Conf. on Computer Vision & Pattern Recognition (2007). The method is based on the spectral residual of an image in the spectral domain, and locates salient regions by taking into account the “noise” in the logarithmic magnitude frequency curve of the image.
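A spectral residual computation of this kind can be sketched as follows. This is a simplified illustration rather than the reference implementation: the helper names, the 3×3 averaging window, and the use of a box filter in place of the final Gaussian smoothing are assumptions of the sketch, and the downscaling of the input image that is commonly applied beforehand is omitted.

```python
import numpy as np

def box_filter(a, size=3):
    """Local mean over a size x size window (odd size), with edge padding."""
    a = np.asarray(a, dtype=float)
    pad = size // 2
    padded = np.pad(a, pad, mode="edge")
    out = np.zeros_like(a)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (size * size)

def spectral_residual_saliency(gray, avg_size=3, smooth_size=5):
    """Saliency map in [0, 1] from the spectral residual of a grayscale image."""
    gray = np.asarray(gray, dtype=float)
    spectrum = np.fft.fft2(gray)

    # Log-magnitude spectrum and phase; the epsilon avoids log(0).
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)

    # The residual is what remains after removing the locally averaged
    # (smooth, "expected") part of the log-magnitude spectrum.
    residual = log_amplitude - box_filter(log_amplitude, avg_size)

    # Back to the image domain, keeping the original phase.
    recon = np.fft.ifft2(np.exp(residual + 1j * phase))
    saliency = box_filter(np.abs(recon) ** 2, smooth_size)

    # Normalize to [0, 1]; a flat map is returned as all zeros.
    span = saliency.max() - saliency.min()
    if span == 0:
        return np.zeros_like(saliency)
    return (saliency - saliency.min()) / span
```

Thresholding the resulting map is one simple way to obtain candidate salient regions from it.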
Gao, et al. reformulated the “center-surround” hypothesis in a decision theoretic framework (see, D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” Proceedings of IEEE Int'l Conf. on Computer Vision (ICCV), Rio de Janeiro, Brazil (2007); D. Gao, V. Mahadevan and N. Vasconcelos, “The discriminant center-surround hypothesis for bottom-up saliency,” Proc. of Neural Information Processing Systems (NIPS), Vancouver, Canada (2007)). Saliency detection is interpreted as a binary classification problem where saliency is identified with features that discriminate “center” and “surround” regions well.
Top-down visual attention processes are considered to be driven by voluntary control, and related to the observer's goal when analyzing a scene. These methods take into account higher order information about the image such as context, structure, etc. Object detection can be seen as a particular case of top-down saliency detection, where the predefined task is given by the object class to be detected (See, Jiebo Luo, “Subject content-based intelligent cropping of digital photos,” in IEEE Intl. Conf. on Multimedia and Expo (2007)).
In an additional example of a top-down approach, the system first classifies the image in terms of landscape, close-up, faces, etc., and then applies the most appropriate thumbnailing/cropping strategy (See, G. Ciocca, C. Cusano, F. Gasparini, and R. Schettini, “Self-adaptive image cropping for small display,” in IEEE Intl. Conf. on Consumer Electronics (2007)).
Recent hybrid approaches combine bottom-up with classic top-down object detection strategies. One approach blends the Viola-Jones face detector (Jones, M. J., Rehg, J. M., “Statistical Color Models with Application to Skin Detection,” IJCV(46), No. 1, pp. 81-96 (January 2002)) with the classic Itti approach (See, L. Itti and C. Koch, “Computational Modeling of Visual Attention,” Nature Reviews Neuroscience, 2(3): 194-203 (2001), hereinafter “Itti and Koch 2001”). In a similar fashion, Huang, et al. combine their saliency map based on color, shape, and texture with face and text detectors and use a branch and bound algorithm to find optimal solutions efficiently (See, Chen-Hsiu Huang, Chih-Hao Shen, Chun-Hsiang Huang and Ja-Ling Wu, “A MPEG-7 Based Content-aware Album System for Consumer Photographs,” Bulletin of the College of Engineering, NTU, No. 90, pp. 3-24 (February 2004)).
Recent approaches suggest that saliency can be learned, either using global features or sufficient manually labelled examples (See, T. Liu, J. Sun, N. Zheng, X. Tang and H. Shum, “Learning to Detect A Salient Object,” CVPR (2007), hereinafter “Liu, et al.”), or directly from human eye movement data through a simple parameter-free approach.
In contrast, Z. Wang and B. Li, “A Two-Stage Approach to Saliency Detection in Images,” in IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) (March/April 2008), combine spectral residual for bottom-up analysis with features capturing similarity and continuity based on Gestalt principles.
Above-mentioned U.S. patent application Ser. No. 12/250,248 detects regions of interest (ROIs) by a learning approach. The method uses the information related to the position and the size of the manually selected ROIs. Above-mentioned U.S. application Ser. No. 12/033,434 also proposes a method for detecting salient parts of an image, but the approach is heavily dependent on the semantic context in which either the image or its thumbnail is used. A visual concept is derived from each image and the ROI that corresponds to that visual concept is sought. Therefore, an image can lead to completely different thumbnails, depending on the context.