With the rapid advancement of digital camera and mobile imaging technologies, we have witnessed a phenomenal increase in both professional and amateur photographs over the past decade. Large-scale social media platforms, e.g., Flickr, Instagram, and Pinterest, further enable their users to share photos with people all around the world. As millions of new photos are added to the Internet daily, content-based image retrieval has remained a focus of the multimedia research community. Nevertheless, most existing systems rely on low-level features (e.g., color, texture, shape) or semantic information (e.g., object classes, attributes, events); similarity in visual composition has not been adequately exploited in retrieval [6, 17, 7, 36].
In photography, composition is the art of positioning and organizing objects and visual elements (e.g., color, texture, shape, tone, motion, depth) within a photo. Principles of organization include balance, contrast, Gestalt perception and unity, geometry, rhythm, perspective, illumination, and viewing path. Automated understanding of photo composition has been shown to benefit a number of applications, such as summarization of photo collections [26] and assessment of image aesthetics [27]. It can also be used to render feedback to the photographer on photo aesthetics [37, 36] and to suggest improvements to the image composition through image retargeting [22, 4]. In the literature, most work on image composition understanding has focused on design rules such as simplicity of the scene, visual balance, the golden ratio, the rule of thirds, and the use of diagonal lines. These rules are mainly concerned with the 2D rendering of objects or the division of the image frame; they are by no means exhaustive in capturing the wide variations in photographic composition.
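To illustrate how such a 2D design rule can be quantified, consider a minimal rule-of-thirds score. This is a hypothetical sketch, not a metric from any of the cited works: it rewards a salient subject for sitting close to one of the four intersections of the thirds lines, normalizing the distance by the image diagonal.

```python
import math

def rule_of_thirds_score(subject_xy, image_wh):
    # Hypothetical score: 1.0 when the subject lies exactly on a thirds
    # intersection, decreasing linearly with distance to the nearest one
    # (normalized by the image diagonal).
    (x, y), (w, h) = subject_xy, image_wh
    intersections = [(w * i / 3, h * j / 3) for i in (1, 2) for j in (1, 2)]
    d = min(math.hypot(x - px, y - py) for px, py in intersections)
    return 1.0 - d / math.hypot(w, h)
```

For a 300x300 image, a subject at (100, 100) lies on a thirds intersection and scores 1.0, while a centered subject scores lower; real systems would derive the subject location from a saliency map rather than take it as input.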
Standard composition rules such as the rule of thirds, the golden ratio, and low depth of field played an important role in early work on image aesthetics assessment [5, 23]. Obrador et al. [27] later showed that, using composition features alone, one can achieve image aesthetic classification results comparable to the state of the art. More recently, these rules have also been used to predict high-level attributes for image interestingness classification [8], to recommend suitable positions and poses in the scene for portrait photography [37], and to develop both automatic and interactive cropping and retargeting tools for image enhancement [22, 4]. In addition, Yao et al. [36] proposed a composition-sensitive image retrieval method that classifies images into horizontal, vertical, diagonal, textured, and centered categories, and uses the classification result to retrieve exemplar images with composition and visual characteristics similar to the query image. However, as mentioned above, these features and categories all concern the 2D rendering of the scene; the 3D impression is not taken into account.
Meanwhile, various methods have been proposed to extract 3D scene structure from a single image. The GIST descriptor [28] is among the first attempts to characterize the global arrangement of geometric structures using simple image features such as color, texture, and gradients. Following this seminal work, a large number of supervised machine learning methods have been developed to infer approximate 3D structures or depth maps from an image using carefully designed models [15, 16, 10, 32, 25] or grammars [11, 12]. In addition, models tailored to specific scenarios have been studied, such as indoor scenes [20, 13, 14] and urban scenes [3]. However, these works all make strong assumptions about the structure of the scene, so the types of scenes they can handle in practice are limited. Despite these efforts, obtaining a good estimate of perspective in an arbitrary image remains an open problem.
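To make the idea of a global layout descriptor concrete, the toy sketch below is in the spirit of GIST but is not the actual descriptor (real GIST pools the responses of Gabor filters at multiple scales and orientations over a spatial grid). Here we simply average gradient energy per orientation bin in each cell of a coarse grid; all function and parameter names are illustrative.

```python
import math

def gist_like_descriptor(img, grid=4, bins=4):
    # img: 2D list of grayscale values. Returns a grid*grid*bins vector of
    # average gradient magnitude per orientation bin in each spatial cell,
    # a crude stand-in for the Gabor-filter energies used by real GIST.
    h, w = len(img), len(img[0])
    desc = [0.0] * (grid * grid * bins)
    counts = [0] * (grid * grid)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % math.pi   # orientation in [0, pi)
            b = min(int(ang / math.pi * bins), bins - 1)
            cell = (y * grid // h) * grid + (x * grid // w)
            desc[cell * bins + b] += mag
            counts[cell] += 1
    return [v / counts[i // bins] if counts[i // bins] else 0.0
            for i, v in enumerate(desc)]
```

On an image whose intensity increases left to right, all gradient energy falls into the horizontal-gradient bin of every cell, so the descriptor directly exposes the dominant orientation of the scene layout.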
Typical vanishing point detection algorithms are based on clustering edges in the image according to their orientations. Kosecka and Zhang proposed an Expectation-Maximization (EM) approach that iteratively estimates the vanishing points and updates the membership of each edge [19]. More recently, a non-iterative method was developed to simultaneously detect multiple vanishing points in an image [34]. These methods assume that a large number of line segments are available for each cluster. To reduce the uncertainty in the detection results, a unified framework has been proposed to jointly optimize the detected line segments and vanishing points [35]. For images of scenes that lack clear line segments or boundaries, such as unstructured roads, texture orientation cues from all pixels are aggregated to detect the vanishing point [30, 18]. However, it is unclear how these methods can be extended to general images.
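The core estimation step shared by these edge-clustering methods can be illustrated with a minimal sketch: assuming line segments have already been assigned to one cluster, the vanishing point is the least-squares intersection of the corresponding lines, i.e., the point minimizing the sum of squared point-to-line distances. This is a simplified stand-in for the cited algorithms, and the function names are illustrative.

```python
import math

def line_coeffs(p, q):
    # Unit-normalized line a*x + b*y + c = 0 through points p and q,
    # so that |a*x + b*y + c| is the point-to-line distance.
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    n = math.hypot(a, b)
    a, b = a / n, b / n
    return a, b, -(a * x1 + b * y1)

def vanishing_point(segments):
    # Solve the 2x2 normal equations of min_{x,y} sum_i (a_i*x + b_i*y + c_i)^2
    # over all lines supporting the given segments.
    Saa = Sab = Sbb = Sac = Sbc = 0.0
    for p, q in segments:
        a, b, c = line_coeffs(p, q)
        Saa += a * a; Sab += a * b; Sbb += b * b
        Sac += a * c; Sbc += b * c
    det = Saa * Sbb - Sab * Sab
    x = (-Sac * Sbb + Sab * Sbc) / det
    y = (-Sbc * Saa + Sab * Sac) / det
    return x, y
```

In the EM approach of [19], this least-squares step would play the role of the M-step for one cluster, while the E-step reassigns edges to vanishing points; the sketch omits the reassignment loop and any robustness to outlier segments.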
Image segmentation algorithms commonly operate on low-level image features such as color, edges, texture, and the position of patches [33, 9, 21, 2, 24]. However, it was shown in [31] that, given an image, other images sharing a similar spatial composition can help with the unsupervised segmentation task.
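As a minimal illustration of segmentation on low-level features, the sketch below clusters pixels by intensity and position with k-means. This is a deliberate simplification of the cited methods, which use richer features and more sophisticated grouping; the deterministic farthest-point initialization is one simple choice, not taken from any of the cited works.

```python
def _sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_segment(pixels, k, iters=10):
    # pixels: list of per-pixel feature vectors, e.g. (intensity, col, row)
    # or (r, g, b, col, row). Returns a cluster label per pixel.
    # Deterministic farthest-point initialization of the k centers.
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(max(pixels,
                           key=lambda p: min(_sqdist(p, c) for c in centers)))
    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        for i, p in enumerate(pixels):
            labels[i] = min(range(k), key=lambda j: _sqdist(p, centers[j]))
        # Recompute each center as the mean of its assigned pixels.
        for j in range(k):
            members = [p for i, p in enumerate(pixels) if labels[i] == j]
            if members:
                centers[j] = tuple(sum(vs) / len(members)
                                   for vs in zip(*members))
    return labels
```

Including pixel coordinates in the feature vector encourages spatially coherent regions, which is the same intuition behind the position features used in the patch-based methods above.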