The exemplary embodiment relates to the evaluation of aesthetic quality in images and finds particular application in connection with a system and method for learning classifiers for visual attributes of images that relate to an overall assessment of image quality and for using the trained classifiers for visual attribute-based querying.
To assist in the processing of images and videos, computer vision techniques, such as semantic recognition, have been developed for identifying the visual content of an image. These techniques are able to predict whether an image is of a dog or a cat, for example. However, predicting whether an image will be perceived as visually appealing to people is more challenging, and people themselves are often unable to pinpoint why a particular image is attractive or unattractive. Some attempts have, however, been made to evaluate aesthetic qualities of images by computer-implemented methods. See, for example, R. Datta, et al., “Studying aesthetics in photographic images using a computational approach,” ECCV 2006; Y. Ke, et al., “The design of high-level features for photo quality assessment,” CVPR, 2006; R. Datta, et al., “Learning the consensus on visual quality for next-generation image management,” ACM-MM 2007; Y. Luo, et al., “Photo and video quality evaluation: Focusing on the subject,” ECCV 2008, pp. 386-399, hereinafter, “Luo 2008”; R. Datta, et al., “Algorithmic inferencing of aesthetics and emotion in natural images: An exposition,” ICIP 2008; S. Dhar, et al., “High level describable attributes for predicting aesthetics and interestingness,” CVPR 2011, hereinafter, “Dhar 2011”; L. Marchesotti, et al., “Assessing the aesthetic quality of photographs using generic image descriptors,” ICCV 2011, pp. 1784-1791, hereinafter, “Marchesotti 2011”; and N. Murray, et al., “Ava: A large-scale database for aesthetic visual analysis,” CVPR 2012, hereinafter, “Murray 2012”.
Some aesthetic prediction methods have proposed to mimic the best practices of professional photographers. A general approach has been to select rules (e.g., “contains opposing colors”) from photographic resources, such as the book by Kodak, entitled “How to take good pictures: a photo guide,” Random House Inc., 1982, and then to design, for each rule, a visual feature (e.g., a color histogram) from which the image's compliance with that rule can be predicted. More recently, attempts have focused on adding new photographic rules and on improving the visual features of existing rules. See, Luo 2008; Dhar 2011. Dhar 2011 suggests that the rules can be understood as visual attributes, i.e., medium-level descriptions whose purpose is to bridge the gap between the high-level aesthetic concepts to be recognized and the low-level pixels. See, also, V. Ferrari, et al., “Learning visual attributes,” NIPS 2007; C. H. Lampert, et al., “Learning to detect unseen object classes by between-class attribute transfer,” CVPR 2009, pp. 951-958; and A. Farhadi, et al., “Describing objects by their attributes,” CVPR 2009.
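By way of illustration only, a hand-designed feature of the kind mentioned above, a color histogram, can be sketched as follows. This is a minimal sketch and is not the feature used in Luo 2008 or Dhar 2011; the function name `color_histogram`, the `bins_per_channel` parameter, and the representation of an image as a sequence of (R, G, B) tuples are assumptions introduced for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical pixel representation: (R, G, B), each channel in 0..255.
Pixel = Tuple[int, int, int]


def color_histogram(pixels: Iterable[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return an L1-normalized joint RGB histogram as a flat list."""
    if not 1 <= bins_per_channel <= 256:
        raise ValueError("bins_per_channel must be in 1..256")
    counts = [0] * (bins_per_channel ** 3)
    total = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        # Flatten the three bin indices into one position in the histogram.
        counts[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
        total += 1
    if total == 0:
        # An empty image yields an all-zero histogram rather than an error.
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Usage: four pixels, two coarse bins per channel (8 bins in total).
example = color_histogram([(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)], 2)
```

In a rule-based method, a vector such as `example` would then be passed to a separately trained predictor for the rule in question; that predictor is outside the scope of this sketch.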
However, there are several issues with such an approach to aesthetic prediction. First, the manual selection of attributes from a photographic guide is not exhaustive and gives no indication of how often, or under what circumstances, such rules are used. Second, manually designed visual features only imperfectly model the corresponding rules. As an alternative to rules and hand-designed features, it has been proposed to rely on generic features. See, Marchesotti 2011. Such generic features include the GIST (described in A. Oliva, et al., “Modeling the shape of the scene: a holistic representation of the spatial envelope,” IJCV 42(3), 145-175, 2001), the bag-of-visual-words (BOV) (see, G. Csurka, et al., “Visual categorization with bags of keypoints,” Workshop on statistical learning in computer vision, ECCV, 2004) and the Fisher vector (FV) (see, F. Perronnin, et al., “Improving the Fisher kernel for large-scale image classification,” ECCV 2010, pp. 143-156 (hereinafter, “Perronnin 2010”); and Marchesotti 2011).
While it has been shown experimentally that such an approach can lead to improved results with respect to hand-designed attribute techniques, one shortcoming is that the results lack interpretability. In other words, while it is possible to say that an image has a high or low aesthetic value, it is not possible to tell why. It would be advantageous to be able to preserve the advantages of generic features for predicting aesthetic quality while also providing interpretable results.