The exemplary embodiment relates to semantic classification of images. It finds particular application in connection with the assignment of object classes to pixels or regions of an image, and will be described with particular reference thereto.
Automated techniques have been developed for image classification. These techniques rely on classifiers which have generally been trained on a set of manually labeled training images. A new image can then be labeled as having a probability that it contains a certain type of object, such as sky, a person, a face, a car, a flower, an animal, a building, or the like. The labels can be used for determining appropriate further processing of the image, such as suitable image enhancements in an automated image processing system. Alternatively, the labels can be used for archiving images or in retrieval systems, for example, to provide responsive images to a user's search query, such as a search for pictures of people.
In general, such image classification techniques do not attempt to locate the objects within an image. Such location information would be useful for a variety of applications, such as image cropping, content-based local image enhancement or rendering, insertion techniques which involve selecting a part of one image to be incorporated into the same or another image, and the like. Currently, localization of objects in images relies on grouping pixels into homogeneous regions, based on low-level information, such as the color or texture of pixels. Thus, for example, ‘sky’ may be inferred to be localized in a patch of uniform blue pixels. For many objects, however, such localization techniques tend to be unreliable.
Other approaches have been attempted for recognition and localization of objects. For example, in the method of Leibe, et al., image patches are extracted and matched to a set of codewords learned during a training phase (B. Leibe, A. Leonardis, and B. Schiele, ‘Combined object categorization and segmentation with an implicit shape model,’ in ECCV Workshop on Statistical Learning for Computer Vision, 2004). Each activated codeword then votes for possible positions of the object center. Others have proposed to combine low-level segmentation with high-level representations. Borenstein, et al., for example, compute a pixel probability map using a fragment-based approach and a multi-scale segmentation (E. Borenstein, E. Sharon, and S. Ullman, ‘Combining top-down and bottom-up segmentation,’ in CVPR, 2004). The pixel labeling takes into account the fact that pixels within homogeneous regions are likely to be segmented together. Russell, et al. and Yang, et al. perform normalized cuts and mean-shift segmentation, respectively, and compute bags-of-keypoints at the region level (B. Russell, A. Efros, J. Sivic, W. Freeman, and A. Zisserman, ‘Using multiple segmentations to discover objects and their extent in image collections,’ in CVPR, 2006; L. Yang, P. Meer, and D. J. Foran, ‘Multiple class segmentation using a unified framework over mean-shift patches,’ in CVPR, 2007). Cao, et al. use Latent Dirichlet Allocation (LDA) at the region level to perform segmentation and classification and force the pixels within a homogeneous region to share the same latent topic (L. Cao and L. Fei-Fei, ‘Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes,’ in ICCV, 2007). Others rely on low-level cues to improve the semantic segmentation without the need to perform explicit low-level segmentation. The different cues are generally incorporated in a random field model, such as a Markov random field (MRF).
Because local interactions alone are insufficient to generate satisfactory results, global supervision is incorporated in the MRF. In the LOCUS algorithm, described by Winn, et al., this takes the form of a prototypical class mask which can undergo deformation (J. Winn and N. Jojic, ‘Locus: Learning object classes with unsupervised segmentation,’ in ICCV, 2005). In other methods, it takes the form of a latent model (see J. Verbeek and B. Triggs, ‘Region classification with Markov field aspect models,’ in CVPR, 2007; and M. Pawan Kumar, P. H. S. Torr, and A. Zisserman, ‘Obj cut,’ in CVPR, 2005).
While the MRF is generative in nature, the conditional random field (CRF) directly models the conditional probability of labels given images. He, et al. incorporate region and global label features to model shape and context (X. He, R. Zemel, and M. Á. Carreira-Perpiñán, ‘Multiscale conditional random fields for image labeling,’ in CVPR, 2004). Kumar, et al. propose a two-layer hierarchical CRF which encodes both short- and long-range interactions (S. Kumar and M. Hebert, ‘A hierarchical field framework for unified context-based classification,’ in ICCV, 2005). TextonBoost is a discriminative model which is able to merge appearance, shape and context information (J. Shotton, J. Winn, C. Rother, and A. Criminisi, ‘TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,’ in ECCV, 2006). Winn, et al. propose the layout consistent random field, an enhanced version of the CRF which can deal explicitly with partial occlusion (J. Winn and J. Shotton, ‘The layout consistent random field for recognizing and segmenting partially occluded objects,’ in CVPR, 2006). Verbeek, et al. address the case of partially labeled images (J. Verbeek and B. Triggs, ‘Scene segmentation with CRFs learned from partially labeled images,’ in NIPS, 2007).
There remains a need for improved methods for semantic segmentation of an image which allow different segments of the image to be labeled according to respective object classes.