The exemplary embodiment relates to segmentation of images into semantic regions and in particular, to a system and method which generate a hierarchical segmentation of regions and then apply hierarchically trained semantic classifiers for labeling the regions.
For many applications involving the processing of images, or visual documents, one of the processing steps involves the segmentation of the entire image into regions of pixels which are labeled according to semantic class to aid in understanding a scene. For example, in the case of photographic images, this may entail creating pixel-level masks for classes such as “sky,” “road,” “people,” “buildings,” etc. The task of simultaneously segmenting the image and classifying (labeling) the different parts of the scene is called semantic segmentation or scene segmentation.
Semantic segmentation has been approached using a variety of methods. Many of these methods employ a classifier based on region descriptors that aggregate lower-level information, such as texture or color. See, for example, Gonfaus, et al., “Harmony potentials for joint classification and segmentation,” CVPR 2010 (“Gonfaus”); Gu, et al., “Recognition using regions,” CVPR 2009; Lim, et al., “Context by region ancestry,” ICCV 2009 (“Lim”); and Vijayanarasimhan, et al., “Efficient region search for object detection,” CVPR 2011.
In some approaches, an unsupervised segmentation method is first used which partitions the image into several regions. A classification algorithm can then be applied to the regions, described by aggregating low-level features, to determine whether a given part of the scene is sky, a building, a road, and so forth. The region identification step serves to reduce the complexity of the classification method, allowing the classifiers to work at the super-pixel level instead of the pixel level and enforcing label consistency between neighboring pixels. These methods are based on the assumption that pixels within a region are more likely to have the same label. Classifiers are often learnt (and then applied at test time) on this “bag-of-regions”.
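The pipeline just described can be illustrated schematically. The following sketch (not the method of any cited work; the region map, features, and threshold classifier are invented toy values) mean-pools per-pixel features within each region of an unsupervised segmentation, classifies each region, and propagates the region label back to its pixels:

```python
# Sketch: region-level classification over an unsupervised segmentation.
# A trivial threshold classifier stands in for a learned one.
from collections import defaultdict

def aggregate_region_features(seg, feats):
    """Mean-pool pixel features within each region of `seg`.
    seg: 2-D grid of region ids; feats: 2-D grid of per-pixel feature vectors."""
    sums, counts = {}, defaultdict(int)
    for r_row, f_row in zip(seg, feats):
        for rid, f in zip(r_row, f_row):
            sums[rid] = [s + x for s, x in zip(sums.get(rid, [0.0] * len(f)), f)]
            counts[rid] += 1
    return {rid: [s / counts[rid] for s in sums[rid]] for rid in sums}

def label_pixels(seg, region_features, classify):
    """Apply a region-level classifier, then give every pixel its region's label."""
    region_labels = {rid: classify(f) for rid, f in region_features.items()}
    return [[region_labels[rid] for rid in row] for row in seg]

# Toy example: two regions, a 1-D "brightness" feature;
# bright regions are called "sky", dark ones "road".
seg = [[0, 0, 1],
       [0, 1, 1]]
feats = [[[0.9], [0.8], [0.2]],
         [[0.85], [0.1], [0.15]]]
classify = lambda f: "sky" if f[0] > 0.5 else "road"
mask = label_pixels(seg, aggregate_region_features(seg, feats), classify)
# mask is a pixel-level label map: [["sky", "sky", "road"], ["sky", "road", "road"]]
```

Because the classifier sees one descriptor per region rather than one per pixel, the number of classification decisions drops from the pixel count to the region count, which is the complexity reduction noted above.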
One problem with such methods is that labeling is performed at the region level. In the event that a region groups multiple semantic classes, this often leads to incorrect labeling of the pixels. To overcome this problem, a hierarchy of regions has been proposed instead of a flat structure (single partition of the image into regions). See, for example, Gonfaus; and Lempitsky, et al., “A pylon model for semantic segmentation,” NIPS 2011. For these methods, the hierarchical structure is used after the semantic scene recognition, during a post-processing step. This structure is used to enforce label consistency between regions of the structure, by means of a conditional random field (CRF) model. In the method of Lim, all ancestors of a leaf region are treated as context. This method is based on the assumption that context can improve recognition. A super-pixel labeling problem is considered, and a hierarchy of segmentations, i.e., a region tree, is used. The super-pixels are the leaves of the tree, and the labeling decision for the super-pixels depends on all the ancestral regions, i.e., all the parents of the leaf in the tree. A weight vector is learnt for each leaf region that concatenates weights for each parent region. Weights are learnt for each feature of each super-pixel. One problem with this method is that it does not scale well to large training sets, as the information for all regions for all training images has to be retained.
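The ancestry-based labeling decision described above can be sketched as follows. This is an illustrative toy in the spirit of Lim's approach, not its actual formulation: the tree, features, and weights are invented values rather than learned ones, and a simple sum of per-ancestor linear scores stands in for the concatenated weight vector.

```python
# Sketch: the label score for a leaf super-pixel accumulates contributions
# from the leaf region and all of its ancestors in the region tree.

def ancestry(tree, leaf):
    """Return the path leaf -> root in a {child: parent} tree."""
    path = [leaf]
    while path[-1] in tree:
        path.append(tree[path[-1]])
    return path

def score(leaf, label, tree, features, weights):
    """Sum a linear score over the leaf and every ancestral region."""
    return sum(
        sum(w * f for w, f in zip(weights[label], features[region]))
        for region in ancestry(tree, leaf)
    )

def classify_leaf(leaf, labels, tree, features, weights):
    """Pick the label whose ancestry-aggregated score is highest."""
    return max(labels, key=lambda lab: score(leaf, lab, tree, features, weights))

# Toy hierarchy: super-pixel "a" sits inside region "r", which sits in "root".
tree = {"a": "r", "r": "root"}
features = {"a": [1.0, 0.0], "r": [0.6, 0.4], "root": [0.0, 1.0]}
weights = {"sky": [1.0, 0.0], "road": [0.0, 1.0]}
label = classify_leaf("a", ["sky", "road"], tree, features, weights)
```

The scaling problem noted above is visible even in this toy: scoring any leaf requires the features of every region on its ancestral path, so training over a large image set means retaining the descriptors of all regions at all levels of every tree.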