An image may contain multiple things, including objects and stuff. As used herein, the term “objects” refers to things that have a consistent shape and whose instances are countable. Examples of objects include, but are not limited to, people, animals, cars, and the like. The term “stuff” refers to things that have consistent color or texture and arbitrary shape. Examples of stuff include, but are not limited to, grass, sky, water, and the like. The imaging process usually composites the appearances of these things. Image semantic segmentation aims to recover the image regions that correspond to the things in an image by labeling each pixel in the image with a semantic category. In contrast to object recognition, which merely detects the objects in the image, semantic segmentation assigns a category label to each pixel to indicate the object or stuff to which the pixel belongs.
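By way of illustration only, the per-pixel labeling described above may be sketched as follows. The sketch assumes that some upstream model has already produced a score for every pixel and every category; the category list, the `label_pixels` helper, and the use of a simple argmax are assumptions made for this example and are not taken from the description.

```python
import numpy as np

# Illustrative categories only: two kinds of "stuff" and one kind of "object".
CATEGORIES = ["sky", "grass", "person"]

def label_pixels(scores: np.ndarray) -> np.ndarray:
    """Map per-pixel category scores of shape (H, W, K) to an (H, W) label map.

    Each pixel receives the index of its highest-scoring category
    (an assumed decision rule for this sketch).
    """
    return scores.argmax(axis=-1)

# A tiny 2x3 "image" with hand-written scores for the three categories.
scores = np.zeros((2, 3, len(CATEGORIES)))
scores[0, :, 0] = 1.0   # top row scores highest for "sky"
scores[1, :, 1] = 1.0   # bottom row scores highest for "grass"
scores[1, 1, 2] = 2.0   # one bottom-row pixel scores highest for "person"

label_map = label_pixels(scores)
label_names = [[CATEGORIES[k] for k in row] for row in label_map]
```

Here `label_map` has one entry per pixel, which is what distinguishes segmentation output from a detection result that only reports object locations.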
Convolutional neural networks (CNNs) can be used in image semantic segmentation. For example, two types of CNN features can be extracted: region features, which are extracted from proposal bounding boxes, and segment features, which are extracted from the raw image content masked by the segments. The concatenation of these two types of features is used to train classifiers.
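As a rough, non-limiting sketch, the two feature types and their concatenation might be arranged as shown below. The `toy_features` function is a hypothetical stand-in for a CNN (it merely computes per-channel statistics), and the box format, the zero-fill masking, and the function names are assumptions made for illustration.

```python
import numpy as np

def toy_features(patch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a CNN: per-channel mean and standard deviation."""
    flat = patch.reshape(-1, patch.shape[-1]).astype(np.float64)
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])

def region_and_segment_features(image, box, segment_mask):
    """Return concatenated region and segment features for one proposal.

    image:        (H, W, C) array
    box:          (top, left, bottom, right), bottom/right exclusive (assumed format)
    segment_mask: (H, W) boolean array, True inside the segment
    """
    top, left, bottom, right = box

    # Region feature: computed from the bounding-box crop of the raw image.
    region_feat = toy_features(image[top:bottom, left:right])

    # Segment feature: same crop, but pixels outside the segment are zeroed
    # (zero-fill is an assumed way of masking the raw image content).
    masked = image * segment_mask[..., None]
    segment_feat = toy_features(masked[top:bottom, left:right])

    # The concatenated vector is what a classifier would be trained on.
    return np.concatenate([region_feat, segment_feat])

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True

features = region_and_segment_features(image, (1, 2, 6, 7), mask)
```

With a three-channel image, each toy feature has six entries, so `features` has twelve; with a real CNN the two halves would instead be the network's activations for the box crop and for the masked crop.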