Semantic segmentation aims to assign a categorical label to every pixel in an image, which plays an important role in image analysis and self-driving systems. Conventional systems use processes including: Decoding of Feature Representation and Dilated Convolution. In the pixel-wise semantic segmentation task with a decoding of feature representation, the output label map typically has the same size as the input image. Because of the operation of max-pooling or strided convolution in convolutional neural networks (CNNs), the size of feature maps of the last few layers of the network are inevitably downsampled. Multiple approaches have been proposed to decode accurate information from the downsampled feature map to label maps. Bilinear interpolation is commonly used as it is fast and memory efficient. Another popular method is called deconvolution, in which the unpooling operation, using stored pooling switches from the pooling step, recovers the information necessary for image reconstruction and feature visualization. In some implementations, a single deconvolutional layer is added in the decoding stage to produce the prediction result using stacked feature maps from intermediate layers. In other implementations, multiple deconvolutional layers are applied to generate chairs, tables, or cars from several attributes. Several studies employ deconvolutional layers as a mirrored version of convolutional layers by using stored pooled location in unpooling step. Other studies show that coarse-to-fine object structures, which are crucial to recover fine-detailed information, can be reconstructed along the propagation of the deconvolutional layers. Other systems use a similar mirrored structure, but combine information from multiple deconvolutional layers and perform upsampling to make the final prediction. Some systems predict the label map by applying a classifier on a per-pixel basis, as it is more statistically efficient.
Dilated Convolution (or Atrous convolution) was originally developed for wavelet decomposition. The main idea of dilated convolution is to insert “holes” (zeros) between pixels in convolutional kernels to increase image resolution, thus enabling dense feature extraction in deep CNNs. In the semantic segmentation framework, dilated convolution is also used to enlarge the field of convolutional kernels. Some prior systems use serialized layers with increasing rates of dilation to enable context aggregation, while designing an “atrous spatial pyramid pooling (ASPP)” scheme to capture multi-scale objects and context information by placing multiple dilated convolution layers in parallel. More recently, dilated convolution has been applied to a broader range of tasks, such as object detection optical flow, visual question answering, and audio generation. However, these conventional systems can cause a “gridding issue” produced by the standard dilated convolution operation. Other conventional systems lose information in the downsampling process and thus fail to enable identification of important objects in the input image.
Object contour detection is a fundamental problem for numerous vision tasks, including image segmentation, object detection, semantic instance segmentation, and occlusion reasoning. Detecting all objects in a traffic environment, such as cars, buses, pedestrians, and bicycles, is crucial for building an autonomous driving system. Failure to detect an object (e.g., a car or a person) may lead to malfunction of the motion planning module of an autonomous driving car, thus resulting in a catastrophic accident. The semantic segmentation framework provides pixel-level categorical labeling, but no single object-level instance can be discovered. Current object detection frameworks, although useful, cannot recover the shape of the object or deal with the occluded object detection problem. This is mainly because of the limits of the bounding box merging process in the conventional framework. In particular, problems occur when nearby bounding boxes that may belong to different objects get merged together to reduce a false positive rate, thus making the occluded object undetected, especially when the occluded region is large.