Semantic segmentation aims to assign a categorical label to every pixel in an image and plays an important role in image analysis and self-driving systems. Conventional systems use processes including decoding of feature representation and dilated convolution.
In pixel-wise semantic segmentation with decoding of feature representation, the output label map typically has the same size as the input image. Because of max-pooling or strided convolution operations in convolutional neural networks (CNNs), the feature maps of the last few layers of the network are inevitably downsampled. Multiple approaches have been proposed to decode accurate information from the downsampled feature maps into label maps. Bilinear interpolation is commonly used because it is fast and memory efficient. Another popular method is deconvolution, in which an unpooling operation uses the pooling switches stored during the pooling step to recover the information necessary for image reconstruction and feature visualization. In some implementations, a single deconvolutional layer is added in the decoding stage to produce the prediction result from stacked feature maps of intermediate layers. In other implementations, multiple deconvolutional layers are applied to generate images of chairs, tables, or cars from several attributes. Several studies employ deconvolutional layers as a mirrored version of convolutional layers by using the stored pooling locations in the unpooling step. Other studies show that coarse-to-fine object structures, which are crucial for recovering fine-detailed information, can be reconstructed along the propagation of the deconvolutional layers. Other systems use a similar mirrored structure but combine information from multiple deconvolutional layers and perform upsampling to make the final prediction. Some systems predict the label map by applying a classifier on a per-pixel basis, as this is more statistically efficient.
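The unpooling operation with stored pooling switches described above can be illustrated with a minimal sketch. The function names `max_pool_with_switches` and `unpool`, the 2x2 window, and the example feature map are assumptions made for illustration only; they do not come from any particular conventional system.

```python
def max_pool_with_switches(x, k=2):
    """k x k max pooling with stride k that also records each window's argmax."""
    h, w = len(x), len(x[0])
    pooled, switches = [], []
    for i in range(0, h - k + 1, k):
        pooled_row, switch_row = [], []
        for j in range(0, w - k + 1, k):
            # Pair every value in the window with its (row, col) position.
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(k) for dj in range(k)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(position)  # the "pooling switch"
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, out_h, out_w):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0] * out_w for _ in range(out_h)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


# Hypothetical 4x4 feature map used only for this illustration.
feature_map = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 0, 5, 6],
    [1, 0, 7, 8],
]
pooled, switches = max_pool_with_switches(feature_map)  # pooled == [[4, 1], [1, 8]]
restored = unpool(pooled, switches, 4, 4)
```

The restored map is sparse: only the positions that won each pooling window carry a value, which is why deconvolutional layers typically follow the unpooling step to densify the result.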
Dilated convolution (or atrous convolution) was originally developed for wavelet decomposition. The main idea of dilated convolution is to insert “holes” (zeros) between pixels in convolutional kernels to increase the resolution of intermediate feature maps, thus enabling dense feature extraction in deep CNNs. In the semantic segmentation framework, dilated convolution is also used to enlarge the receptive field of convolutional kernels. Some prior systems use serialized layers with increasing rates of dilation to enable context aggregation, while others design an “atrous spatial pyramid pooling (ASPP)” scheme that captures multi-scale objects and context information by placing multiple dilated convolution layers in parallel. More recently, dilated convolution has been applied to a broader range of tasks, such as object detection, optical flow, visual question answering, and audio generation.
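As a rough one-dimensional sketch of the idea, the snippet below inserts zeros between kernel taps and then applies an ordinary sliding-window product. The helper names (`dilate_kernel`, `conv1d_valid`, `dilated_conv1d`) and the sample signal are hypothetical and chosen only for illustration; practical systems operate on two-dimensional feature maps and usually skip the zero taps rather than materializing them.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighboring kernel taps."""
    dilated = []
    for index, weight in enumerate(kernel):
        if index > 0:
            dilated.extend([0] * (rate - 1))
        dilated.append(weight)
    return dilated


def conv1d_valid(signal, kernel):
    """Sliding-window product without padding (cross-correlation form)."""
    k = len(kernel)
    return [sum(signal[i + t] * kernel[t] for t in range(k))
            for i in range(len(signal) - k + 1)]


def dilated_conv1d(signal, kernel, rate):
    """Dilated convolution expressed as a plain convolution with a zero-filled kernel."""
    return conv1d_valid(signal, dilate_kernel(kernel, rate))


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

standard = dilated_conv1d(signal, kernel, rate=1)  # [6, 9, 12, 15, 18]
dilated = dilated_conv1d(signal, kernel, rate=2)   # [9, 12, 15]
effective_size = len(dilate_kernel(kernel, 2))     # 5 = k + (k - 1) * (rate - 1)
```

With a rate of 2, the three-tap kernel spans five input positions while still using only three weights, which is how the receptive field is enlarged without adding parameters.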
However, these conventional systems can suffer from a “gridding issue” caused by the standard dilated convolution operation. Other conventional systems lose information in the downsampling process and thus fail to identify important objects in the input image.
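One common reading of the gridding issue is that stacked dilated layers sharing the same rate only ever sample a sparse grid of input positions. The sketch below, under that assumption, lists which input offsets can influence a single output position in a one-dimensional stack; the helper `reachable_offsets` is a hypothetical name introduced here for illustration.

```python
def reachable_offsets(kernel_size, rates):
    """Input offsets that can reach one output position through stacked
    1-D dilated convolutions with the given per-layer dilation rates."""
    half = kernel_size // 2
    offsets = {0}
    for rate in rates:
        # Each layer lets every reachable offset shift by a multiple of its rate.
        offsets = {o + t * rate for o in offsets for t in range(-half, half + 1)}
    return sorted(offsets)


same_rate = reachable_offsets(3, [2, 2])   # [-4, -2, 0, 2, 4]
mixed_rate = reachable_offsets(3, [1, 2])  # [-3, -2, -1, 0, 1, 2, 3]
```

With two layers of rate 2, every odd offset is skipped, so neighboring input positions never contribute to the same output; mixing rates in this toy case fills the gaps.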