Semantic segmentation of an image involves assigning each pixel in the image to one of a set of classes. As an example, semantic segmentation is used to interpret road scenes in advanced driver assistance systems (ADAS), where it is termed “road segmentation.” Conventional image segmentation systems use fully convolutional neural networks (FCN) that essentially decimate a full-resolution image to a segmented lower-resolution image. This lower-resolution image is then interpolated back to the scale of the original image.
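As a rough illustration of per-pixel labeling and of interpolating a low-resolution result back to the original scale, the following minimal sketch uses plain Python lists. The class names, the per-pixel score layout, and the choice of nearest-neighbor interpolation are assumptions made for illustration only; the description above does not specify which interpolation a conventional system uses.

```python
# Hypothetical class list for a road scene (an assumption, not from the source).
CLASSES = ["road", "not_road"]

def label_pixels(scores):
    """Map each pixel's per-class scores to the index of the best-scoring class."""
    return [[max(range(len(px)), key=lambda c: px[c]) for px in row]
            for row in scores]

def upsample_nearest(labels, factor):
    """Bring a low-resolution label map back to full scale by repeating
    each label `factor` times along both axes (nearest-neighbor)."""
    out = []
    for row in labels:
        wide = [lab for lab in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

# 2x2 low-resolution score map: one [road, not_road] score pair per pixel.
low_res_scores = [[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.4, 0.6]]]
low_res_labels = label_pixels(low_res_scores)
full_res_labels = upsample_nearest(low_res_labels, 2)
```

In this sketch each low-resolution label simply covers a 2×2 block of the full-resolution output, which is why fine boundaries are lost when the segmented image is produced at a decimated resolution.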
Traditionally, FCNs work by training a convolutional neural network (CNN) to classify the central pixel of a small patch extracted from a scene. The FCN is then built by applying the same CNN to an entire scene, or frame, resulting in a segmented image with a decimated resolution. This decimated image is then up-sampled to the size of the original image so that every pixel is classified. This can be done for several resolutions at different layers of the FCN.
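The patch-based construction described above can be sketched as follows. The function `classify_patch` is a hypothetical stand-in for a trained CNN (here a simple brightness threshold), and the patch size and stride are illustrative assumptions; the point is only that applying one classifier at strided positions across a frame yields a label map smaller than the input.

```python
def classify_patch(patch):
    """Stand-in for a trained CNN (assumption): label the patch's central
    pixel 1 if the patch is bright on average, otherwise 0."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0

def segment_decimated(image, patch=3, stride=2):
    """Apply the same classifier at every `stride`-th patch position,
    producing a label map with decimated resolution."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - patch + 1, stride):
        row = []
        for left in range(0, w - patch + 1, stride):
            window = [r[left:left + patch] for r in image[top:top + patch]]
            row.append(classify_patch(window))
        out.append(row)
    return out

# 7x7 toy scene: dark left columns, bright right columns.
scene = [[0.0] * 3 + [1.0] * 4 for _ in range(7)]
decimated = segment_decimated(scene)
```

Here a 7×7 scene produces a 3×3 label map, which would then be up-sampled to 7×7 so that every pixel receives a class.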
Segnet is a variation of FCN that trains “encoders” and their corresponding “decoders” pair-wise. Feature maps generated at the output of an encoder are combined (filtered) by the corresponding decoder to create a segmented image. Thus, encoder-decoder pairs are individually trained at different resolutions and then combined in a nested architecture to achieve higher accuracy. Here, upsampling is carried out at the decoder layers by utilizing the pooling indices retained at the respective encoder layers. Additionally, a “flat” architecture is adopted, with an identical output feature size at each layer of the encoder and decoder and a constant kernel size of 7×7 across all the layers.
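Upsampling with retained pooling indices can be illustrated with the minimal sketch below. It assumes 2×2 max pooling on a single feature map whose height and width are divisible by the pooling size, and it omits the trainable decoder filtering that would follow the unpooling step; the function names are hypothetical.

```python
def max_pool_with_indices(fmap, k=2):
    """k x k max pooling that also records where each maximum came from.
    Assumes the map's height and width are divisible by k."""
    pooled, indices = [], []
    for top in range(0, len(fmap), k):
        p_row, i_row = [], []
        for left in range(0, len(fmap[0]), k):
            cells = [(fmap[r][c], (r, c))
                     for r in range(top, top + k)
                     for c in range(left, left + k)]
            value, pos = max(cells, key=lambda cell: cell[0])
            p_row.append(value)
            i_row.append(pos)
        pooled.append(p_row)
        indices.append(i_row)
    return pooled, indices

def unpool(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for p_row, i_row in zip(pooled, indices):
        for value, (r, c) in zip(p_row, i_row):
            out[r][c] = value
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]
pooled, indices = max_pool_with_indices(fmap)   # encoder side
restored = unpool(pooled, indices, 4, 4)        # decoder side
```

The restored map is sparse: each maximum returns to its original location and all other positions are zero. In an encoder-decoder pair, the decoder's filters would then densify this sparse map.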
FCN and similar approaches such as Segnet can be computationally demanding in a practical real-time application such as ADAS, which can lead to high power consumption. Thus, a need exists for an image segmentation system with a lower computational complexity than FCN that may be deployed in practical real-time applications.