Semantic segmentation, which aims to predict a category label for every pixel in an image, is an important task for scene understanding. It is a challenging problem due to large variations in the visual appearance of the semantic classes and complex interactions between various classes in the visual world. Recently, convolutional neural networks (CNNs) have been shown to work well for this challenging task. However, CNNs may not be optimal for structured prediction tasks such as semantic segmentation, as they do not directly model the interactions between output variables.
Various semantic segmentation methods use a discrete conditional random field (CRF) on top of a CNN. By combining CNNs and CRFs, these methods exploit the ability of CNNs to model complex input-output relationships and the ability of CRFs to directly model interactions between output variables. The majority of these methods use the CRF as a separate post-processing step: a CNN processes the image to produce unary energies, which are in turn processed by the CRF to label the image. However, the CRF operates on different principles than the CNN. This disconnects the CNN from the CRF and prevents their joint training; in general, the CRF is either manually tuned or trained separately from the CNN.
One method, instead of using the CRF as a post-processing step, trains a CNN together with a discrete CRF by converting the inference procedure of the discrete CRF into a recurrent neural network. However, exact inference in discrete CRFs is in general intractable due to the discrete, non-differentiable nature of the formulation. To that end, this method relies on approximate inference procedures that do not have global optimum guarantees and can lead to poor training results.
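To make the unrolling idea concrete, the sketch below runs a few mean-field updates for a simplified discrete CRF on a 4-connected pixel grid with Potts pairwise terms. This is a minimal illustration under stated assumptions, not the formulation used by any particular method (which typically employ dense, feature-dependent pairwise potentials); all function names and parameters here are illustrative. Each loop iteration is one differentiable update of the kind that can be unrolled into recurrent network layers.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_field_potts(unary, weight=1.0, n_iters=5):
    """Approximate mean-field inference for a Potts CRF on a pixel grid.

    unary  : (H, W, L) array of unary energies (lower = more likely),
             e.g. the per-pixel scores produced by a CNN.
    weight : strength of the pairwise (neighbour-agreement) term.
    Returns approximate label marginals Q of shape (H, W, L).
    """
    Q = softmax(-unary)                  # initialise from the unaries
    for _ in range(n_iters):
        # Aggregate neighbour beliefs: msg[i, l] = sum over 4-neighbours j of Q_j(l).
        msg = np.zeros_like(Q)
        msg[1:, :] += Q[:-1, :]          # neighbour above
        msg[:-1, :] += Q[1:, :]          # neighbour below
        msg[:, 1:] += Q[:, :-1]          # neighbour to the left
        msg[:, :-1] += Q[:, 1:]          # neighbour to the right
        # The Potts pairwise energy penalises disagreement, so the
        # label-dependent part of the update rewards neighbour agreement.
        Q = softmax(-unary + weight * msg)
    return Q

# Toy example: every pixel's unary prefers label 0 except the noisy centre
# pixel; neighbour agreement corrects it after a few mean-field iterations.
unary = np.zeros((5, 5, 2))
unary[:, :, 1] = 2.0            # label 1 costs 2 everywhere...
unary[2, 2] = [2.0, 1.0]        # ...except the centre, which prefers label 1
labels = mean_field_potts(unary, weight=2.0).argmax(-1)
```

Note that each update uses only differentiable operations (array sums and softmax), which is precisely what allows a fixed number of such iterations to be treated as recurrent layers and trained jointly with the CNN that supplies the unaries.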