Computer vision plays a crucial role in robotics for analyzing and understanding surrounding environments. As one of the most challenging problems in computer vision, image labeling, which aims to assign a pre-defined semantic label to each pixel in an image, is a key step to understand an image. Several techniques attempt to predict an image label for a scene. Recurrent neural networks (RNNs) are frequently used to predict image labels for a scene. However, RNNs only deal with one modality of an image and are therefore less effective in predicting scene labels for multimodal images, e.g., RGB-D scenes.