Video captioning, i.e., automatically describing the content of a video in natural language, is a challenging task in computer vision. Many practical applications, such as assistive aids for visually impaired people, human-computer interaction, and video retrieval, can benefit from video captioning, and it has therefore drawn considerable research attention. In general, a video captioning system can be roughly divided into two components: video representation and sentence generation.
Traditional approaches used various visual classifiers and trackers to detect visual concepts and then generated sentences with predefined language templates. For video representation, these approaches relied on handcrafted features, which do not generalize well and cannot be trained in an end-to-end manner.
With the rapid development of deep learning, two major changes have been made to video captioning systems: convolutional neural networks (CNNs) are used for video representation and recurrent neural networks (RNNs) for sequence modeling. Earlier researchers directly extracted a global feature (i.e., a single vector representing one frame) for each video frame from a pre-trained CNN and fed these features to RNNs for sentence generation. While these plain sequence-to-sequence approaches achieve significant improvements over traditional methods, they still suffer from the loss of both spatial and temporal information in videos.
Some works tried to exploit the temporal structure of videos by adaptively assigning weights to video frames at every word generation step, which is known as temporal attention. In these works, however, video frames are still represented by global feature vectors extracted from CNNs, so the rich visual content within each frame is not fully exploited.
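The frame weighting described above can be illustrated with a minimal sketch. It assumes a simple dot-product score between each frame's global feature and the decoder state; the cited works may use a different (for example, learned) scoring function, and the names `temporal_attention`, `frames`, and `state` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, decoder_state):
    """Weight frames by relevance to the current decoder state.

    Assumption: relevance is a plain dot product between a frame's
    global feature vector and the decoder state.
    """
    scores = [
        sum(f * h for f, h in zip(frame, decoder_state))
        for frame in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Context vector: weighted sum of the per-frame global features.
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Three frames, each represented by one 2-dimensional global feature.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]  # hypothetical decoder state at one word step
weights, context = temporal_attention(frames, state)
```

In this toy input the first and third frames align with the decoder state and receive equal, larger weights than the second frame. Each frame is still a single vector, so nothing inside a frame can be attended to separately.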
In “Translating videos to natural language using deep recurrent neural networks” (Venugopalan et al., NAACL, 2015), a CNN and an RNN are adopted for video captioning: the video representation is obtained by mean-pooling CNN features extracted from a sequence of sampled video frames and is then fed to an LSTM (Long Short-Term Memory) network for caption generation. This approach effectively treats a video as a single image and ignores the temporal structure of videos. Subsequent works therefore try to encode videos while exploiting their structure. “Sequence to sequence-video to text” (Venugopalan et al., ICCV, 2015) first encodes the video feature sequence with two layers of LSTM, and the language generation (decoding) is then conditioned on the final encoding state. The LSTMs in these two stages share the same parameters. This kind of encoding-decoding approach has been successfully applied to neural machine translation (see “Sequence to sequence learning with neural networks”, Sutskever et al., NIPS, 2014). “Describing videos by exploiting temporal structure” (Yao et al., ICCV, 2015) exploits the temporal structure of a video by introducing a soft-attention mechanism in the decoding stage, which assigns weights to video frames calculated from the decoder state and the video features. “Hierarchical boundary-aware neural encoder for video captioning” (Baraldi et al., CVPR, 2017) further proposes to model the hierarchical structure of videos by detecting shot boundaries while generating captions. “Bidirectional multirate reconstruction for temporal modeling in videos” (Zhu et al., CVPR, 2017) proposes a Multirate Gated Recurrent Unit that encodes the frames of a video clip at different intervals, so that the model is capable of dealing with variance in motion speed.
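A small sketch can show why the mean-pooled representation ignores temporal structure: averaging is insensitive to frame order. The feature values and the name `mean_pool` are hypothetical and chosen only for illustration.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    count = len(frame_features)
    dim = len(frame_features[0])
    return [
        sum(frame[d] for frame in frame_features) / count
        for d in range(dim)
    ]


# Toy per-frame features for a clip played forward and in reverse.
forward = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
backward = list(reversed(forward))

pooled_forward = mean_pool(forward)
pooled_backward = mean_pool(backward)
```

Both orderings pool to the same vector, so a decoder that sees only this vector cannot distinguish a clip from its reversal.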
Because of the large amount of video data and the resulting high computational cost, spatial information has been overlooked in video captioning. In image captioning, by contrast, spatial information is widely utilized through attention. In “Show, attend and tell: Neural image caption generation with visual attention” (Xu et al., ICML, 2015), two forms of attention mechanism are proposed for image captioning. One is stochastic hard attention, which selects a single image region according to a multinoulli distribution and requires Monte Carlo sampling to train. The other is a differentiable approximation of the former, known as soft attention, which computes weights for all the image regions and then takes a weighted sum over all the regional features. Although hard attention was shown to give better performance, later researchers have preferred the soft approximation for its ease of training. “Attention correctness in neural image captioning” (Liu et al., AAAI, 2017) shows that if supervision for attention is available when training image captioning models, the trained models can better locate the regions that are relevant to the generated captions. However, due to the vast amount of video data, no such fine-grained spatial annotation exists in existing video captioning datasets.
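The difference between the two attention forms can be sketched as follows, assuming the attention weights over regions have already been computed. The function names, the toy region features, and the fixed random seed are illustrative assumptions, not part of the cited work.

```python
import random


def soft_attention(region_features, weights):
    """Deterministic weighted sum over all regional features."""
    dim = len(region_features[0])
    return [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]


def hard_attention(region_features, weights, rng):
    """Sample exactly one region from the multinoulli (categorical)
    distribution defined by the attention weights."""
    indices = range(len(region_features))
    index = rng.choices(indices, weights=weights, k=1)[0]
    return index, region_features[index]


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]  # hypothetical attention weights
rng = random.Random(0)       # fixed seed so the sketch is repeatable

soft_context = soft_attention(regions, weights)
index, hard_context = hard_attention(regions, weights, rng)
```

The soft context is the expected value of the hard context under the same weights, and it is a smooth function of those weights. The sampling step in the hard variant is not differentiable, which is why the source notes that it needs Monte Carlo sampling to train.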
Recently, some works have tried to incorporate spatial attention into video captioning. Li et al. apply region-level (spatial) soft attention to every video frame and then frame-level (temporal) attention to all the frames to obtain a multi-level attention model for video captioning (see “MAM-RNN: multi-level attention model based RNN for video captioning”, IJCAI, 2017). Yang et al. propose to generate spatial attention under the guidance of a global feature, which is obtained by mean-pooling the regional features (see “Catching the temporal regions-of-interest for video captioning”, ACM MM, 2017). They also design a Dual Memory Recurrent Model (DMRM) to incorporate the information of previously encoded global and regional features. The MAM-RNN applies spatial attention in the encoding stage followed by temporal attention in the decoding stage. The spatial attention maps are directly propagated during encoding. However, in the works of Li et al. and Yang et al., the spatial attention is generated from the regional features and the recurrent states of the RNNs, without direct guidance.
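The two-level scheme (region-level attention within each frame, then frame-level attention across frames) can be sketched roughly as below. This is a simplified illustration rather than the MAM-RNN or DMRM architecture: it assumes dot-product scoring and reuses a single hypothetical `query` vector at both levels, whereas the cited models derive attention from their recurrent states.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Dot-product soft attention: weighted sum of feature vectors."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]


def multi_level_attention(video, query):
    """Spatial attention inside each frame, then temporal attention."""
    # Level 1 (spatial): collapse each frame's regions into one vector.
    frame_vectors = [attend(regions, query) for regions in video]
    # Level 2 (temporal): collapse the frame vectors into one context.
    return attend(frame_vectors, query)


# Two frames, each with two 2-dimensional regional features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
query = [1.0, 0.0]  # hypothetical stand-in for a recurrent state
context = multi_level_attention(video, query)
```

In this sketch the only signal steering the spatial weights is the query and the regional features themselves, which mirrors the limitation noted above: no separate cue directly guides where to look.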
“Two-stream convolutional networks for action recognition in videos” (Simonyan et al., NIPS, 2014) and “Temporal segment networks: Towards good practices for deep action recognition” (Wang et al., ECCV, 2016) have shown that CNNs trained on multi-frame dense optical flow are able to achieve good action recognition performance in spite of a limited amount of training data. Although the C3D network (see “Learning spatiotemporal features with 3d convolutional networks”, Tran et al., ICCV, 2015), which operates on consecutive RGB frames, has also proven successful for recognizing actions in videos, it requires training on large-scale datasets. As a result, video captioning approaches have always used motion information from C3D merely as another modality for fusion. Venugopalan et al. feed optical flow to a CNN pre-trained on the UCF101 video dataset for feature extraction and then use the extracted features for multi-modal fusion. None of these works has used optical flow as guidance for visual attention.
There is a need to provide a new and different mechanism for video captioning.