Automatic video description, also known as video captioning, refers to the automatic generation of a natural language description (e.g., a sentence) that narrates an input video. Video description has widespread applications, including video retrieval, automatic description of home movies or online uploaded video clips, video descriptions for the visually impaired, warning generation for surveillance systems, and scene understanding for knowledge sharing between humans and machines.
Video description systems extract salient features from the video data. These may be multimodal features, such as image features representing objects, motion features representing actions, and audio features indicating events. The systems then generate a description narrating the events, so that the words in the description are relevant to the extracted features and are ordered appropriately as natural language.
One inherent problem in video description is that the sequence of video features and the sequence of words in the description are not synchronized. In fact, objects and actions may appear in the video in a different order than they appear in the sentence. When choosing the right words to describe something, only the features that directly correspond to that object or action are relevant, and the other features are a source of clutter. In addition, not every event is observed in every type of feature.
Accordingly, there is a need to use different features inclusively or selectively to infer each word of the description to achieve high-quality video description.
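As a rough illustration of what using features "selectively" could mean, the following minimal sketch weights one feature vector per modality by a relevance score for the word currently being generated and sums the results. It is not a method stated in this section. The softmax weighting, the function names `softmax` and `fuse_features`, the toy feature vectors, and the relevance scores are all assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Turn real-valued scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(features, scores):
    """Return per-modality weights and the weighted sum of the feature vectors.

    features: dict mapping modality name -> feature vector (equal lengths)
    scores:   dict mapping modality name -> relevance score for the current word
    """
    names = list(features)
    weights = softmax([scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += w * value
    return dict(zip(names, weights)), fused

# Hypothetical toy inputs: one 2-dimensional vector per modality.
features = {
    "image": [1.0, 0.0],
    "motion": [0.0, 1.0],
    "audio": [0.5, 0.5],
}

# Assumed scores for a word naming an object: the image modality dominates.
object_scores = {"image": 2.0, "motion": 0.0, "audio": 0.0}
object_weights, object_fused = fuse_features(features, object_scores)

# Equal scores use every modality inclusively, with equal weight.
equal_scores = {"image": 0.0, "motion": 0.0, "audio": 0.0}
equal_weights, equal_fused = fuse_features(features, equal_scores)
```

In this sketch, a high score for one modality suppresses the others (selective use), while equal scores blend all modalities evenly (inclusive use). How such scores would be obtained for each word is left open here.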