Video has become ubiquitous on the Internet, broadcasting channels, as well as that generated by personal devices. This has encouraged the development of advanced techniques to analyze semantic video content for a wide variety of applications. These applications include editing, indexing, search and sharing. Recognition of videos has been fundamental predominantly focused on recognizing videos with a predefined yet very limited set of individual words.
Existing video description approaches mainly optimize the next word given the input video and previous words, while leaving the relationship between the semantics of the entire sentence and video content unexploited. As a result, the generated sentences can suffer from a lack of robustness. It is often the case that the output sentence from existing approaches may be contextually correct but the semantics (e.g., subjects, verbs or objects) in the sentence are incorrect.