Many practical applications rely on the availability of semantic information about the content of media, such as images, videos, etc. Semantic information is represented by metadata which may express the type of scene, the occurrence of a specific action/activity, the presence of a specific object, etc. Such semantic information can be obtained by analysing the media.
The analysis of media is a fundamental problem which has not yet been completely solved. This is especially true when considering the extraction of high-level semantics, such as object detection and recognition, scene classification (e.g., sport type classification), action/activity recognition, etc.
Recently, the development of various neural network techniques has enabled learning to recognize image content directly from the raw image data, whereas previous techniques consisted of learning to recognize image content by comparing the content against manually designed image features. Very recently, neural networks have been adapted to take advantage of visual spatial attention, i.e. the manner in which humans perceive a new environment by focusing first on a limited spatial region of the scene for a short moment and then repeating this for a few more spatial regions in the scene in order to obtain an understanding of the semantics of the scene.
However, while the known systems provide good recognition accuracy, their semantic understanding of the image content is rather limited. Also, the computational complexity of these systems, despite significant recent improvements, is still rather high.