Techniques exist for determining which elements of a visual scene are likely to attract the attention of observers. However, these attention models are limited in their ability to predict which elements are likely to attract attention in a dynamic scene, such as a video data sequence that presents multiple visual, audio, and linguistic stimuli over time. This is because existing techniques are designed to model one type of attention in isolation, not to model one type of attention in view of another. For instance, computational visual attention models were developed for static scene analysis or for controlling a vision system through camera parameters. Comprehensive user attention across the visual, audio, and linguistic channels of video has not been studied. Thus, existing attention modeling techniques are substantially limited when applied to video data that combines visual, audio, and linguistic information.
The following systems and methods address these and other limitations of conventional techniques for determining which elements of a video data sequence are likely to attract human attention.