In the field of video content analysis, visual attention is the ability to rapidly detect the interesting parts of a given scene. In a typical spatiotemporal visual attention computing model, low-level spatial and temporal features are extracted, and a master “saliency map”, which helps identify visual attention, is generated by feeding in all feature maps in a purely bottom-up manner. By identifying the visual attention for each image of the sequence, the attention trajectory is obtained. However, the conventional attention computing scheme has several inherent disadvantages: 1) since a variety of features compete in the saliency map, a slight change in any of these features may lead to a different result, which means that the attention trajectory so calculated is unstable and flickers over time; 2) in a specific time slot, the attention may be fully or partially omitted because of occlusion, a position at a critical saliency degree, an attention boundary, etc.; 3) the scheme may produce noise or very short-lived attention, and when it is adopted in attention-based video compression/streaming or other applications, such unsmooth attention leads to subjective quality degradation.
FIG. 1 shows the general architecture of Itti's attention model. In Itti's attention model, presented by L. Itti, C. Koch and E. Niebur in “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, November 1998, the visual input is first decomposed into a set of topographic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master “saliency map”, which topographically codes for local conspicuity over the entire visual scene.
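The bottom-up flow described above (feature maps, local competition, master saliency map) can be illustrated with a minimal sketch. It is not Itti's actual implementation: the published model uses multi-scale pyramids and a more elaborate normalization operator, whereas this sketch assumes a single scale, a box-mean surround, and simple peak normalization. All function names and parameters are hypothetical.

```python
def box_mean(grid, radius):
    """Mean of each cell's neighbourhood within `radius`, clipped at the borders."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [grid[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def center_surround(feature, surround_radius=2):
    """Keep only what stands out locally: |value - mean of its surround|."""
    surround = box_mean(feature, surround_radius)
    return [[abs(c - s) for c, s in zip(frow, srow)]
            for frow, srow in zip(feature, surround)]


def normalize(grid):
    """Scale a map so its peak is 1 (assumed stand-in for the model's normalization)."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    return [[v / peak for v in row] for row in grid]


def saliency_map(feature_maps):
    """Average the normalized center-surround responses of all feature maps."""
    maps = [normalize(center_surround(f)) for f in feature_maps]
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(w)]
            for y in range(h)]


def most_salient(saliency):
    """Return the (row, column) of the highest saliency value."""
    cells = [(y, x) for y in range(len(saliency)) for x in range(len(saliency[0]))]
    return max(cells, key=lambda p: saliency[p[0]][p[1]])
```

In this sketch a uniform feature map contributes nothing, because no location stands out from its surround, while a map with a single isolated peak dominates the combined result.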
As an extension of Itti's attention model, Y. F. Ma et al. take temporal features into account in the model published by Y. F. Ma, L. Lu, H. J. Zhang and M. J. Li in “A User Attention Model for Video Summarization”, ACM Multimedia '02, pp. 533-542, December 2002. In this model, the motion field between the current and the next frame is extracted, and a set of motion features, such as motion intensity, spatial coherence and temporal coherence, is computed from it.
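As a rough illustration of such motion features, the sketch below derives a normalized motion intensity from a motion field and measures coherence as the entropy of motion directions. The text above only names the features; the entropy formulation, the bin count, and all function names are assumptions made for this example, not a restatement of the cited model.

```python
import math


def motion_intensity(field):
    """Per-block motion magnitude, normalized by the largest magnitude in the frame.

    `field` is a 2-D list of (dx, dy) motion vectors, one per block.
    """
    mags = [[math.hypot(dx, dy) for dx, dy in row] for row in field]
    peak = max(max(row) for row in mags)
    if peak == 0:
        return [[0.0] * len(row) for row in mags]
    return [[m / peak for m in row] for row in mags]


def direction_entropy(vectors, bins=8):
    """Shannon entropy of the motion-direction histogram.

    Low entropy means the vectors point in similar directions (coherent motion).
    Passing the vectors of a spatial window gives a spatial coherence measure;
    passing the vectors of one block across several frames gives a temporal one.
    """
    hist = [0] * bins
    for dx, dy in vectors:
        if dx == 0 and dy == 0:
            continue  # zero vectors have no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[min(int(angle / (2 * math.pi) * bins), bins - 1)] += 1
    total = sum(hist)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in hist if c)
```

Four vectors pointing the same way give an entropy of zero, while four vectors pointing in four different directions give the maximum entropy for four samples.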
The attention models created by the above schemes are sensitive to feature changes, which leads to an unsmooth attention trajectory across time, as follows:
(1) Successive images in an image sequence are very similar, and viewers do not tend to change their visual focus during a time slot; unfortunately, the slight changes between these successive images can make the calculated attention differ greatly;
(2) When an attention object becomes non-attention, or is occluded by a non-attention object, for a short period, viewers do not change their visual focus because they remember the object; again, attention models fail to reflect this; and
(3) Attention models tend to generate noise or short-lived attention, which in fact does not attract the viewer's attention.
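The first of these effects can be reproduced with a small sketch in which attention is chosen independently in each frame. The region labels and saliency values are hypothetical and chosen only to show how a change of a few hundredths flips the result.

```python
def attention_trajectory(saliency_frames):
    """Pick the most salient region independently in every frame."""
    return [max(frame, key=frame.get) for frame in saliency_frames]


def count_switches(trajectory):
    """Number of frame-to-frame changes of the attended region."""
    return sum(1 for a, b in zip(trajectory, trajectory[1:]) if a != b)


# Two candidate regions with nearly equal saliency (hypothetical values).
# The scene barely changes, yet the per-frame winner alternates.
frames = [
    {"face": 0.81, "ball": 0.80},
    {"face": 0.79, "ball": 0.80},
    {"face": 0.82, "ball": 0.80},
    {"face": 0.78, "ball": 0.80},
]

trajectory = attention_trajectory(frames)
switches = count_switches(trajectory)
```

Here the attended region alternates between the two candidates in every frame, although a viewer watching such nearly identical frames would keep a single focus.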
In attention-based video applications such as ROI (Region of Interest)-based video coding, such unsmoothness leads to subjective visual quality degradation. In ROI-based video coding, more resources are allocated to the more attractive ROI, resulting in a clearer ROI and a relatively blurred non-ROI. With an unsmooth ROI trajectory, a viewer focused on the ROI will notice the changing quality (the region becoming clear or blurred from time to time), which leads to an unpleasant viewing experience.
Therefore, it is desirable to develop an improved method of emendation for the attention trajectory that reduces the influence of these disadvantages and makes the generated attention smooth.