The computing power of modern high-speed computers has reached an impressive level. Nevertheless, computer vision systems still cannot perform visual tasks that are extremely simple for humans, such as guiding a person across a road. The main reason is that, when facing a flood of visual input, human eyes can quickly and selectively focus on significantly changed regions of the visual scene and then analyze those regions to adapt to changes in the environment. A computer vision system, by contrast, treats all regions of the visual scene indiscriminately, so it cannot understand changes in the scene and may run into a computing bottleneck. If the selective attention function of the human visual system is introduced into computer vision systems, the efficiency with which computers analyze images can be expected to improve.
Detection of visually salient regions in a video has a wide range of applications, for example in video compression. When a video is compressed, it is desirable that meaningful content be compressed at a relatively low compression ratio, while less important background regions are compressed at a relatively high compression ratio. For a device to do this automatically, the visual saliency of each region in each frame of the video must first be determined, so that the meaningful content in the video can be identified.
In the literature on visual saliency detection, a visually salient region is generally defined as a local image block that is globally conspicuous within an image or a video frame. A common implementation of this definition is as follows: the image or video frame is divided into a plurality of image blocks; the dissimilarity of each image block with respect to every other image block is calculated; and the image blocks with relatively high dissimilarity are considered relatively salient regions. The dissimilarity between two image blocks may be determined by comparing their contrast in features such as color, orientation, texture, and motion. Another definition holds that a region having a large contrast with its adjacent regions is relatively salient. The main difference between implementations of this definition and those of the above definition based on global conspicuity is that the dissimilarity is determined between each image block and its surrounding image blocks, rather than between each image block and all the other image blocks in the current image.
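The two definitions can be contrasted with a minimal sketch. The function names, the use of raw pixel values as the block feature, the L1 distance as the dissimilarity measure, and the 8-connected neighborhood are all assumptions made for illustration; they are not taken from any of the cited methods.

```python
import numpy as np

def split_into_blocks(image, block):
    """Split an H x W (x C) image into non-overlapping block x block patches.

    Returns an array of shape (rows, cols, feature_dim), where each feature
    vector is simply the flattened pixel values of one block (an assumption).
    """
    h, w = image.shape[:2]
    rows, cols = h // block, w // block
    image = np.asarray(image, dtype=float)[:rows * block, :cols * block]
    image = image.reshape(rows, block, cols, block, -1)
    return image.transpose(0, 2, 1, 3, 4).reshape(rows, cols, -1)

def global_dissimilarity(blocks):
    """Mean L1 dissimilarity of each block to all other blocks in the frame."""
    rows, cols, dim = blocks.shape
    flat = blocks.reshape(-1, dim)
    pairwise = np.abs(flat[:, None, :] - flat[None, :, :]).sum(axis=2)
    return pairwise.sum(axis=1).reshape(rows, cols) / max(rows * cols - 1, 1)

def local_dissimilarity(blocks):
    """Mean L1 dissimilarity of each block to its (up to 8) adjacent blocks."""
    rows, cols, _ = blocks.shape
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        total += np.abs(blocks[r, c] - blocks[rr, cc]).sum()
                        count += 1
            out[r, c] = total / count if count else 0.0
    return out

# Toy frame: one bright 2 x 2 block on a dark background.
frame = np.zeros((8, 8))
frame[2:4, 2:4] = 1.0
blocks = split_into_blocks(frame, block=2)
global_map = global_dissimilarity(blocks)
local_map = local_dissimilarity(blocks)
```

In this toy frame both maps peak at the bright block. The two definitions diverge on real images, where a block may differ from its neighbors yet resemble many distant blocks, or the reverse.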
In the above two methods, what is mainly considered is the dissimilarity between image blocks. In fact, however, the distance between image blocks is also directly related to visual saliency. Studies on human perceptual organization show that salient regions tend to appear in an image in a relatively compact manner. That is, if a local image block is similar to image blocks within a short distance of it, the image block is probably salient. If the distance between two image blocks is relatively large, each contributes less to the saliency of the other, even if the two are similar. Therefore, in an image, the contribution of one image block to the saliency of another increases as the dissimilarity between them increases, and decreases as the distance between them increases.
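One simple functional form with both monotonic properties (increasing in dissimilarity, decreasing in distance) is dissimilarity divided by one plus a scaled distance. The sketch below uses that form purely as an assumption; the function names, the constant `c`, and the L1 and Euclidean measures are illustrative choices rather than the formula of any particular method.

```python
import numpy as np

def pairwise_contribution(features, positions, c=3.0):
    """Contribution of block j to the saliency of block i (entry [i, j]).

    features:  (n, d) array of per-block feature vectors.
    positions: (n, 2) array of block centers, normalized to [0, 1].
    c:         hypothetical constant controlling how fast distance discounts.
    """
    features = np.asarray(features, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # L1 dissimilarity between every pair of blocks.
    dissim = np.abs(features[:, None, :] - features[None, :, :]).sum(axis=2)
    # Euclidean distance between every pair of block centers.
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    # Grows with dissimilarity, shrinks with distance.
    return dissim / (1.0 + c * dist)

def block_saliency(features, positions, c=3.0):
    """Average contribution received by each block from all other blocks."""
    contrib = pairwise_contribution(features, positions, c)
    n = contrib.shape[0]
    return contrib.sum(axis=1) / max(n - 1, 1)

# Block 0 differs equally from blocks 1 and 2, but block 1 is much closer.
feats = [[0.0], [1.0], [1.0]]
centers = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0]]
contrib = pairwise_contribution(feats, centers)
saliency = block_saliency(feats, centers)
```

Here `contrib[0, 1]` exceeds `contrib[0, 2]` because the nearer block is discounted less, while `contrib[1, 2]` is zero because the two blocks have identical features.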
Moreover, studies on the human visual system show that human eyes exhibit a central bias when observing a visual scene. Statistics on the distribution of human eye fixations, recorded by a gaze tracker over a large number of images, also show that, although the fringe regions of a few images may contain relatively salient content, in general the average degree of attention that a human eye pays to a region of an image decreases as the distance between that region and the center of the image increases.
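The central bias can be illustrated as a per-block weight that falls off with distance from the image center. The linear falloff and the function name below are assumptions chosen for simplicity; the text states only that attention decreases with distance, not the shape of the decrease.

```python
import numpy as np

def center_bias_weights(rows, cols):
    """Weight for each block in a rows x cols grid.

    The weight is 1 at the image center and decreases linearly (an assumed
    falloff) with the distance of the block center from the image center.
    """
    # Block centers in normalized image coordinates, in (0, 1).
    r = (np.arange(rows) + 0.5) / rows
    c = (np.arange(cols) + 0.5) / cols
    rr, cc = np.meshgrid(r, c, indexing="ij")
    dist = np.hypot(rr - 0.5, cc - 0.5)
    max_dist = np.hypot(0.5, 0.5)  # distance from the center to a corner
    return 1.0 - dist / max_dist

weights = center_bias_weights(5, 5)
```

Multiplying a block-level saliency map of the same grid size by `weights` would damp responses near the fringe of the image while leaving the central block unchanged.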
Patent Application No. 201010522415.7 discloses a method for detecting the visual saliency of different regions in an image, in which the saliency of each image block is measured using features of appearance, position, and distance to the center. However, that method considers only differences in spatial features between two image blocks and ignores differences in their motion features. In fact, when a person watches a video, motion features are key factors in attracting the eye, and the human visual system allocates substantial resources to motion perception. Moreover, human eyes are capable of keeping track of a target object. Therefore, differences in motion features should be considered when measuring the saliency of image blocks in a video.