As a neurobiological concept, attention implies the concentration of mental powers upon an object by close or careful observation. The attention area is the area of a picture that tends to attract more human attention. A system designed to automatically detect the attention area of a picture is called an attention model. The detected attention area is used in many kinds of applications, such as concentrating limited resources on the attention area, directing retrieval/search, and simplifying analysis.
FIG. 1 shows the general architecture of a commonly used attention model. First, an image to be estimated is input into the attention model. Then features such as intensity, colour, and orientation are obtained in the feature extraction step. In the third step, the salience of these features is estimated. After the fusion scheme and post-processing steps, the attention area is finally obtained.
Unlike the attention models used in most previous machine vision systems, which drive attention based on the spatial location hypothesis with the macro-block (MB) as the basic unit, other models direct visual attention in an object-driven way and are called object-based visual attention models.
Much research has been carried out on MB (macro-block) spatial-based visual attention, as proposed by L. Itti et al., “A Model of Salience-Based Visual Attention for Rapid Scene Analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 20, No. 11, November 1998 and by Y. F. Ma et al., “A User Attention Model for Video Summarization”, ACM Multimedia'02, pp. 533-542, December 2002. However, object-based visual attention has not been studied as widely because of its inherent difficulty. Y. Sun et al. propose a framework of object-based visual attention in “Object-based Visual Attention for Computer Vision”, Artificial Intelligence, pp. 77-123, May 2003. Another object-based visual attention model is presented by F. Orabona et al., “Object-based Visual Attention: a Model for a Behaving Robot”, 3rd International Workshop on Attention and Performance in Computational Vision, June 2005. Both object-based visual attention schemes still follow the general architecture of the attention model shown in FIG. 1. All the processes except “salience estimation” are directly inherited from Itti's MB spatial-based visual attention model.
In both MB spatial-based and object-based visual attention models, low-level spatial/temporal features are first extracted. Then a feature map is estimated over the whole picture for each feature, recording how salient (that is, different or outstanding from its surroundings, and hence more attractive) each unit is. After that, a master “salience map” is generated by feeding in all feature maps in a purely bottom-up manner.
Compared with the object-based visual attention model, the MB spatial-based visual attention model is much easier and faster to create. However, it has several inherent disadvantages:
1) The attention area breaks natural object boundaries;
2) Each macro-block may cover many natural objects.
As a result, the extracted feature of a macro-block is a mixed property of all these natural objects, which lowers the precision of attention area detection.
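The mixing effect can be illustrated with a minimal sketch. It assumes a toy grey-level image stored as a list of rows and a hypothetical helper `block_means`; neither the helper nor the 4×4 block size is taken from the models cited above.

```python
def block_means(image, block):
    """Mean intensity of each block x block macro-block of a 2-D list."""
    rows, cols = len(image), len(image[0])
    means = []
    for r in range(0, rows, block):
        row_means = []
        for c in range(0, cols, block):
            # Collect every pixel that falls inside this macro-block.
            cells = [image[i][j]
                     for i in range(r, min(r + block, rows))
                     for j in range(c, min(c + block, cols))]
            row_means.append(sum(cells) / len(cells))
        means.append(row_means)
    return means

# Two "objects": a dark one (0) covering six columns, a bright one (100) covering two.
image = [[0] * 6 + [100] * 2 for _ in range(4)]
means = block_means(image, 4)
print(means)  # [[0.0, 50.0]]
```

The second block straddles the boundary between the dark and the bright object, so its mean of 50 describes neither of them.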
The key issue of the object-based visual attention model lies in two aspects: one is the grouping of objects before feature extraction, the other is the efficient estimation of the salience of each object relative to all the other objects in the image. The central idea of the currently used salience estimation scheme is based on the Gauss distance measure presented by Y. Sun et al.
Denote x as the object whose salience is to be estimated, yi (i=1, 2, . . . , n) as all the background objects, w as the maximum of the width and height of the input image, and ‖x−yi‖ as the physical distance between x and yi. The Gauss distance is then defined by formula (1),
$$d_{gauss}(x, y_i) = \left(1 - \frac{\|x - y_i\|}{w - 1}\right) e^{-\frac{1}{2\sigma^2}\|x - y_i\|^2} \tag{1}$$
with the scale σ set to w/ρ, where ρ is a positive integer and generally 1/ρ may be set to a percentage such as 2%, 4%, 5% or 20%, 25%, 50%, etc.
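As a hedged illustration, formula (1) can be written as a small Python function. The function name `gauss_distance`, the use of (column, row) centre points as object positions, and the default `rho=4` (σ equal to 25% of w) are assumptions made for this sketch, and the exponent is taken to be negative so that the weight decays with distance.

```python
import math

def gauss_distance(x, y, w, rho=4):
    """Spatial weight between two object positions, following formula (1).

    x, y : (column, row) positions; using centre points is an assumption.
    w    : maximum of the image width and height.
    rho  : positive integer, giving the scale sigma = w / rho.
    """
    dist = math.hypot(x[0] - y[0], x[1] - y[1])   # physical distance ||x - y||
    sigma = w / rho
    linear = 1.0 - dist / (w - 1)                 # linear fall-off term
    return linear * math.exp(-dist ** 2 / (2.0 * sigma ** 2))

w = 100
near = gauss_distance((50, 50), (55, 50), w)   # object 5 pixels away
far = gauss_distance((50, 50), (90, 50), w)    # object 40 pixels away
print(near > far)  # True
```

The weight equals 1 for coincident positions and falls to 0 at a distance of w−1, so background objects close to x dominate any sum weighted by it.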
Denote SF(x, yi) as the absolute difference between object x and yi in feature F. The salience estimation SF(x), the overall salience degree of object x in feature F, can then be expressed as formula (2).
$$S_F(x) = \frac{\sum_{i=1}^{n} S_F(x, y_i) \cdot d_{gauss}(x, y_i)}{\sum_{i=1}^{n} d_{gauss}(x, y_i)} \tag{2}$$
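Formula (2) is a weighted average of feature differences, which the following sketch makes concrete. It assumes that each object is reduced to a (position, feature) pair with a single scalar feature F, that the exponent in formula (1) is negative, and that ρ is 4; the function names and the sample values are hypothetical.

```python
import math

def gauss_distance(p, q, w, rho=4):
    """Formula (1) with sigma = w / rho."""
    dist = math.hypot(p[0] - q[0], p[1] - q[1])
    sigma = w / rho
    return (1.0 - dist / (w - 1)) * math.exp(-dist ** 2 / (2.0 * sigma ** 2))

def salience(x, background, w, rho=4):
    """Formula (2): x and each background entry are (position, feature) pairs."""
    x_pos, x_feat = x
    num = den = 0.0
    for y_pos, y_feat in background:
        d = gauss_distance(x_pos, y_pos, w, rho)
        num += abs(x_feat - y_feat) * d   # S_F(x, y_i) * d_gauss(x, y_i)
        den += d
    return num / den

w = 100
background = [((10, 50), 0.2), ((90, 50), 0.8)]
# Same position, but the feature matches either the near or the far object.
matches_near = salience(((20, 50), 0.2), background, w)
differs_near = salience(((20, 50), 0.8), background, w)
print(differs_near > matches_near)  # True
```

Because the nearby background object carries almost all of the weight, an object that differs from it scores far higher than one that differs only from the distant object.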
By the definition of the salience estimation, it can be concluded that:
1. The larger the difference between the object and its surroundings, the more salient the object is.
2. The closer the object is to surroundings that differ from it in the feature, the more salient the object is. That is, the ability of human vision to distinguish differences decreases with distance. The attenuation coefficient is measured by dgauss, which is consistent with visual physiology.
These properties make SF(x) a useful salience estimation in feature F. Unfortunately, some important properties of human perception are not considered in SF(x).
FIG. 2a is an original image of Skating to be estimated and FIG. 3a is the salience estimation result of FIG. 2a using the conventional object-based visual attention model.
FIG. 2b is an original image of Coastguard to be estimated and FIG. 3b is the salience estimation result of FIG. 2b using the conventional object-based visual attention model.
In both FIG. 3a and FIG. 3b, white means a very salient object, black means an object that is not salient, and the grey levels between white and black represent the salience degree.
From FIG. 3a we can see that the audience is considered salient because its colour differs greatly from that of its neighbours, although the audience region actually contains no details. Viewers usually do not focus on the audience and perceive it as “video texture”.
Also in FIG. 3a, there is a small grey block to the left of the female dancer's head. The block consists of a piece of white skating rink that is surrounded by the black clothing of the male dancer and the skin of the female dancer, and it is salient in this local area. Taken as a whole, however, this block is part of the large skating rink and will not attract viewers' attention. This is called the “local effect”. Because of the local effect, the accumulated difference between the object and its neighbours is large, and thus the object is recognized as salient.
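The local effect can be reproduced numerically with formula (2). The sketch below is only illustrative: the positions, the scalar feature values standing in for rink, clothing, and skin, the negative exponent in formula (1), and ρ = 4 are all assumptions, not values taken from FIG. 3a.

```python
import math

def gauss_distance(p, q, w, rho=4):
    """Formula (1) with sigma = w / rho."""
    dist = math.hypot(p[0] - q[0], p[1] - q[1])
    sigma = w / rho
    return (1.0 - dist / (w - 1)) * math.exp(-dist ** 2 / (2.0 * sigma ** 2))

def salience(x, background, w):
    """Formula (2) for (position, feature) pairs."""
    num = den = 0.0
    for y_pos, y_feat in background:
        d = gauss_distance(x[0], y_pos, w)
        num += abs(x[1] - y_feat) * d
        den += d
    return num / den

w = 100
patch = ((50, 50), 1.0)                      # small white piece of rink
near = [((45, 50), 0.0), ((55, 50), 0.3)]    # dark clothing and skin next to it
far = [((10, 10), 1.0), ((90, 10), 1.0),     # rest of the rink, same feature
       ((10, 90), 1.0), ((90, 90), 1.0)]
local_salience = salience(patch, near + far, w)
print(round(local_salience, 2))
```

With these made-up values the patch scores about 0.8 on a 0 to 1 scale even though four of the six background objects have exactly the same feature, because the two adjacent objects carry most of the Gauss distance weight.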
From the foregoing description we can see that the conventional object-based visual attention model is not efficient enough, and that many properties of human vision are not considered:
1. Object size: Estimating the influence of object size on salience degree is a complex problem. For example, (a) if all neighbouring objects yi are of the same size s and the size of object x decreases from s to 0, the salience degree of x (SF(x)) will decrease gradually; (b) if all neighbouring objects yi are of the same size s and the size of object x decreases from s1 to s2 (s1>>s, and s1>s2>s), SF(x) will increase gradually. Thus the relationship between object size and salience degree is not monotonic. The problem becomes even more complex when each of the objects may have an arbitrary size.
2. Local effect: If an object is not salient among its near neighbours (local area) while its far neighbours differ greatly from it, there are two possible results: (a) the object is not salient at all within the whole image; (b) the local area as a whole is salient within the image, with the object being a member of that local area. In either case, the salience degree of the object does not match the one defined above.
3. Video texture: If the object features of an image are uniformly random, a human viewer will usually ignore the details of the whole image and no object in the image is salient, whereas the SF(x) defined above will be a large number for every object in the image.
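The video texture limitation can be checked with a small experiment. The sketch assumes a 6×6 grid of equally sized objects, one scalar feature per object drawn uniformly from [0, 1] with an arbitrary fixed seed, a negative exponent in formula (1), and ρ = 4; all names are hypothetical.

```python
import math
import random

def gauss_distance(p, q, w, rho=4):
    """Formula (1) with sigma = w / rho."""
    dist = math.hypot(p[0] - q[0], p[1] - q[1])
    sigma = w / rho
    return (1.0 - dist / (w - 1)) * math.exp(-dist ** 2 / (2.0 * sigma ** 2))

def salience(x, background, w):
    """Formula (2) for (position, feature) pairs."""
    num = den = 0.0
    for y_pos, y_feat in background:
        d = gauss_distance(x[0], y_pos, w)
        num += abs(x[1] - y_feat) * d
        den += d
    return num / den

w = 100
grid = [(c, r) for r in range(20, 81, 12) for c in range(20, 81, 12)]  # 36 objects

def all_saliences(features):
    """S_F(x) of every object, each measured against all the others."""
    objs = list(zip(grid, features))
    return [salience(objs[i], objs[:i] + objs[i + 1:], w)
            for i in range(len(objs))]

rng = random.Random(0)                                   # arbitrary fixed seed
texture = all_saliences([rng.random() for _ in grid])    # uniformly random features
flat = all_saliences([0.5] * len(grid))                  # featureless image

print(min(texture), sum(texture) / len(texture))
print(max(flat))
```

For independent features uniform on [0, 1], the expected absolute difference between two objects is 1/3, so the weighted average in formula (2) has an expected value of 1/3 for every object, whereas it is exactly 0 for the featureless image. A viewer would regard no object as salient in either case.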
With all these limitations, the conventional object-based visual attention model is far from applicable. Therefore an improved object-based visual attention model is desirable.