The present invention relates to analysis of video quality, and more particularly to an improved visual attention model for automatically determining regions of interest within images of a video signal.
Models of early visual systems that are tuned appropriately provide accurate predictions of the location of visible distortions in compressed natural images. To produce an estimate of subjective quality from fidelity maps, current state-of-the-art quality metrics perform a simple summation of all visible errors. This fails to take into account any higher level or cognitive factors that are known to occur during the subjective assessment of picture quality.
The effect that a distortion has on overall picture quality is known to depend strongly on its location with respect to scene content. The variable resolution nature of the Human Visual System (HVS) means that high acuity is only available in the fovea, which has a diameter of about 2 degrees. Knowledge of a scene is obtained through regular eye movements to reposition the area under foveal view. Early vision models assume an "infinite fovea", i.e., the scene is processed under the assumption that all areas are viewed by the high acuity fovea. However, studies of eye movements indicate that viewers do not foveate all areas in a scene equally. Instead, a few areas are identified as regions of interest (ROIs) by human visual attention processes, and viewers tend to repeatedly return to these ROIs rather than other areas that have not yet been foveated. The fidelity of the picture in these ROIs is known to have the strongest influence on overall picture quality.
The knowledge of human visual attention and eye movements, coupled with selective and correlated eye movement patterns of subjects when viewing natural scenes, provides a framework for the development of computational models of human visual attention. Studies have shown that people's attention is influenced by a number of different features that are present in the picture: motion, luminance contrast, color contrast, object size, object shape, people and faces, location within the scene, and whether the object is part of the foreground or background. A handful of simple visual attention models have been proposed in the literature. These models aim to detect the ROIs in a scene in an unsupervised manner. They have generally been designed for use with uncomplicated still images. A number of deficiencies are apparent in these models which prevent their use as robust attention models for typical entertainment video. These include: the limited number of attention features used; the failure to apply different weights to the different features; the lack of robustness in segmentation techniques; the absence of a temporal model; and the oversimplified algorithms used to extract the attention features. None of the proposed models has been demonstrated to work robustly across a wide range of picture content, and their correlation to people's eye movements has not been reported.
The paper entitled "A Perceptually Based Quantization Technique for MPEG Encoding", Proceedings SPIE 3299: Human Vision and Electronic Imaging III, San Jose, USA, pp. 48-159, 26-29 January 1998, by Wilfried Osberger, Anthony J. Maeder and Neil Bergmann, discloses a technique for automatically determining the visually important areas in a scene as Importance Maps (IMs). These maps are generated by combining factors known to influence human visual attention and eye movements, as indicated above. For encoding, lower quantization is assigned to visually important areas, and harsher quantization is assigned to areas of lesser visual importance. Results indicate a subjective improvement in picture quality.
In this prior technique, segmentation is performed using a classic recursive split-and-merge segmentation. After segmentation, the results were processed by five spatial features to produce individual spatial importance maps: contrast, size, shape, location and background. Motion was also taken into consideration to produce a temporal importance map. Each of these individual importance maps was squared to enhance areas of high importance, and the squared maps were then weighted equally to produce a final IM. However, it was felt that this technique was not robust enough.
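The combination step of the prior technique (square each individual importance map, then weight the maps equally) can be illustrated with a minimal sketch. The function name `combine_prior_im`, the representation of a map as a list of rows of values in the range 0 to 1, and the use of a plain average for the equal weighting are assumptions made for illustration only; they are not taken from the cited paper.

```python
def combine_prior_im(feature_maps):
    """Square each importance map and average the results with equal weights.

    feature_maps: list of 2-D maps (lists of rows of floats, assumed in [0, 1]),
    all with the same dimensions. Hypothetical helper for illustration.
    """
    if not feature_maps:
        raise ValueError("at least one feature map is required")
    rows = len(feature_maps[0])
    cols = len(feature_maps[0][0])
    count = len(feature_maps)
    # Squaring keeps values near 1 high and suppresses mid-range values,
    # which emphasises the areas each feature rates as most important.
    return [
        [sum(m[r][c] ** 2 for m in feature_maps) / count for c in range(cols)]
        for r in range(rows)
    ]


# Two toy 1x2 maps standing in for, e.g., the contrast and size features.
contrast_map = [[1.0, 0.5]]
size_map = [[1.0, 0.0]]
final_im = combine_prior_im([contrast_map, size_map])
```

Because every feature receives the same weight in this scheme, a feature that attracts attention strongly cannot count for more than one that attracts it weakly.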
What is desired is an automatic way to predict where ROIs are likely to be located in a natural scene of a typical entertainment video using the properties of human attention and eye movements that is more robust than prior techniques.
Accordingly, the present invention provides a method for automatically identifying regions of interest in a video picture using a visual attention model. A current frame is adaptively segmented into regions based upon both color and luminance. Each region is processed in parallel by a plurality of spatial feature algorithms, including color and skin, to produce respective spatial importance maps. The spatial importance maps are combined to produce an overall spatial importance map, the combining being based upon weights derived from eye movement studies. The current frame and a previous frame are also processed to produce motion vectors for the current frame, after which the motion vectors are corrected for camera motion before being converted into a temporal importance map. The overall spatial importance map and the temporal importance map are combined by linear weighting to produce a total importance map for the current frame, with the linear weighting constant being derived from the eye movement studies.
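The two combining steps summarized above may be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the function names, the normalization of the feature weights by their sum, the form k*spatial + (1 - k)*temporal for the linear weighting, and all numeric values are hypothetical placeholders. The weights and the constant derived from the eye movement studies are not given in this section.

```python
def weighted_spatial_map(spatial_maps, weights):
    """Combine per-feature spatial importance maps using per-feature weights.

    spatial_maps: dict of feature name -> 2-D map (list of rows of floats).
    weights: dict of feature name -> non-negative weight (placeholder values;
    the real weights would come from eye movement studies).
    """
    total_weight = sum(weights[name] for name in spatial_maps)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    first = next(iter(spatial_maps.values()))
    rows, cols = len(first), len(first[0])
    return [
        [
            sum(weights[name] * m[r][c] for name, m in spatial_maps.items())
            / total_weight
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def total_importance_map(spatial_map, temporal_map, k_spatial):
    """Linearly weight the spatial and temporal maps (assumed form).

    k_spatial: assumed constant in [0, 1]; the temporal map receives 1 - k_spatial.
    """
    if not 0.0 <= k_spatial <= 1.0:
        raise ValueError("k_spatial must lie in [0, 1]")
    return [
        [k_spatial * s + (1.0 - k_spatial) * t for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(spatial_map, temporal_map)
    ]


# Toy 1x2 maps; feature names follow the text, values are made up.
spatial = weighted_spatial_map(
    {"color": [[1.0, 0.0]], "skin": [[0.0, 1.0]]},
    {"color": 3.0, "skin": 1.0},
)
total = total_importance_map(spatial, [[0.25, 0.75]], k_spatial=0.5)
```

In this sketch the temporal map is taken as given; segmentation, motion estimation and camera motion correction are outside its scope.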
The objects, advantages and other novel features of the present invention are apparent from the following detailed description when read in conjunction with the appended claims and attached drawing.