Technical Field
The present description relates to techniques for processing of image signals (video signals). The present description has been developed with particular attention paid to the possible application to region-of-interest (ROI) detection, for example in applications of the type commonly referred to as “teleconference” or “telepresence”.
Description of the Related Art
Videoconference (or telepresence) is a communication technology that enables users located in positions remote from one another to communicate via a communication network.
In a typical system of this sort, each user has available a display, a video camera (such as, for example, a webcam), a microphone and an Internet connection. The users can thus see and hear each other in real time and conduct a natural conversation with modalities of interaction that are not easy to achieve with voice-only communication technologies.
The corresponding advantages are appreciated to an increasing extent both in applications of a professional and working nature and in personal and private use with recourse to videoconference software such as Skype, Google Talk, Wengo, etc.
In the diffusion of telepresence technologies, in addition to commercial factors (market awareness) and to the definition of increasingly wide-ranging criteria of interoperability, a significant factor is the video quality. The latter depends not only upon the available bandwidth and the quality offered by the reproduction tools, but also upon the video resolution of the camera used and upon the corresponding codec.
In addition to the efforts aimed at increasing the bandwidth available for transmission, increasing attention is paid to coding techniques, such as techniques based upon the visual-attention model (VAM), i.e., a formalization of how the human visual system (HVS) is able to distinguish objects that attract the eye and that thus acquire importance as compared to elements of lower attraction/interest.
The literature illustrates various techniques that facilitate combination of the information on colors, shapes, and motion to give rise to various VAMs so as to reproduce the visual attention of human observers. For a general review of these techniques reference may be made, for example, to documents such as:
B. Menser, M. Brunig, “Face Detection and Tracking for Video Coding Applications”, IEEE Conference on SSC, October-November 2000, Pacific Grove, Calif.;
Q. Chen et al., “Application of Scalable Visual Sensitivity Profile in Image and Video Coding”, ISCAS 2008, June 2008, Seattle, Wash.
U.S. Patent Publication No. 2007/0076957 (entitled “Video Frame Motion-Based Automatic Region-Of-Interest Detection”) describes processing techniques for identification of regions of interest (ROIs) that are based, for example, on statistical data regarding a video signal and on processing information, at the video-camera end, so as to generate a map of the skin tones (skin map). This document also describes a technique of ROI detection that uses motion information obtained during motion estimation in video processing so as to identify regions of interest. For the purposes of video communication, the regions of greatest interest are the faces, so that the method described applies a face detector to the areas identified via skin information or motion information, obtained independently, or via a combination of both. The adaptive choice among factors linked to the presence of skin, motion, or both is correlated to considerations of homogeneity and of quality of the skin map (so-called intra-mode ROI detection) and to changes of the complexity of the motion (inter-mode skin detection).
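By way of illustration only, the general idea of choosing adaptively among skin information, motion information, or both may be sketched as follows. The sketch is not the method of the cited publication: the frame-difference motion test, the coverage-based quality check, the function names (motion_mask, coverage, select_candidates), and all threshold values are assumptions introduced here purely for the example.

```python
def motion_mask(prev_luma, curr_luma, threshold=20):
    """Mark pixels whose luma changed by more than `threshold` between two frames.

    Frames are lists of rows of integer luma values; `threshold` is an assumed value.
    """
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_luma, curr_luma)
    ]


def coverage(mask):
    """Fraction of pixels set in a boolean mask (0.0 for an empty mask)."""
    total = sum(len(row) for row in mask)
    return sum(1 for row in mask for v in row if v) / total if total else 0.0


def select_candidates(skin, motion, min_cov=0.02, max_cov=0.6):
    """Pick skin, motion, or both as the source of candidate ROI pixels.

    A mask is treated as usable when its coverage lies in [min_cov, max_cov];
    this stands in for a real quality/homogeneity measure (assumption).
    """
    skin_ok = min_cov <= coverage(skin) <= max_cov
    motion_ok = min_cov <= coverage(motion) <= max_cov
    if skin_ok and motion_ok:
        # Both cues are usable: keep only pixels flagged by both.
        both = [[s and m for s, m in zip(srow, mrow)]
                for srow, mrow in zip(skin, motion)]
        return "both", both
    if skin_ok:
        return "skin", skin
    if motion_ok:
        return "motion", motion
    return "none", [[False] * len(row) for row in skin]
```

In a complete chain, the candidate mask returned by such a selection step would then be passed to a face detector, which is outside the scope of this sketch.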
Further documents, such as U.S. Patent Publication No. 2006/0215752 (entitled “Region-Of-Interest Extraction for Video Telephony”) and U.S. Patent Publication No. 2006/0215753 (entitled “Region-Of-Interest Processing for Video-Telephony”), describe, with reference to video-telephony applications, solutions in which the transmission and reception devices are equipped so as to be able to act in a symmetrical way both as transmitter and as receiver of video information. During operation as receiver, each device can define far-end ROI information for the video signal encoded by the other device operating as transmitter. During operation as transmitter, each device can define near-end ROI information for the video information transmitted to the other device that functions as receiver. The devices in question can hence be considered as “ROI-aware” in the sense that each of them is able to carry out processing, starting from the ROI information supplied by the other device, so as to support a far-end control of the video coding on the basis of ROI information. This solution can operate on pre-defined configurations (for example, rectangular portions of image with different dimensions), on the basis of verbal, graphic, or text descriptions supplied by the remote user, or via an automatic ROI identification, for example based upon traditional schemes of face identification.
U.S. Pat. No. 6,343,141 B1 (entitled “Skin Area Detection for Video Image Systems”) proposes use of a skin detector to identify skin areas in video sequences to be used for a function of video coding/decoding. The detector identifies the regions of interest in the video frame by initially analyzing the shape of all the objects in a video sequence so as to locate one or more objects that could contain skin areas (for example, it is possible to exploit the fact that faces have an approximately elliptical shape, causing the system to search for objects of an elliptical shape). The detector then examines the pixels of the located objects to determine whether they present colorimetric characteristics typical of skin areas. The detector then compares the skin tones thus identified with the tones of the entire frame so as to determine other possible regions with skin tones.
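Purely as an illustration of the colorimetric test mentioned above, a per-pixel skin-tone check may be sketched as follows. The sketch covers only the color step, not the shape analysis or the frame-wide comparison, and it is not taken from the cited patent: the conversion uses the full-range BT.601 (JPEG-style) RGB to CbCr formulas, while the chrominance ranges (Cb in 77 to 127, Cr in 133 to 173) are commonly quoted values adopted here as an assumption. The function names are likewise hypothetical.

```python
def rgb_to_cbcr(r, g, b):
    """Convert 8-bit RGB to the Cb and Cr chrominance components (full-range BT.601)."""
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr


def is_skin(pixel, cb_range=(77, 127), cr_range=(133, 173)):
    """Return True when the pixel chrominance falls in the assumed skin-tone ranges."""
    cb, cr = rgb_to_cbcr(*pixel)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_mask(frame):
    """Build a boolean skin map for a frame given as rows of (R, G, B) tuples."""
    return [[is_skin(p) for p in row] for row in frame]


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the set pixels, or None if there are none."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)
```

A rectangular bounding box is used here only because it is the simplest way to turn a skin map into a candidate region; an elliptical fit, as suggested by the shape analysis described above, would be an alternative.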