The invention relates to the field of video image processing, and in particular to a method and system for image processing in video conferencing as described in the preamble of the corresponding independent claims.
Effective communication using current video conferencing systems is severely hindered by the lack of eye contact caused by the disparity between the locations of the subject and the camera. While this problem has been partially solved for high-end expensive video conferencing systems, it has not been convincingly solved for consumer-level setups.
It has been firmly established [Argyle and Cook 1976; Chen 2002; Macrae et al. 2002] that mutual gaze awareness (i.e., eye contact) is a critical aspect of human communication, both in person and over an electronic link such as a video conferencing system [Grayson and Monk 2003; Mukawa et al. 2005; Monk and Gale 2002]. Thus, in order to realistically imitate real-world communication patterns in virtual communication, it is critical that eye contact be preserved. Unfortunately, conventional hardware setups for consumer video conferencing inherently prevent this. During a session we tend to look at the face of the person talking, rendered in a window within the display, and not at the camera, which is typically located at the top or bottom of the screen. Therefore, it is not possible to make eye contact. People who use consumer video conferencing systems, such as Skype, experience this problem frequently: they constantly have the illusion that their conversation partner is looking somewhere above or below them. The lack of eye contact makes communication awkward and unnatural. This problem has been around since the dawn of video conferencing [Stokes 1969] and has not yet been convincingly addressed for consumer-level systems.
While full gaze awareness is a complex psychological phenomenon [Chen 2002; Argyle and Cook 1976], mutual gaze or eye contact has a simple geometric description: the subjects making eye contact must be in the center of their mutual line of sight [Monk and Gale 2002]. Using this simplified model, the gaze problem can be cast as a novel view synthesis problem: render the scene from a virtual camera placed along the line of sight [Chen 2002]. One way to do this is through the use of custom-made hardware setups that change the position of the camera using a system of mirrors [Okada et al. 1994; Ishii and Kobayashi 1992]. These setups are usually too expensive for a consumer-level system.
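The geometric description of eye contact given above can be illustrated with a small calculation. The following is a minimal sketch, not part of any cited system: it measures the angle between the subject's line of sight (toward the on-screen face) and the direction toward the real camera, which is the gaze error that a virtual camera placed on the line of sight would remove. The function name, the coordinate frame, and the example distances are all assumptions chosen for illustration.

```python
import math


def _sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(v):
    return math.sqrt(_dot(v, v))


def gaze_error_degrees(eye, looked_at, camera):
    """Angle (degrees) between the line of sight and the direction to the camera.

    eye       -- 3D position of the subject's eyes (hypothetical frame, metres)
    looked_at -- 3D point on the display the subject is looking at
    camera    -- 3D position of the capturing camera
    A result of 0 means the camera lies on the line of sight (eye contact).
    """
    to_target = _sub(looked_at, eye)
    to_camera = _sub(camera, eye)
    cos_angle = _dot(to_target, to_camera) / (_norm(to_target) * _norm(to_camera))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# Assumed example: subject 0.6 m in front of the screen, looking at a face
# rendered at the screen origin, with a camera mounted 0.15 m above that point.
eye = (0.0, 0.0, 0.6)
face_on_screen = (0.0, 0.0, 0.0)
real_camera = (0.0, 0.15, 0.0)

print(gaze_error_degrees(eye, face_on_screen, real_camera))
# A virtual camera placed on the line of sight has no gaze error.
print(gaze_error_degrees(eye, face_on_screen, face_on_screen))
```

Under this simplified model, any camera position that yields a zero angle satisfies the eye contact condition, which is why the problem reduces to synthesizing the view from such a position.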
The alternative is to use software algorithms to synthesize an image from a novel viewpoint different from that of the real camera. Systems that can convincingly do novel view synthesis typically consist of multiple camera setups [Matusik et al. 2000; Matusik and Pfister 2004; Zitnick et al. 2004; Petit et al. 2010; Kuster et al. 2011] and proceed in two stages. In the first stage they reconstruct the geometry of the scene and in the second stage, render the geometry from the novel viewpoint. These methods require a number of cameras too large to be practical or affordable for a typical consumer. They have a convoluted setup and are difficult to run in real-time.
With the emergence of consumer-level depth and color cameras such as the Kinect [Microsoft 2010], it is possible to acquire both color and geometry in real-time. This can greatly facilitate solutions to the novel view synthesis problem, as demonstrated by Kuster et al. [2011]. Since over 15 million Kinect devices have already been sold, technology experts predict that depth/color hybrid cameras will soon be as ubiquitous as webcams and in a few years will even be available on mobile devices. Given the recent overwhelming popularity of such hybrid sensors, we propose a setup consisting of only one such device. At first glance the solution seems obvious: if the geometry and the appearance of the objects in the scene are known, then all that needs to be done is to render this 3D scene from the correct novel viewpoint. However, some fundamental challenges and limitations should be noted:
The available geometry is limited to a depth map from a single viewpoint. As such, it is very sensitive to occlusions, and synthesizing the scene from an arbitrary (novel) viewpoint may result in many holes due to the lack of both color and depth information, as illustrated in FIG. 2 (left). It might be possible to fill these holes in a plausible way using texture synthesis methods, but the filled regions will not correspond to the true background.
The depth map tends to be particularly inaccurate along silhouettes and will lead to many flickering artifacts.
Humans are very sensitive to faces, so small errors in the geometry could lead to distortions that may be small in a geometric sense but very large in a perceptual sense.
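The first limitation, holes caused by occlusion when a single depth map is rendered from a novel viewpoint, can be reproduced with a toy example. The sketch below is an illustration under stated assumptions, not the method of any cited work: it forward-warps a single scanline of depth values into a horizontally shifted pinhole camera using a z-buffer. The function name, focal length, camera shift, and depth values are all hypothetical.

```python
def warp_scanline(depth, focal_px, shift_m):
    """Forward-warp one row of a depth map into a horizontally shifted camera.

    depth    -- list of depths in metres, one per pixel (None = no measurement)
    focal_px -- assumed pinhole focal length in pixels
    shift_m  -- assumed sideways translation of the virtual camera in metres
    Returns a list of the same width; None marks pixels with no data (holes).
    """
    width = len(depth)
    warped = [None] * width
    for u, z in enumerate(depth):
        if z is None or z <= 0:
            continue
        # A point at depth z moves by focal_px * shift_m / z pixels,
        # so near points move further than far points.
        u_new = round(u - focal_px * shift_m / z)
        if 0 <= u_new < width:
            # z-buffer test: keep the surface closest to the camera.
            if warped[u_new] is None or z < warped[u_new]:
                warped[u_new] = z
    return warped


# Toy scene: a foreground object (1 m) in front of a background wall (3 m).
scanline = [3.0] * 4 + [1.0] * 4 + [3.0] * 4
warped = warp_scanline(scanline, focal_px=6.0, shift_m=0.5)
holes = [u for u, z in enumerate(warped) if z is None]

print(warped)
print(holes)  # pixels 5 and 6 are disoccluded background; pixel 11 left the field of view
```

The holes at pixels 5 and 6 lie next to the foreground silhouette: the virtual camera sees background there that the real camera never observed, so neither depth nor color is available for those pixels.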
Gaze correction is a very important issue for teleconferencing and many experimental and commercial systems support it [Jones et al. 2009; Nguyen and Canny 2005; Gross et al. 2003; Okada et al. 1994]. However, these systems often use expensive custom-made hardware devices that are not suitable for mainstream home use. Conceptually, the gaze correction problem is closely related to the real-time novel-view synthesis problem [Matusik et al. 2000; Matusik and Pfister 2004; Zitnick et al. 2004; Petit et al. 2010; Kuster et al. 2011]. Indeed, if a scene could be rendered from an arbitrary viewpoint, then a virtual camera could be placed along the line of sight of the subject and this would achieve eye contact. Novel view synthesis using simple video cameras has been studied for the last 15 years, but unless a large number of video cameras are used, it is difficult to obtain high-quality results. Such setups are not suitable for our application model, which targets real-time processing and inexpensive hardware.
There are several techniques designed specifically for gaze correction that are more suitable for an inexpensive setup. Some systems only require two cameras [Criminisi et al. 2003; Yang and Zhang 2002] to synthesize a gaze-corrected image of the face. They accomplish this by performing a smart blending of the two images. This setup constrains the position of the virtual camera to the path between the two real cameras. More importantly, the setup requires careful calibration and is sensitive to light conditions, which makes it impractical for mainstream use.
Several methods use only one color camera to perform gaze correction. Some of these [Cham et al. 2002] work purely in image space, trying to find an optimal warp of the image, and are able to obtain reasonable results only for very small corrections. This is because without some prior knowledge about the shape of the face it is difficult to synthesize a convincing image. Thus other methods use a proxy geometry to synthesize the gaze-corrected image. Yip et al. [2003] use an elliptical model for the head and Gemmell [2000] uses an ad-hoc model based on the face features. However, templates are static whereas faces are dynamic, so a single static template will typically fail when confronted with a large variety of facial expressions.
Since the main focus of many of these methods is reconstructing the underlying geometry of the head or face, the emergence of consumer-level depth/color sensors such as the Kinect, which give easy access to real-time geometry and color information, is an important technological breakthrough that can be harnessed to solve the problem. Zhu et al. [2011] proposed a setup containing one depth camera and three color cameras and combined the depth map with a stereo reconstruction from the color cameras. However, this setup only reconstructs the foreground image and is still not inexpensive.