A primary concern with video teleconferencing systems is the frequent lack of eye contact between participants. In the most common configuration, each participant uses a computer monitor on which an image of the second participant is displayed, while a camera mounted above the monitor captures the image of the local participant for display on the monitor of the second participant. Since participants frequently look at the monitor, either at the image of the second participant or elsewhere on the display, rather than directly at the video camera, there is the appearance that the participants are not looking at one another, resulting in an unsatisfactory user experience.
Many prior art solutions to the eye contact problem have incorporated half-silvered, partially transmissive and partially reflective mirrors, or beamsplitters.
These solutions have typically incorporated a beamsplitter placed in front of a computer display at a 45 degree angle. In one typical configuration, a video camera, located behind the beamsplitter, captures an image of the local participant through the beamsplitter. The local participant views an image of the second participant on the display as reflected by the beamsplitter.
In devices incorporating a conventional CRT, the resulting device is both bulky in appearance and physically cumbersome. Furthermore, in cases involving an upward facing display, the display is viewable both directly and as reflected by the beamsplitter, greatly distracting the local participant. To alleviate this problem, prior solutions, including those described in U.S. Pat. Nos. 5,117,285 and 5,612,734, have introduced complicated systems involving polarizers or micro-louvers to obstruct a direct view of the upward facing display by the local participant. In all cases, the image of the second participant appears recessed within the housing holding the display, beamsplitter, and video camera. The resulting distant appearance of the second participant greatly diminishes the sense of intimacy sought during videoconferencing.
Another series of prior art attempts to alleviate this problem through the use of computational algorithms that manipulate the transmitted or received video image. For example, U.S. Pat. No. 5,500,671 describes a system that addresses the eye contact problem by creating an intermediate three-dimensional model of the participant based on images captured by two imaging devices on either side of the local display. Using this model, the system repositions artificially generated eyes at an appropriate position within the image of the local participant transmitted to the second participant. The resulting image, with artificially generated eyes and a slight but frequent mismatch between the position of the eyes relative to the head and body of the participant, is unnatural in appearance. Furthermore, the creation of an intermediate three-dimensional model is computationally intensive, making it difficult to implement in practice.
U.S. Pat. No. 5,359,362 describes a system “using at each station of a video conferencing system at least a pair of cameras, neither of which is on the same optical axis as the local monitor, to obtain a three-dimensional description of the speaker and from this description obtaining for reproduction by the remote monitor at the listener's station a virtual image corresponding to the view along the optical axis of the camera at the speaker's station. The partial 3D description at the scene can be used to construct an image of the scene from various desired viewpoints. The three dimensional description is most simply obtained by viewing the scene of interest, by a pair of cameras, typically preferably aligned symmetrically on either left and right or above and below, about the optical axis of the monitor, solving the stereo correspondence problem, and then producing the desired two dimensional description of the virtual image for use by the monitor at the listener's station.
. . . (The) process of creating the desired two-dimensional description for use as the virtual image consists of four steps, calibration, stereo matching, reconstruction and interpolation. The calibration converts the view from two tilted cameras into two parallel views important for stereo matching. The stereo matching step matches features, such as pixels, between the two views to obtain a displacement map that provides information on the changes needed to be made in one of the observed views. The reconstruction step constructs the desired virtual view along the axis between the two cameras from the displacement map and an observed view, thereby recovering eye contact. The final step is to fill in by interpolation areas where complete reconstruction is difficult because of gaps in the desired virtual view that result from limitations in the displacement map that was formed.”
Note that U.S. Pat. No. 5,359,362 generates its virtual image by transforming the image obtained by one of the two physical imaging devices. The resulting image does not reflect any features of the local participant that are occluded from the transformed image.
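The reconstruction and interpolation steps quoted above, and the occlusion limitation just noted, can be illustrated on a single rectified scanline. The following is a minimal sketch and not the method of the cited patent: the function names, the sign convention (a pixel at column x in the observed view matches column x - d in the other view), and the use of linear interpolation for gap filling are all assumptions made for illustration.

```python
def synthesize_midpoint_scanline(observed_row, disparity_row):
    """Forward-warp one scanline halfway toward the other camera.

    observed_row: pixel intensities from one observed view.
    disparity_row: per-pixel displacement d to the matching pixel in the
        other view (assumed convention: x_other = x - d), or None where
        stereo matching produced no result.
    Returns a row in which unfilled positions are None.
    """
    width = len(observed_row)
    virtual = [None] * width
    for x, (value, d) in enumerate(zip(observed_row, disparity_row)):
        if d is None:
            continue
        target = x - round(d / 2)  # virtual camera assumed midway between views
        if 0 <= target < width:
            # Later pixels simply overwrite earlier ones in this sketch.
            virtual[target] = value
    return virtual


def fill_gaps(row):
    """Fill None entries by linear interpolation between known neighbors."""
    filled = list(row)
    known = [i for i, v in enumerate(row) if v is not None]
    if not known:
        return filled
    for i, v in enumerate(row):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = row[right]
        elif right is None:
            filled[i] = row[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = (1 - t) * row[left] + t * row[right]
    return filled


row = [10, 20, 30, 40, 50, 60]
disparity = [2, 2, 2, 0, 0, 0]  # near object on the left, far background on the right
warped = synthesize_midpoint_scanline(row, disparity)
print(warped)
print(fill_gaps(warped))
```

In this example the near object shifts by one pixel and leaves a hole where the background was hidden in the observed view. The interpolated value placed in that hole is a guess, not an observation, which is the limitation described above.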
Still other prior art approaches construct a complete mathematical model of the local participant and his nearby surroundings. This mathematical model is then transmitted to the second participant, where it is reconstructed in a manner providing eye contact. Clearly, such systems require that both the remote and local communicants own and operate the same videoconferencing device. This presents a significant obstacle to introduction and widespread adoption of the device.
Consider again the prior art found in U.S. Pat. No. 5,359,362. In such stereo matching systems, a calibration operation is often performed before real-time video conferencing image processing begins, in order to obtain information describing the positioning and optical properties of the imaging devices. First, a camera projection matrix is determined for each of the imaging devices. This camera projection matrix characterizes the correspondence of a point in three-dimensional space to a point in the projective plane imaged by the video camera. The matrix determined is dependent on the position and angular alignment of the camera as well as the radial distortion and zoom factor of the camera lens. One prior art approach employs test patterns and a camera calibration toolbox developed by Jean-Yves Bouguet at the California Institute of Technology. This calibration toolbox draws upon methods described in the papers entitled “Flexible Camera Calibration by Viewing a Plane from Unknown Orientations” by Zhang, “A Four-step Camera Calibration Procedure with Implicit Image Correction” by Heikkila and Silven, “On Plane-Based Camera Calibration: A General Algorithm, Singularities, Applications” by Sturm and Maybank, and “A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses” by R. Y. Tsai.
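The role of a camera projection matrix can be illustrated with an idealized pinhole camera. The following is a minimal sketch under stated assumptions: the function names, the parameterization (a single focal length in pixels, a principal point, and a rotation about the vertical axis only), and the numeric values are hypothetical. Radial lens distortion is omitted here, since it is nonlinear and is commonly modeled as a separate correction rather than inside the 3x4 matrix.

```python
import math


def projection_matrix(focal_px, cx, cy, yaw_rad, camera_center):
    """Build a 3x4 matrix P = K [R | -R C] for an idealized pinhole camera.

    focal_px: focal length in pixels (stands in for the zoom factor).
    cx, cy: principal point in pixels.
    yaw_rad: rotation about the vertical axis (angular alignment).
    camera_center: (x, y, z) position of the camera in world coordinates.
    """
    K = [[focal_px, 0.0, cx],
         [0.0, focal_px, cy],
         [0.0, 0.0, 1.0]]
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    R = [[c, 0.0, -s],
         [0.0, 1.0, 0.0],
         [s, 0.0, c]]
    # Translation column t = -R C moves the world origin to the camera center.
    t = [-sum(R[i][k] * camera_center[k] for k in range(3)) for i in range(3)]
    Rt = [R[i] + [t[i]] for i in range(3)]
    return [[sum(K[i][k] * Rt[k][j] for k in range(3)) for j in range(4)]
            for i in range(3)]


def project(P, point):
    """Map a 3D world point to pixel coordinates using homogeneous division."""
    X = list(point) + [1.0]
    u, v, w = (sum(P[i][k] * X[k] for k in range(4)) for i in range(3))
    return u / w, v / w


# Two hypothetical cameras 0.2 units apart, both looking straight ahead.
P_left = projection_matrix(800.0, 320.0, 240.0, 0.0, (-0.1, 0.0, 0.0))
P_right = projection_matrix(800.0, 320.0, 240.0, 0.0, (0.1, 0.0, 0.0))
point = (0.0, 0.0, 2.0)
print(project(P_left, point), project(P_right, point))
```

The same world point lands at different image columns in the two cameras. That horizontal offset is the quantity the stereo matching step later estimates.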
Following the determination of these camera projection matrices, a two dimensional rectifying transform is determined for each of the pair of imaging devices. The transformation may be determined based on the previously determined camera projection matrices, using an approach described in the paper of Fusiello, Trucco, and Verri entitled “Rectification with unconstrained stereo geometry”. The transformation, when applied to a pair of images obtained from the imaging devices, produces a pair of rectified images. In such a set of images, each pixel in a first video camera image corresponds to a pixel in the second image located along a line at the same vertical location as the pixel in the first image.
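The practical benefit of rectification is that the search for a corresponding pixel collapses from two dimensions to one. The following is a minimal sketch, not the procedure of the cited paper: it assumes the rectifying transform is available as a 3x3 projective matrix H, and it uses a sum of absolute differences over a small horizontal window as a stand-in matching cost. All function names and parameters are hypothetical.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 projective transform H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def match_along_row(first, second, x, y, max_disparity, half_window=1):
    """Find the column on row y of `second` that best matches first[y][x].

    Because the images are assumed rectified, only row y is searched.
    Candidates lie at x - d for d in 0..max_disparity (assumed convention).
    Returns the best column, or None if no candidate window fits the image.
    """
    width = len(first[y])
    best_col, best_cost = None, float("inf")
    for d in range(max_disparity + 1):
        xs = x - d
        if xs - half_window < 0 or x + half_window >= width:
            continue
        cost = sum(abs(first[y][x + k] - second[y][xs + k])
                   for k in range(-half_window, half_window + 1))
        if cost < best_cost:
            best_col, best_cost = xs, cost
    return best_col


# A pure translation is the simplest possible example of such a transform.
H = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
print(apply_homography(H, 10.0, 20.0))

first = [[0, 0, 10, 50, 90, 0, 0, 0]]
second = [[10, 50, 90, 0, 0, 0, 0, 0]]  # same pattern shifted two columns
print(match_along_row(first, second, x=3, y=0, max_disparity=3))
```

A real rectifying transform would be derived from the two camera projection matrices rather than written by hand as it is here.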
The prior art also includes calculating a dense correspondence between the two generated camera images. Several algorithms are available for determining such a dense correspondence, including the method described in the paper of Georges M. Quenot entitled “The ‘Orthogonal Algorithm’ for Optical Flow Detection Using Dynamic Programming”. The Abstract states “This paper introduces a new and original algorithm for optical flow detection. It is based on an iterative search for a displacement field that minimizes the L1 or L2 distance between two images. Both images are sliced into parallel and overlapping strips. Corresponding strips are aligned using dynamic programming exactly as 2D representations of speech signal are with the DTW algorithm. Two passes are performed using orthogonal slicing directions. This process is iterated in a pyramidal fashion by reducing the spacing and width of the strips. This algorithm provides a very high quality matching for calibrated patterns as well as for human visual sensation. The results appear to be at least as good as those obtained with classical optical flow detection methods.”
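The dynamic programming alignment mentioned in the quoted abstract can be illustrated with textbook dynamic time warping (DTW) on two one-dimensional strips. The following is a minimal sketch of that building block only. It is not the Orthogonal Algorithm itself, which adds overlapping strips, two orthogonal passes, and pyramidal iteration. The function name and the sample data are hypothetical, and the L1 distance is used as the per-sample cost.

```python
def dtw_align(a, b):
    """Align sequences a and b by dynamic time warping with an L1 cost.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    pairing a[i] with b[j], monotonically increasing in both indices.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance a only
                                 cost[i][j - 1])      # advance b only
    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


strip_a = [0, 0, 5, 9, 5, 0]
strip_b = [0, 5, 9, 5, 0, 0]  # same bump, displaced by one sample
total, path = dtw_align(strip_a, strip_b)
print(total)
print(path)
```

The difference j - i along the recovered path gives a per-sample displacement between the strips, which is the kind of information a dense correspondence step accumulates across the whole image.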
What is needed is a method for efficient real-time processing of at least two spatially offset image sequences to create a virtual image sequence providing a sense of eye contact, which is of great value in a number of applications including, but not limited to, video conferencing. The sense of eye contact should be maintained across the full range of local participant head positions and gaze directions. The method must provide a natural view of the local participant for the second participant. It must be aesthetically pleasing and easily operated by a typical user. What is further needed is an apparatus that interfaces efficiently with a standard video conferencing system and provides the advantages of such methods of generating virtual image sequences.