This invention relates to the field of video conferencing and in particular to methods and systems for maintaining the appearance of eye contact between communicants in a teleconference.
A primary concern with video teleconferencing systems is the frequent lack of eye contact between participants. In the most common configuration, each participant uses a computer monitor on which an image of the second participant is displayed, while a camera mounted above the monitor captures the image of the local participant for display on the monitor of the second participant. Since participants frequently look at the monitor, either at the image of the second participant or elsewhere on the display, rather than directly at the video camera, there is the appearance that the participants are not looking at one another, resulting in an unsatisfactory user experience.
Many prior art solutions to the eye contact problem have incorporated half-silvered mirrors, which are partially transmissive and partially reflective, known as beamsplitters. These solutions have typically placed a beamsplitter in front of a computer display at a 45 degree angle. In one typical configuration, a video camera located behind the beamsplitter captures an image of the local participant through the beamsplitter, while the local participant views an image of the second participant on the display as reflected by the beamsplitter.
In devices incorporating a conventional CRT, the resulting device is both aesthetically unappealing and physically cumbersome. Furthermore, in configurations involving an upward-facing display, the display is viewable both directly and as reflected by the beamsplitter, greatly distracting the local participant. To alleviate this problem, prior solutions, including those described in U.S. Pat. Nos. 5,117,285 and 5,612,734, have introduced complicated systems involving polarizers or micro-louvers to obstruct the local participant's direct view of the upward-facing display. In all cases, the image of the second participant appears recessed within the housing holding the display, beamsplitter, and video camera. The resulting distant appearance of the second participant greatly diminishes the sense of intimacy sought during videoconferencing.
Another series of prior art attempts to alleviate this problem through the use of computational algorithms that manipulate the transmitted or received video image. For example, U.S. Pat. No. 5,500,671 describes a system that addresses the eye contact problem by creating an intermediate three-dimensional model of the participant based on images captured by two imaging devices on either side of the local display. Using this model, the system repositions artificially generated eyes at an appropriate position within the image of the local participant transmitted to the second participant. The resulting image, with artificially generated eyes and a slight but frequent mismatch between the position of the eyes relative to the head and body of the participant, is unnatural in appearance. Furthermore, the creation of an intermediate three-dimensional model is computationally intensive, making it difficult to implement in practice.
U.S. Pat. No. 5,359,362 describes a system "using at each station of a video conferencing system at least a pair of cameras, neither of which is on the same optical axis as the local monitor, to obtain a three-dimensional description of the speaker and from this description obtaining for reproduction by the remote monitor at the listener's station a virtual image corresponding to the view along the optical axis of the camera at the speaker's station. The partial 3D description of the scene can be used to construct an image of the scene from various desired viewpoints. The three dimensional description is most simply obtained by viewing the scene of interest, by a pair of cameras, typically preferably aligned symmetrically on either left and right or above and below, about the optical axis of the monitor, solving the stereo correspondence problem, and then producing the desired two dimensional description of the virtual image for use by the monitor at the listener's station.
[The] process of creating the desired two-dimensional description for use as the virtual image consists of four steps: calibration, stereo matching, reconstruction and interpolation. The calibration converts the view from two tilted cameras into two parallel views important for stereo matching. The stereo matching step matches features, such as pixels, between the two views to obtain a displacement map that provides information on the changes needed to be made in one of the observed views. The reconstruction step constructs the desired virtual view along the axis between the two cameras from the displacement map and an observed view, thereby recovering eye contact. The final step is to fill in by interpolation areas where complete reconstruction is difficult because of gaps in the desired virtual view that result from limitations in the displacement map that was formed."
Note that U.S. Pat. No. 5,359,362 generates its virtual image by transforming the image obtained by one of the two physical imaging devices. The resulting image does not reflect any features of the local participant that are occluded from the transformed image.
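The reconstruction and interpolation steps quoted above can be illustrated with a minimal one-dimensional Python sketch; the scanline values and displacement map below are hypothetical, and a real implementation would operate on full two-dimensional images.

```python
import numpy as np

def reconstruct_middle_view(view, displacement):
    """Sketch of the 'reconstruction' step: build a virtual view midway
    between two cameras from one observed view and a per-pixel
    horizontal displacement map. Positions receiving no source pixel
    are left as NaN gaps, to be filled by the interpolation step."""
    n = len(view)
    virtual = np.full(n, np.nan)
    for x in range(n):
        # A pixel at x in the observed view appears at x + d in the
        # other view; in the midway view it lands near x + d/2.
        target = int(round(x + displacement[x] / 2.0))
        if 0 <= target < n:
            virtual[target] = view[x]
    return virtual

def fill_gaps(virtual):
    """Sketch of the 'interpolation' step: fill NaN gaps linearly
    from the nearest reconstructed neighbours."""
    idx = np.arange(len(virtual))
    known = ~np.isnan(virtual)
    return np.interp(idx, idx[known], virtual[known])
```

For example, a uniform displacement of two pixels shifts every reconstructed pixel by one, leaving a gap at the left edge that the interpolation step then fills.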
Still other prior art approaches construct a complete mathematical model of the local participant and his nearby surroundings. This mathematical model is then transmitted to the second participant, where it is reconstructed in a manner providing eye contact. Clearly, such systems require that both the remote and local communicants own and operate the same videoconferencing device. This presents a significant obstacle to introduction and widespread adoption of the device.
Consider again the prior art found in U.S. Pat. No. 5,359,362. Often, in such stereo matching systems, a calibration operation is performed before real-time video conferencing image processing begins, to obtain information describing the positioning and optical properties of the imaging devices. First, a camera projection matrix is determined for each of the imaging devices. This camera projection matrix characterizes the correspondence of a point in three-dimensional space to a point in the projective plane imaged by the video camera. The matrix determined depends on the position and angular alignment of the camera as well as the radial distortion and zoom factor of the camera lens. One prior art approach employs test patterns and a camera calibration toolbox developed by Jean-Yves Bouguet at the California Institute of Technology. This calibration toolbox draws upon methods described in the papers entitled "Flexible Camera Calibration by Viewing a Plane from Unknown Orientations" by Zhang, "A Four-step Camera Calibration Procedure with Implicit Image Correction" by Heikkilä and Silvén, "On Plane-Based Camera Calibration: A General Algorithm, Singularities, Applications" by Sturm and Maybank, and "A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses" by R. Y. Tsai.
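The role of the camera projection matrix can be sketched as follows. The intrinsic parameters below are hypothetical placeholders, not values produced by any particular calibration toolbox, and radial lens distortion, which calibration also estimates, is omitted.

```python
import numpy as np

# A camera projection matrix is a 3x4 matrix P mapping a homogeneous
# 3-D point X to a homogeneous image point x = P X. K holds the
# (hypothetical) intrinsics -- focal length and principal point --
# while [R | t] holds the camera's position and angular alignment.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                        # camera aligned with world axes
t = np.zeros((3, 1))                 # camera at the world origin
P = K @ np.hstack([R, t])            # 3x4 projection matrix

def project(P, X):
    """Project a 3-D point (length-3 array) to pixel coordinates."""
    x = P @ np.append(X, 1.0)        # homogeneous image point
    return x[:2] / x[2]              # perspective divide

# A point on the optical axis projects to the principal point.
u, v = project(P, np.array([0.0, 0.0, 2.0]))
```

Calibration recovers P (and the distortion coefficients) from images of a known test pattern; here P is simply constructed by hand for illustration.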
Following the determination of these camera projection matrices, a two-dimensional rectifying transform is determined for each of the pair of imaging devices. The transformation may be determined from the previously determined camera projection matrices, using an approach described in the paper of Fusiello, Trucco, and Verri entitled "Rectification with unconstrained stereo geometry". The transformation, when applied to a pair of images obtained from the imaging devices, produces a pair of rectified images, in which each pixel in the first image corresponds to a pixel in the second image lying on the same horizontal scanline.
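The effect of a rectifying transform can be illustrated with a small Python sketch; the homographies and pixel coordinates below are hypothetical, chosen only to show the scanline-alignment property.

```python
import numpy as np

def apply_homography(H, p):
    """Apply a 3x3 rectifying transform to a pixel (u, v)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

# Suppose a scene point images at row 100 in camera 1 but at row 112
# in camera 2 because the cameras are tilted. A hypothetical rectifying
# transform for camera 2 that shifts rows up by 12 pixels brings the
# pair onto the same scanline, as rectification requires.
H1 = np.eye(3)
H2 = np.array([[1.0, 0.0,   0.0],
               [0.0, 1.0, -12.0],
               [0.0, 0.0,   1.0]])
p1 = apply_homography(H1, (57.0, 100.0))
p2 = apply_homography(H2, (64.0, 112.0))
# p1 and p2 now share the same row; the remaining horizontal
# difference is the disparity used by the stereo matching step.
```

In practice the rectifying transforms are full projective homographies derived from the two projection matrices, not simple translations; the translation here merely makes the row-alignment property easy to verify.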
The prior art also includes calculating a dense correspondence between the two rectified camera images. Several algorithms are available for determining such a dense correspondence, including the method described in the paper of Georges M. Quenot entitled "The 'Orthogonal Algorithm' for Optical Flow Detection Using Dynamic Programming". The abstract states: "This paper introduces a new and original algorithm for optical flow detection. It is based on an iterative search for a displacement field that minimizes the L1 or L2 distance between two images. Both images are sliced into parallel and overlapping strips. Corresponding strips are aligned using dynamic programming exactly as 2D representations of speech signal are with the DTW algorithm. Two passes are performed using orthogonal slicing directions. This process is iterated in a pyramidal fashion by reducing the spacing and width of the strips. This algorithm provides a very high quality matching for calibrated patterns as well as for human visual sensation. The results appears to be at least as good as those obtained with classical optical flow detection methods."
What is needed is a method for efficient real-time processing of at least two spatially offset image sequences to create a virtual image sequence providing a sense of eye contact, which is of great value in a number of applications including, but not limited to, video conferencing. The sense of eye contact should operate effectively across the full range of local participant head positions and gaze directions. The method must provide a natural view of the local participant to the second participant, and it must be aesthetically pleasing and easily operated by a typical user. What is further needed is an apparatus that interfaces efficiently with a standard video conferencing system and provides the advantages of such methods of generating virtual image sequences.
To resolve the identified problems found in the prior art, the present invention creates a head-on view of a local participant, thereby enhancing the sense of eye contact provided during any of the following: a video conference session, a video phone session, a session at a video kiosk, and a video training session. Note that video conference sessions include, but are not limited to, sessions presented via one or more private communications channels and sessions presented via one or more broadcast channels.
A view morphing algorithm is applied to a synchronous collection of images from at least two video imaging devices. These images are interpolated to create interpolation images for each of the video imaging devices. The interpolated images from at least two of the video imaging devices are combined to create a composite image of the local participant. This composite image approximates a head-on view of the local participant, providing excellent eye contact.
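As a rough illustration of the interpolation and compositing described above, the following one-dimensional Python sketch blends two views at the midpoint of a precomputed dense correspondence. The scanline values are hypothetical; actual view morphing operates on two-dimensional images with more elaborate correspondences and blending.

```python
import numpy as np

def morph_midpoint(left, right, disparity):
    """Sketch of view interpolation between two cameras, assuming a
    precomputed dense correspondence: pixel x in the left view matches
    pixel x + disparity[x] in the right view. Each matched pair is
    moved to its midpoint and the two intensities are blended equally,
    approximating the view from a point halfway between the cameras."""
    n = len(left)
    composite = np.full(n, np.nan)
    for x in range(n):
        xr = x + disparity[x]
        if 0 <= xr < n:
            mid = int(round((x + xr) / 2.0))
            composite[mid] = 0.5 * left[x] + 0.5 * right[int(xr)]
    # Fill any unmatched positions from neighbouring composite pixels.
    idx = np.arange(n)
    known = ~np.isnan(composite)
    return np.interp(idx, idx[known], composite[known])
```

With zero disparity the two views agree and the composite reproduces them exactly; nonzero disparities move matched features toward their midpoint positions, which is the essence of interpolating to a virtual viewpoint between the devices.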
It should be noted that a synchronous image collection comprises images received at approximately the same time.
When the video imaging devices are placed in a radially symmetric manner about the local participant, it is often preferred to interpolate the images to a point between the devices. When the devices are not placed in a radially symmetric relationship with the local participant, a more complex mechanism, potentially involving partial extrapolation, may be used to create what are identified herein as the interpolated images.
The video imaging devices are preferably placed on opposite sides of a local display and the composite image further approximates essentially what might be seen from the center of that local display.
This head-on view allows the local participant to look directly at the monitor while still presenting a sense of eye contact to the second participant, actively aiding the sense of personal interaction for all participants.
Certain embodiments of the invention include, but are not limited to, various schemes supporting generation of the composite image, control of composite image generation by at least one of the second participants, and adaptively modifying the current images at certain stages based upon remembered displacements from previous images. These embodiments individually and collectively aid in improving the perceived quality of eye contact.
Aspects of the invention include, but are not limited to, devices implementing the methods of this invention in at least one of the following forms: dedicated execution engines, with or without instruction processing mechanisms; mechanisms involving table lookup of various non-linear functions; and at least one instruction processing computer performing at least some of the steps of the methods as program steps residing within memory accessibly coupled with the computer.