Video teleconferencing systems, such as PC-based systems, are becoming ubiquitous for both business and personal applications. However, such systems do not typically allow for natural eye contact between the participants, because of the angle between the camera, the user, and the video image on the monitor. Most commonly, a camera is placed on top of a monitor or off to its side, while the user looks squarely into the center of the monitor, so that the user's gaze is rotated anywhere from 20 to 70 degrees away from the camera lens. This is a broadly acknowledged problem in the video teleconferencing field, and a weakness in essentially all prior art telepresence (e.g., video teleconferencing) systems. In particular, eye contact has been identified as a key differentiator for a telepresence system, and is a critical element of widespread acceptance of video telephony.
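The angular error described above follows from simple geometry. The sketch below (illustrative Python; the particular offset and viewing-distance values are assumptions for the example, not figures from the text) computes the angle between the user's gaze at the screen center and the direction to the camera lens:

```python
import math

def gaze_offset_deg(camera_offset_cm: float, viewing_distance_cm: float) -> float:
    """Angle between the user's gaze (aimed at the screen center) and the
    camera lens, assuming the camera is displaced laterally or vertically
    by camera_offset_cm and the user's eyes are viewing_distance_cm away."""
    return math.degrees(math.atan2(camera_offset_cm, viewing_distance_cm))

# A webcam ~15 cm above screen center, viewed from ~40 cm (typical desktop use):
print(round(gaze_offset_deg(15, 40), 1))  # → 20.6
```

As the example suggests, closer viewing distances (e.g., laptops and mobile phones) push the error toward the higher end of the 20-to-70-degree range.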
Prior art solutions vary in complexity and effectiveness. For example, certain “high-end” telepresence systems partially solve this problem by placing cameras in the center of a large screen, creating a small region of small angular error in which eye contact appears to work well. However, this only applies to the few participants in the central area of the teleconferencing system, and only when they look slightly across the camera to the far side (e.g., if they are slightly left of their own camera and their counterpart appears slightly to the right of the same camera, in a symmetrical system). This solution is not tenable for smaller, single-monitor systems with participants closer to the screen and camera. It also exhibits a peculiar artifact: the two participants will often report good eye contact, yet the other participants in the conference do not see these two participants as looking at each other. For example, person A may be addressing person B, seated to the right of person C, and person B may perceive that fact, but person C will perceive that person A is addressing someone to the left of person C.
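Why the center-camera approach works only near the screen center can be seen from the geometry. The following sketch (illustrative Python; the seat offsets and viewing distance are assumed values) computes the apparent gaze error, as captured by a screen-center camera, between a participant looking at the counterpart's on-screen image and the same participant looking directly into the lens:

```python
import math

def apparent_error_deg(seat_offset_cm: float, image_offset_cm: float,
                       distance_cm: float) -> float:
    """Angular difference, seen from the participant's eye, between the
    direction to the counterpart's image and the direction to a camera
    at lateral offset 0 (the screen center)."""
    to_image = math.atan2(image_offset_cm - seat_offset_cm, distance_cm)
    to_camera = math.atan2(-seat_offset_cm, distance_cm)
    return abs(math.degrees(to_image - to_camera))

# Pair seated symmetrically just off the lens, 2 m from the screen:
print(round(apparent_error_deg(-5, 5, 200), 2))    # → 1.43
# The same symmetric pair seated 60 cm off-center:
print(round(apparent_error_deg(-60, 60, 200), 2))  # → 14.26
```

The error stays small only for the near-axis pair, consistent with the narrow eye-contact region described above.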
Several groups have attempted to develop algorithms that capture a person's head from multiple camera angles, construct a 3D model thereof, and then project that 3D model back to a 2D image with the necessary adjustment to redirect the gaze. This approach requires substantial processing and is currently somewhat error-prone. In addition, it involves complex lighting issues.
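The core of the rotate-and-reproject step can be sketched as follows (a minimal illustration, assuming a reconstructed point cloud and a simple pinhole camera; real systems must also handle texture, occlusion, and the lighting issues noted above):

```python
import numpy as np

def redirect_and_reproject(points_3d: np.ndarray, yaw_deg: float,
                           focal: float = 500.0) -> np.ndarray:
    """Rotate a reconstructed 3D head about the vertical axis to cancel
    the camera/gaze angular error, then pinhole-project back to 2D.
    points_3d is an (N, 3) array in camera coordinates (z forward)."""
    t = np.radians(yaw_deg)
    rot_y = np.array([[ np.cos(t), 0.0, np.sin(t)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(t), 0.0, np.cos(t)]])
    rotated = points_3d @ rot_y.T
    x, y, z = rotated[:, 0], rotated[:, 1], rotated[:, 2]
    return np.stack([focal * x / z, focal * y / z], axis=1)
```

Even this toy version hints at the difficulty: every pixel of the output depends on an accurate 3D reconstruction, so reconstruction errors appear directly in the redirected image.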
Still other prior art systems use an avatar approach, in which a person's head position is captured during a calibration stage, the angular error is removed, and the coordinates of the head's position are transmitted as they change over time. This head position is used to draw an avatar representing the talker. Unfortunately, it is beyond the current state of the art to use this information to control a realistic-looking avatar of the speaker, or to accurately capture his or her facial movements. Current systems, and those in the foreseeable future, work around this by using “cartoon”-style or fanciful pictures of the speaker. Although this approach conveys a gross sense of the speaker's body gestures, it does not provide real eyes to make contact with, and any subtle, and many not-so-subtle, gestures are lost.
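The calibrate-subtract-transmit pipeline of the avatar approach can be sketched as follows (the wire format and field names are hypothetical, for illustration only):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class HeadPose:
    yaw: float    # degrees, positive toward the viewer's right
    pitch: float  # degrees, positive up
    roll: float   # degrees

def make_pose_message(raw: HeadPose, calibration: HeadPose, timestamp: float) -> str:
    """Subtract the fixed angular error measured at calibration, then
    serialize the corrected pose. Only these few numbers, not video,
    are transmitted to drive the remote avatar."""
    corrected = HeadPose(raw.yaw - calibration.yaw,
                         raw.pitch - calibration.pitch,
                         raw.roll - calibration.roll)
    return json.dumps({"t": timestamp, **asdict(corrected)})
```

The very compactness of such a message is the approach's limitation: a handful of angles cannot carry the facial detail needed for genuine eye contact.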
Finally, there have been prior art attempts to physically place light-sensitive camera elements between the pixels of a monitor, forming a camera-monitor hybrid. However, these approaches suffer from a geometry problem of their own. The photosensors take up space, and that space comes at the price of gaps in the display. Spreading the sensors out reduces the amount of light available to each sensor, making sufficient light gathering a difficult problem to solve. Moreover, even if this approach can be successfully implemented, which is by no means guaranteed, users would be required to purchase a new monitor, laptop, or mobile phone, and there would likely be significant trade-offs in video and/or camera quality.