Clarity of an image presented to a videoconferencing participant is an important aspect of videoconferencing and video-telephony systems. Attaining sufficient clarity is particularly challenging in group videoconferencing applications, in which more than one participant is present at one or both ends of the videoconferencing session. In such cases, the camera used to capture the participants is typically zoomed out so that it can capture all the participants of the group. However, zooming out diminishes the sizes of the participants as they appear in the captured image. In other words, the number of pixels dedicated to the most interesting parts of the image, i.e., the participants or other regions of interest (ROI), is reduced. As a result, when the image is sent to the far end, the participants within the image are seen less clearly by the far end participants.
For example, FIG. 1 shows a captured image 101 of a conference room. An image captured by a typical camera has only a finite number of pixels, and therefore finite resolution. The camera capturing the image 101 has been zoomed out in order to include the entire area of the conference room where conference participants are likely to sit, although the three actual participants 102-104 occupy only the center region of the captured image. Most of the area in the image 101 is occupied by objects that are not interesting or necessary as far as a specific videoconference with participants 102-104 is concerned. The interesting portions of the image 101 are the faces and torsos of the conference participants, e.g., 102, 103, and 104. Thus, the number of pixels capturing the ROI (the conference participants) is considerably smaller than the number of pixels capturing non-interesting regions of the image 101. When image 101 is shown to the far end participants in a typical videoconference according to the prior art, the far end participants can see the images of participants 102-104 with only limited clarity, because (a) the number of pixels in the image 101 is finite, (b) the videoconferencing system will typically reduce the number of pixels by down-sampling prior to transmission in order to reduce the bandwidth and computational resources required by the system, (c) the video encoding algorithms typically used are lossy, and will produce a decoded picture with less detail than the original picture, and (d) the image displayed to the far end participants is of finite size, so the image of each person may be so small that human visual acuity is able to resolve only a portion of the displayed details.
To elaborate, transmitting the image 101 to the far end can further degrade clarity. For example, compression, down-sampling, and similar operations may be carried out on an image to be transmitted to the far end in order to meet transmission bandwidth limits. FIG. 2 shows the image 101 of FIG. 1 down-sampled to QCIF resolution (176×144 pixels) for transmission to the far end. Down-sampling reduces the number of pixels for each participant to an even lower number than in the originally captured image.
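By way of illustration only, the effect of down-sampling on the pixels available to a single participant can be worked through numerically. The sketch below is not part of the described system. The capture resolution (704×576) and the size of the participant region (120×160 pixels) are assumptions chosen for convenient arithmetic, and the names Box and downsample_box are hypothetical. Only the QCIF target resolution is taken from the text above.

```python
from dataclasses import dataclass
from typing import Tuple

QCIF: Tuple[int, int] = (176, 144)  # (width, height), as stated in the text


@dataclass(frozen=True)
class Box:
    """Size of a region of interest in pixels (illustrative values only)."""
    width: int
    height: int

    @property
    def pixels(self) -> int:
        return self.width * self.height


def downsample_box(box: Box, src: Tuple[int, int], dst: Tuple[int, int]) -> Box:
    """Scale a region from the source resolution to the destination resolution."""
    sx = dst[0] / src[0]  # horizontal scale factor
    sy = dst[1] / src[1]  # vertical scale factor
    return Box(round(box.width * sx), round(box.height * sy))


# ASSUMPTIONS: neither value below comes from the source document.
capture = (704, 576)         # assumed capture resolution; scales to QCIF by exactly 1/4
participant = Box(120, 160)  # hypothetical face/torso region of one participant

scaled = downsample_box(participant, capture, QCIF)
print(participant.pixels)           # 19200
print(scaled.width, scaled.height)  # 30 40
print(scaled.pixels)                # 1200
```

Under these assumed numbers, the participant region shrinks from 19,200 pixels to 1,200 pixels, a sixteen-fold reduction, before any loss from lossy encoding is taken into account.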
One traditional solution is to use multiple cameras, where each camera captures only the face/torso of one participant, and to combine the individual captured images in a so-called “Hollywood Squares” fashion to form a composite image 106, as shown in FIG. 3. Thus, the majority of the pixels of image 106 are dedicated to the images of the participants. The regions of image 101 that contained less interesting objects and features are desirably absent from image 106. Yet, while the faces of individual participants are clear, the actual spatial relationship between the participants is lost in the arrangement of image 106. For example, when participants 102 and 103 turn their heads to make eye contact with each other, the resultant motion in image 106 will appear with participant 102 turning to his left and participant 103 turning to his right. Because the image of participant 104 is to the left of the image of participant 102, it will seem as if participant 102 is turning to converse with participant 104, instead of participant 103. This can be very disorienting to the far end participants, who are unaware of the relative positions of the near end participants.
Thus, it is desirable to have a technique that not only provides clearer images of the interesting regions of a captured image, but also maintains relative spatial arrangements of local participants in the captured image.