Video conferencing is an established method of collaboration between remotely located participants. A video image of a remote environment is presented on a local display, allowing a local user to see and talk to one or more remotely located participants.
One of the problems associated with video conferencing in the past is the lack of eye contact between participants. Typically, participants direct their attention to the display for communicative purposes, rather than to the recording camera that is positioned to capture a video image of the local participant. For example, a display associated with a local user displays a video image of a remote participant. The interest of the local user is focused primarily on the display, and the local user interacts with the remote participant by talking to and gazing at the image shown on the display.
Since the recording camera cannot physically be positioned exactly at the location of interest of the local user, which is typically the center of the display, the remote participant will not see a face-on view of the local user. The local user appears to be avoiding eye contact by gazing off in another direction. Moreover, the same problem exists at the display of the local user, who likewise views a video stream of the remote participant that is not face-on.
One approach to solving the problem of the lack of eye contact is to display a model of the facial features of a participant in a static synthetic environment. In this way, a model of the participant is created that is face-on. The model is created by substituting stored representations of various facial features of the participant for the actual state of those features. The substituted facial features can be created from sample images of the participant. For example, a mouth can be shown in various emotional states, such as a smile, or open to denote surprise. As the participant smiles and frowns in real time, a model is generated to reflect the smiles and frowns and is presented to the other participants.
However, the limitation of this approach is evident, especially when errors in interpreting the visual features of the participant are displayed. For example, if the mouth of a participant is actually smiling, but pursed lips are detected, then the model of the participant would display pursed lips. Although the accompanying audio stream and the other facial features may not correspond to the errant display of pursed lips, the model would not appear to be in error, since all of the facial features are reconstructed from actual images. A viewer of the model therefore would be unable to determine whether an error had occurred.
Moreover, because of current limitations of computing resources, only a few facial features are reconstructed for exhibiting emotion so that the model can be displayed in real time during a video conference. As such, the model may not appear realistic to the viewing participant. Also, since the model is reconstructed from sample images, the actual image of the participant is not transmitted to the viewing participant. As such, subtleties in emotion exhibited by the facial features cannot be represented by the model if the sample images do not contain the emotion, or if the reconstructed facial features do not encompass the facial feature of interest.
Another problem associated with previous implementations of video conferencing is the need for dedicated video conferencing facilities. Typically, all participants at one end (e.g., the remote or local end) of the video conference need to be present within the same shared camera field of view for presentation to the other participants. This requires all the participants at one end to meet in a dedicated facility. Since the equipment at the dedicated facility may be quite extensive and cost-prohibitive, not every person or company has immediate access to a dedicated video conferencing facility. As such, video conferences must be scheduled in advance and participants must travel to the video conferencing facility, which precludes impromptu video conferencing.
Also, since all the participants at one end of the video conference share an image of the remote scene, and there is only one audio channel for all participants, video conferencing in the past has been most appropriate for one person talking at a time while all the other participants listen. Simultaneous speakers, such as those holding side conversations, would interfere with each other over the audio channel. As such, all the participants are limited to participating in the main conversation of the video conference, which precludes any simultaneous or side conversations.
Therefore, prior art methods of video conferencing are unable to satisfactorily provide adequate eye-to-eye contact between remotely located participants. Moreover, prior art techniques are unable to realistically show the real-time emotional states of the participants in a video conference. Additionally, the inconvenience associated with the use of dedicated video conferencing facilities precludes impromptu video conferencing.