When conducting a video conference, visual signals are generated and transmitted from one end of the call to the other end(s) along with auditory signals, so that when one or more conference participants are speaking, the sound produced on the other end(s) should be played in synchronization with the video. Two kinds of discrepancy may exist between the audio and the video in a video conference: discrepancies in time and discrepancies in spatial congruency. Discrepancies in time between the audio and video streams lead to synchronization problems; for example, the vocal utterances (e.g., voices) of the speaking participants may not be synchronized with the movements of their mouths. Spatial congruency, on the other hand, describes how closely the sound field being played matches the visual scene being displayed. Alternatively, spatial congruency may define a degree of alignment between an auditory scene and a visual scene. The example embodiments described herein aim to adjust spatial congruency in a video conference so that the auditory scene and the visual scene are matched with each other, thereby presenting an immersive video conferencing experience for the participants on multiple ends.
Users need not be concerned about the above-described spatial congruency problem if the audio signal is in mono format, which is commonly adopted in most existing video conferencing systems. However, if at least two channels are employed (e.g., stereo), spatial incongruency may occur. Nowadays, sound can be captured by more than two microphones, transmitted in a multi-channel format such as 5.1 or 7.1 surround, and rendered and played through multiple transducers at the end user(s). In a typical conference environment, several participants surround a device that captures their voices, and each participant can be regarded as a single audio object that generates a series of audio signals when speaking.
As used herein, the term “audio object” refers to an individual audio element that exists for a defined duration in time in the sound field. An audio object may be dynamic or static. For example, a participant may walk around the audio capture device and the position of the corresponding audio object varies accordingly.
For video conferences and various other applications involving spatial congruency issues, incongruent auditory-visual rendition leads to an unnatural percept, which can degrade the conferencing experience. In general, an angular discrepancy of less than 5° can be regarded as acceptable because such a difference in angle is not significantly noticeable to most users. If the angular discrepancy is more than 20°, most users find it noticeably unpleasant.
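By way of illustration only, the angular thresholds mentioned above may be sketched as a simple check. The function names, the use of azimuth angles in degrees, and the "intermediate" label for discrepancies between 5° and 20° are assumptions made for this sketch; the text above does not characterize that middle range or prescribe how the angle is measured.

```python
ACCEPTABLE_BELOW_DEG = 5.0   # threshold stated in the text
UNPLEASANT_ABOVE_DEG = 20.0  # threshold stated in the text


def angular_discrepancy_deg(audio_azimuth_deg: float, visual_azimuth_deg: float) -> float:
    """Return the smallest absolute angle between two azimuths, in [0, 180] degrees.

    Assumption: both positions are expressed as azimuth angles in degrees.
    """
    diff = (audio_azimuth_deg - visual_azimuth_deg) % 360.0
    return min(diff, 360.0 - diff)


def classify_congruency(discrepancy_deg: float) -> str:
    """Map an angular discrepancy onto the ranges described in the text."""
    if discrepancy_deg < ACCEPTABLE_BELOW_DEG:
        return "acceptable"
    if discrepancy_deg > UNPLEASANT_ABOVE_DEG:
        return "unpleasant"
    # Assumption: the text does not label this range; the name is hypothetical.
    return "intermediate"


if __name__ == "__main__":
    # A talker heard at 10 degrees but displayed at 7 degrees.
    print(classify_congruency(angular_discrepancy_deg(10.0, 7.0)))
    # A talker heard at -30 degrees but displayed at 0 degrees.
    print(classify_congruency(angular_discrepancy_deg(-30.0, 0.0)))
```

Wrapping the difference modulo 360° keeps the result meaningful when the two azimuths lie on either side of the 0°/360° boundary.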
In view of the foregoing, there is a need in the art for a solution for adjusting the auditory scene to be aligned with the visual scene or adjusting the visual scene to be aligned with the auditory scene.