In recent years, multimedia telecommunication systems capable of capturing and rendering audio-visual scenes of people at different locations, such as systems that enable people to take part in video conferences, have drawn significant attention. This in turn has led to an interest in localizing and tracking people and their speaking activity for two primary reasons. First, with regard to media processing, determining a speaker's location can be useful for selecting a particular camera or steering a camera to record the speaker's movements, for enhancing the audio stream via microphone-array beamforming for, e.g., speech recognition, for providing accumulated information for person identification, and for recognizing location-based events, such as a presentation. Second, with regard to human interaction analysis, social psychology has highlighted the role of non-verbal behavior, such as facial expressions, in interactions, and the correlation between speaker turn patterns and aspects of the behavior of a group. Extracting cues to identify such multimodal behaviors requires reliable speaker localization and tracking capabilities.
However, typical systems for capturing audio-visual scenes rely on controlled environments that can be expensive to build because of their acoustic and/or lighting requirements. In uncontrolled environments, on the other hand, the quality of captured audio-visual scenes deteriorates dramatically and often hinders a system's ability to support seamless collaboration among people at different locations.
Thus, systems and methods that capture high-quality audio-visual scenes and extract useful localization and tracking information about speakers are desired.