1. Field of the Invention
The present invention relates to signal source localization, in particular an arrangement and method of spatially localizing active speakers in a video conference.
2. Discussion of the Background
Signal localization is used in several applications. Perhaps the best-known application is TV program production. For example, in debate programs, it is important for the viewer's experience and for intelligibility that the active camera is pointing at, and preferably zooming in on, the current speaker. However, this has traditionally been handled manually by a producer. In other applications where cameras and microphones are capturing the view and sound of a number of people, it might be impossible or undesirable to have a dedicated person to control the production.
One example of such an application is automatic camera pointing in video conferencing systems. A typical situation at an end-point in a video conference call is a meeting room with a number of participants sitting around a table watching the display device of the end-point, while a camera positioned near the display device is capturing a view of the meeting room. If there are many participants in the room, it may be difficult for those who are watching the view of the meeting room at the far end to determine who is speaking or to follow a discussion between several speakers. Thus, it would be preferable to localize the active speaker in the room, and automatically point and/or zoom the camera onto that participant. Automatically orienting and zooming a camera toward a given position within its reach is well known in the art, and will not be discussed in detail. The problem is to provide a sufficiently accurate localization of the active speaker, both in space and in time, in order to allow acceptable automatic video conference production.
Known audio source localization arrangements use a plurality of spatially spaced microphones, and are often based on the determination of a delay difference between the signals at the outputs of the microphones. If the positions of the microphones and the delay difference between the propagation paths from the source to the different microphones are known, the position of the source can be determined. If two microphones are used, it is possible to determine the direction with respect to the baseline between them. If three microphones are used, it becomes possible to determine a position of the source in a 2-D plane. If more than three microphones, not placed in a single plane, are used, it becomes possible to determine the position of a source in three dimensions.
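By way of illustration only, the two-microphone case may be sketched as follows. The sketch assumes a far-field source (a plane wavefront), a speed of sound of 343 m/s, and an angle measured from the normal to the baseline; the names direction_from_delay, tau and d are hypothetical and are not taken from any cited arrangement.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def direction_from_delay(tau, d, c=SPEED_OF_SOUND):
    """Return the source angle in radians, measured from the normal
    to the microphone baseline, under a far-field assumption.

    tau: delay difference between the two microphone signals, in seconds
    d:   spacing between the two microphones, in metres
    """
    # Far-field geometry: the path difference c * tau equals d * sin(angle).
    s = c * tau / d
    # Clamp to [-1, 1] so that measurement noise cannot break asin().
    s = max(-1.0, min(1.0, s))
    return math.asin(s)
```

A delay difference of zero corresponds to a source straight ahead of the pair, and a delay difference of d / c corresponds to a source on the extended baseline.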
One example of audio source localization is shown in U.S. Pat. No. 5,778,082. This patent teaches a method and a system using a pair of spatially separated microphones to obtain the direction or location of an audio source. By detecting the beginning of the respective signals of the microphones representing the sound of the same audio source, the time delay between the audio signals may be determined, and the distance and direction to the audio source may be calculated.
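The general idea of deriving a delay from signal onsets may be sketched as below. This is not the method of the cited patent; it is a simplified illustration that assumes a fixed amplitude threshold and sampled signals, and the names first_onset and onset_delay are hypothetical.

```python
def first_onset(samples, threshold):
    """Return the index of the first sample whose magnitude reaches
    the threshold, or None if no sample does."""
    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            return index
    return None


def onset_delay(signal_a, signal_b, threshold, sample_rate):
    """Return the onset time in signal_b minus the onset time in
    signal_a, in seconds, or None if either signal has no onset."""
    onset_a = first_onset(signal_a, threshold)
    onset_b = first_onset(signal_b, threshold)
    if onset_a is None or onset_b is None:
        return None
    return (onset_b - onset_a) / sample_rate
```

A positive result indicates that the sound reached the first microphone before the second.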
In these and other known solutions to audio localization, the microphones used for direction and distance calculations are placed close to the camera. The camera is usually placed on top of the screen, beyond the end of the conference table. At least some of the participants will therefore be seated at a long distance (r) from the microphone setup. This setup has some disadvantages, as discussed below.
Due to the long distance between the speakers and the microphone setup, the expected spread of direction angles is small, and the spread of sound arrival time differences is correspondingly small. This reduces the accuracy of the localization algorithm. At the same time, the long distance r demands high precision, since a small error in the estimated angle translates into a large error in the estimated position.
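The shrinking spread of arrival time differences can be illustrated numerically. The following sketch assumes a single microphone pair centred at the origin, two speakers seated symmetrically at a lateral offset on either side of the table axis, and a speed of sound of 343 m/s; here r is taken as the perpendicular distance from the baseline, and all names and values are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def tdoa(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Arrival time at mic_a minus arrival time at mic_b, in seconds."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c


def tdoa_spread(r, half_width, d, c=SPEED_OF_SOUND):
    """Difference in arrival time difference between two speakers seated
    at lateral offsets -half_width and +half_width, both at perpendicular
    distance r from a microphone pair with spacing d."""
    mic_a = (-d / 2.0, 0.0)
    mic_b = (d / 2.0, 0.0)
    left = tdoa((-half_width, r), mic_a, mic_b, c)
    right = tdoa((half_width, r), mic_a, mic_b, c)
    return abs(right - left)


near = tdoa_spread(r=1.0, half_width=0.5, d=0.2)
far = tdoa_spread(r=5.0, half_width=0.5, d=0.2)
print(f"spread at 1 m: {near * 1e3:.3f} ms, at 5 m: {far * 1e3:.3f} ms")
```

With these assumed values, the spread at r = 5 m is several times smaller than at r = 1 m, so the same timing error produces a correspondingly larger localization error.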
One way of increasing the arrival time differences is to increase the distance between the microphones, denoted d. However, prior art has shown that d cannot be increased too much, as the signals into the different microphones tend to become uncorrelated when d is too large. Prior art has shown that a distance d of 20-25 cm provides the best results.
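To give a sense of scale, the largest possible arrival time difference for a microphone pair is d / c, reached when the source lies on the extended baseline. The sketch below expresses this bound in sample periods; the 48 kHz sampling rate and the speed of sound of 343 m/s are assumptions, and the function name is hypothetical.

```python
def max_delay_samples(d, sample_rate, c=343.0):
    """Upper bound on the arrival time difference for a microphone
    spacing d (metres), expressed in sample periods at sample_rate (Hz)."""
    return d / c * sample_rate


for spacing in (0.20, 0.25):
    bound = max_delay_samples(spacing, 48000.0)
    print(f"d = {spacing:.2f} m: at most {bound:.1f} samples")
```

Under these assumptions, the whole range of possible directions maps onto a delay range of only about 28 to 35 sample periods on either side of zero for d between 20 cm and 25 cm.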
In particular, the calculation of the distance is prone to errors in traditional systems, as this distance is calculated using a minor angle difference between relatively closely spaced microphone pairs. Thus, this method assumes that the speaker is in the near field of the microphone system, which in many cases is a questionable assumption.
The level of the direct sound (which is the sound used for calculating the direction) is inversely proportional to the distance r. Due to the long distance between the speaker and the microphones, the signal from the speaker will be weak, and therefore sensitive to background noise and self noise of the microphone and electronics.
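The inverse proportionality stated above can be expressed in decibels. The sketch assumes free-field spherical spreading, so that sound pressure scales as 1/r, and a reference distance of 1 m; the function name is hypothetical.

```python
import math


def direct_sound_drop_db(r, r_ref=1.0):
    """Drop in direct-sound level, in dB, at distance r relative to
    the level at r_ref, assuming sound pressure proportional to 1/r."""
    return 20.0 * math.log10(r / r_ref)


for distance in (1.0, 2.0, 4.0, 8.0):
    print(f"r = {distance:.0f} m: {direct_sound_drop_db(distance):.1f} dB below the 1 m level")
```

Each doubling of the distance costs about 6 dB of direct-sound level, while background noise and the self noise of the microphone and electronics remain unchanged.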
Due to the long distance, reflections of the sound from the speaker may reach the microphone setup at almost as high a level as the direct sound. Therefore, incorrect and inaccurate localization decisions can be made.
These disadvantages will always be a hindrance, but can be compensated for by integrating the audio over a long timeframe. However, this again has the disadvantage of a slowly responding system, which is a typical weakness of existing audio tracking systems.