Signal localization is used in several applications. The most widely known application is perhaps TV program production. In televised debate programs, for example, it is important for the viewer's experience and intelligibility that the active camera is pointing at, and preferably zooming in on, the current speaker. However, this has traditionally been handled manually by a producer. In other applications where cameras and microphones are capturing the view and sound of a number of people, it might be impossible or undesirable to have a dedicated person to control the performance.
One example of such an application is automatic camera pointing in video conferencing systems. A typical situation at an end-point in a video conference call is a meeting room with a number of participants sitting around a table watching the display device of the end-point, while a camera positioned near the display device captures a view of the meeting room. If there are many participants in the room, it may be difficult for those watching the view of the meeting room at the far end to determine who is speaking or to follow the speaker's argument. Thus, it would be preferable to localize the active speaker in the room, and automatically point and/or zoom the camera onto that participant. Automatic orienting and zooming of a camera, given a certain position within reach of the camera, is well known in the art and will not be discussed in detail. The problem is to provide a sufficiently accurate localization of the active speaker, both in space and in time, to allow acceptable automatic video conference production.
Known audio source localization arrangements use a plurality of spatially spaced microphones, and are often based on determining a delay difference between the signals at the outputs of the microphones. If the positions of the microphones and the delay differences between the propagation paths from the source to the different microphones are known, the position of the source can be determined.
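As an illustration of the delay-difference principle, the following is a minimal sketch for a single microphone pair. It assumes a far-field source (plane wavefront), a speed of sound of 343 m/s, and a brute-force cross-correlation as the delay estimator. The function names, the test pulse, the 48 kHz sampling rate, and the 0.2 m spacing are hypothetical choices made for the example and are not taken from any of the arrangements discussed here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C


def estimate_delay_samples(x, y, max_lag):
    """Return the lag (in samples) by which y trails x.

    Brute-force cross-correlation: the lag with the highest
    correlation score is taken as the delay difference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, xi in enumerate(x):
            j = i + lag
            if 0 <= j < len(y):
                score += xi * y[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay(delay_s, spacing_m):
    """Far-field bearing in degrees relative to broadside.

    Uses sin(theta) = c * delay / spacing. The ratio is clamped
    to [-1, 1] so that measurement noise cannot break asin().
    """
    ratio = SPEED_OF_SOUND * delay_s / spacing_m
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# Hypothetical test signal: a short pulse, and the same pulse 3 samples later.
pulse = [0.0] * 10 + [1.0, 0.5, -0.3] + [0.0] * 10
delayed = [0.0] * 3 + pulse[:-3]

lag = estimate_delay_samples(pulse, delayed, max_lag=5)
angle = bearing_from_delay(lag / 48000.0, spacing_m=0.2)
```

A positive lag means that the sound reached the first microphone before the second, so the source lies on the first microphone's side of broadside. A single pair yields only a bearing; estimating a position requires delay differences from additional microphones.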
One example of audio source localization is shown in U.S. Pat. No. 5,778,082, which is incorporated herein by reference. This patent teaches a method and a system using a pair of spatially separated microphones to obtain the direction or location of an audio source. By detecting the beginning of the respective microphone signals representing the sound of the same audio source, the time delay between the audio signals may be determined, and the distance and direction to the audio source may be calculated.
If three microphones are used, it becomes possible to determine the position of the source in a 2-D plane. If more than three microphones, not placed in a single plane, are used, it is possible to determine the position of a source in three dimensions. A common assembly places one array of microphones in the horizontal direction below the camera, and one single microphone above the camera. This allows both horizontal and vertical source localization. The microphone mounted above the camera may be very dominant visually, may be exposed to potential damage, and may introduce extra manufacturing costs. A solution where a microphone is integrated into the top of the camera itself is therefore preferable. This allows for vertical localization of a source without having a microphone mounted on, e.g., a rod above the camera. The microphone is invisible and well protected, allows a less intrusive design, and is more visually pleasing. As indicated above, this preferred arrangement has some disadvantages due to the smaller distances between the microphones, which imply less accuracy in the signal source tracking.
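The loss of accuracy with smaller microphone distances can be illustrated with a rough sketch. It assumes a far-field source, a speed of sound of 343 m/s, and a delay estimate that is resolved only to whole samples. Practical systems often interpolate to sub-sample resolution, so the figures are only indicative. The function name, the spacings, and the 48 kHz sampling rate are hypothetical values chosen for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C


def broadside_angle_step_deg(spacing_m, sample_rate_hz):
    """Bearing change caused by a one-sample delay change near broadside.

    From sin(theta) = c * delay / spacing with delay = 1 / sample_rate.
    If even one sample exceeds the maximum possible delay, the pair
    cannot resolve direction at whole-sample precision; return 90.
    """
    ratio = SPEED_OF_SOUND / (sample_rate_hz * spacing_m)
    if ratio >= 1.0:
        return 90.0
    return math.degrees(math.asin(ratio))


# Hypothetical spacings: a wide pair versus a compact, camera-integrated pair.
wide_step = broadside_angle_step_deg(spacing_m=0.50, sample_rate_hz=48000)
compact_step = broadside_angle_step_deg(spacing_m=0.05, sample_rate_hz=48000)
```

Under these assumptions the compact pair has a much coarser angular step than the wide pair, which is the accuracy penalty referred to above.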
In addition, sound quality is an important issue for source tracking applications, as good signal quality is also necessary for tracking accuracy. When outputs from several microphones are combined, a precise and repeatable frequency response is required in both amplitude and phase, over a wide band. Matching requirements can be tighter than 1 dB in amplitude and a few degrees in phase. Such matching is not found in normal transducer production, not even with high-cost microphones. Therefore, manufacturers have to match transducers by measurement and sorting, or product designers must incorporate some form of measurement and calibration of individual microphones. Both alternatives are costly.
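To see why a few degrees of phase mismatch matter for localization, a phase error at a given frequency can be converted into the equivalent time-delay error using delay = phase / (360 x frequency). The sketch below does this conversion; the function name and the example values (3 degrees at 1 kHz) are hypothetical and serve only as an illustration.

```python
def phase_mismatch_to_delay_s(phase_deg, freq_hz):
    """Time-delay error equivalent to a phase mismatch at one frequency.

    A full cycle (360 degrees) corresponds to one period, 1 / freq_hz.
    """
    return phase_deg / 360.0 / freq_hz


# Hypothetical example: 3 degrees of mismatch at 1 kHz.
delay_error_s = phase_mismatch_to_delay_s(3.0, 1000.0)
```

Because localization relies on delay differences between microphones, a delay error of this kind in the transducers themselves translates directly into a bearing error.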
Another alternative that has recently emerged is to utilize a newer microphone technology, MEMS (Micro Electro Mechanical Systems). MEMS microphones are produced using silicon wafer technology, and the process yields microphones whose variation in phase response is significantly smaller than that of regular ECMs (Electret Condenser Microphones). They are therefore well suited for applications where good phase matching of microphones is required.
Microphone self-noise is, however, a real problem, especially with cheap ECMs. MEMS microphones have even higher self-noise than standard ECMs, which is a problem for sound pickup and localization systems, especially at high frequencies.
Background noise in rooms typically has decreasing power with increasing frequency, while many cheap microphone types have constant or increasing self-noise power with increasing frequency. Speech signals have very low power at high frequencies, but the high frequency content is still important for natural sounding speech recordings and also provides very effective cues for source localization algorithms. At high frequencies, the microphone self-noise is the dominant noise contributor, and this limits the signal-to-noise ratio (SNR) when capturing speech in rooms. This is especially true when microphones cannot be placed close to the persons talking, and it limits the potential of localization and tracking algorithms. Analysis algorithms are disturbed because high frequency information in speech is masked by microphone self-noise.
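The effect can be illustrated by summing the two uncorrelated noise contributions on a power basis and comparing the resulting SNR in a low and a high frequency band. This is only a sketch: the band levels below are hypothetical numbers chosen to mimic the trends described (room noise falling with frequency, self-noise roughly constant or rising, speech weak at high frequencies) and are not measurements.

```python
import math


def power_sum_db(levels_db):
    """Combine uncorrelated noise levels (in dB) by adding their powers."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


def snr_db(signal_db, room_noise_db, self_noise_db):
    """SNR in dB against the combined room noise and microphone self-noise."""
    return signal_db - power_sum_db([room_noise_db, self_noise_db])


# Hypothetical band levels in dB: (speech, room noise, microphone self-noise).
bands = {
    "low": (60.0, 35.0, 15.0),   # room noise dominates the noise floor
    "high": (30.0, 5.0, 20.0),   # self-noise dominates the noise floor
}

snr_by_band = {name: snr_db(*levels) for name, levels in bands.items()}
```

With these assumed levels, the noise floor in the high band is set almost entirely by the self-noise term, so lowering the room noise further would not improve the high frequency SNR.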