The signal recorded at a microphone includes the portion of the signal traveling directly (the “direct” part of the signal) from the target source $s$, as well as delayed and possibly modified copies of the signal that reach the microphone after interaction with the environment. The recorded signal is also contaminated with noise $n$ and with sound created by other “distracter” sources $t_i$. In many applications, it is desired to preferentially extract the target source signal $s$ from the recording while suppressing both noise and distracters. The influence of the room or environment on the target and distracter sources may be modeled via individual source-to-microphone Room Impulse Responses (RIR) $h$. According to the RIR model, the received signal at a microphone $m_j$ in an $N$-element array may be characterized as
$$m_j = s * h_{sj} + \sum_i t_i * h_{t_i j} + n_j, \qquad j = 1, \ldots, N, \qquad \text{(Eq. 1)}$$
where the sign “*” denotes a convolution operation applied to the affected functions in Eq. 1.
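As an illustration of the model in Eq. 1, the following is a minimal sketch for a single microphone $j$, assuming discrete-time signals held in NumPy arrays. The function name `simulate_microphone`, its parameters, and the toy signal values are hypothetical and serve only to make the terms of the equation concrete.

```python
import numpy as np

def simulate_microphone(s, h_s, distracters, h_t, noise_std=0.0, rng=None):
    """Return m_j = s * h_sj + sum_i t_i * h_tij + n_j for one microphone j."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Target term: the source convolved with its source-to-microphone RIR.
    parts = [np.convolve(s, h_s)]
    # Distracter terms: each distracter is convolved with its own RIR.
    parts += [np.convolve(t, h) for t, h in zip(distracters, h_t)]
    # Zero-pad every term to a common length before summing.
    length = max(len(p) for p in parts)
    m = np.zeros(length)
    for p in parts:
        m[:len(p)] += p
    # Additive noise term n_j (assumed white Gaussian in this sketch).
    return m + noise_std * rng.standard_normal(length)

# Toy example: an impulsive target, one echo in its RIR, one distracter.
s = np.array([1.0, 0.0, 0.0, 0.0])
h_s = np.array([1.0, 0.0, 0.5])   # direct path plus one reflection
t = np.array([0.0, 2.0])
h_t = np.array([0.5])
m = simulate_microphone(s, h_s, [t], [h_t])
```

With `noise_std=0.0` the output contains only the target and distracter terms, which makes it easy to check each term of Eq. 1 by hand.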
A number of algorithms exist which focus only on the “direct” part of the signal, and the distracters are often considered to be just a part of the noise term. In this case
$$m_j = s + s * h'_{sj} + N_j, \qquad j = 1, \ldots, N, \qquad \text{(Eq. 2)}$$
where $h'_{sj}$ is $h_{sj}$ with the direct path removed and $N_j$ is the overall noise in the room.
In this situation, the ability to recover $s$ depends upon the structure of $h'_{sj}$ and the amount of noise. If the “direct” path is a substantial fraction of the received signal, then $s$ may be recovered. When the environment is very reverberant, $s$ is likely to be highly correlated with $s * h'_{sj}$, and the recovery of $s$ is difficult.
In order to alleviate this difficulty, a room filter may be estimated and applied to the signal. A linear time-domain filter known as the Room Impulse Response (RIR) may characterize the effect of the environment on the signal. The RIR may be computed using simple geometric computations, advanced ray-tracing techniques, or numerical methods for complicated scatterer shapes. However, these computations are extremely expensive to implement.
Additionally, inverting the RIR to derive a deconvolution filter is a numerically unstable procedure. To obtain useful results, the RIR must be computed with high accuracy, which is practically impossible to achieve in realistic environments.
Furthermore, any source displacement by as little as a few centimeters requires recomputation of the RIR in order to keep higher frequencies coherent. The RIR recomputation requires three-dimensional position information, which is difficult to obtain in realistic environments.
Another approach, which aims to enhance the desired signal in a digital mixture through spatial sound processing (beamforming) with a microphone array, is presented in B. D. Van Veen, et al., “Beamforming: A versatile approach to spatial filtering”, IEEE ASSP Magazine 1988, v. 5, pp. 4-24. In a microphone array, several microphones are placed at a number of locations in space, and the signals arriving at the microphone array are filtered and summed so that the signals originating from a desired location (e.g., a signal source) are amplified relative to the rest of the signal.
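The filter-and-sum idea can be sketched in its simplest form, delay-and-sum, where each filter is a pure time shift. This is a minimal illustration, not the method of the cited article; it assumes the per-microphone propagation delays are already known as non-negative integer sample counts smaller than the signal length, and the name `delay_and_sum` is hypothetical.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Advance each channel by its integer sample delay, then average."""
    signals = np.asarray(signals, dtype=float)
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for x, d in zip(signals, delays):
        # Shift channel earlier by d samples (zero-filled, no wraparound).
        out[:n_samples - d] += x[d:]
    return out / n_mics

# Toy example: the same pulse reaches two microphones 2 and 5 samples late.
n = 8
mic0 = np.zeros(n); mic0[2] = 1.0
mic1 = np.zeros(n); mic1[5] = 1.0
steered = delay_and_sum([mic0, mic1], [2, 5])    # delays match the source
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # delays do not match
```

When the delays match the source location the pulses add coherently; when they do not, the pulses remain spread out and the peak is lower.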
In the case of the microphone array, the RIR is specific to each microphone in the array. Beamforming usually assumes that the location of interest is given and requires recomputation of the filters when the location changes. Some approaches also adaptively modify the filters in order to suppress unwanted interference, where the interference is broadly defined as signals uncorrelated with the source signal. This approach is ineffective in removing the reverberant parts of the signal, because reverberation consists of delayed copies of the source signal and is therefore correlated with it.
Because exact tracking of the target source in a three-dimensional environment is complicated, the reverberation patterns imposed by the environment are undesirable in many applications, such as source localization and speech recognition. Matched Filter Array (MFA) processing is an example of how the reverberation may instead be used constructively.
MFA processing is a combination of beamforming and RIR deconvolution. MFA may be considered as beamforming aimed not only at the sound source itself but also at its reflections. In order to perform MFA processing, knowledge of the RIR for each microphone in the array is necessary. The RIR may be either computed analytically using a room model and source/receiver positions, or measured in the actual environment where the beamforming is to be applied.
An MFA analog of simple delay-and-sum beamforming is obtained by truncating and inverting the RIR and inserting a fixed time delay to make the resulting filter causal. In a simulated multi-path environment, the Signal-to-Noise Ratio (SNR) of the beamformer remains independent of the number of propagation paths, as these are compensated automatically by inverse filtering. However, accurate knowledge of the RIR is still necessary for processing, as the MFA performance degrades quickly with RIR inaccuracies caused by uncertainty in the source position.
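A minimal sketch of this construction follows. It rests on an assumption that should be stated plainly: “inverting” is read here as time reversal of the truncated RIR, which is the usual matched-filter interpretation, and the text above does not confirm this reading. The function name `matched_filter_array` and the toy RIRs are hypothetical, and all microphone signals are assumed to have equal length.

```python
import numpy as np

def matched_filter_array(mics, rirs, length):
    """Sum over channels of m_j convolved with a truncated, time-reversed RIR."""
    out = None
    for m, h in zip(mics, rirs):
        # Truncate the RIR to `length` taps (zero-pad if it is shorter).
        h = np.asarray(h, dtype=float)[:length]
        h = np.pad(h, (0, length - len(h)))
        # Time reversal; stored this way the filter is causal and adds a
        # fixed delay of length - 1 samples.
        g = h[::-1]
        y = np.convolve(m, g)
        out = y if out is None else out + y
    return out

# Toy example: an impulsive source observed through two different RIRs.
s = np.array([1.0, 0.0, 0.0, 0.0])
rirs = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.0])]
mics = [np.convolve(s, h) for h in rirs]
y = matched_filter_array(mics, rirs, length=3)
```

In this toy case the direct path and the reflection both contribute to a single output peak located at the fixed delay of `length - 1` samples, which is the sense in which reflections are used constructively.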
Microphone arrays, such as spherical microphone arrays, provide an opportunity to study a complete spatial characteristic of the sound received at a particular location. Over the past years, there have been several publications that report the use of spherical microphone arrays (e.g., J. Meyer, et al., “A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield,” Proc. IEEE ICASSP, 2002; B. Rafaely, “Analysis and Design of Spherical Microphone Arrays,” IEEE Trans. Speech and Audio Proc. 13: 135-143, 2005; Z. Li et al., “Flexible layout and optimal cancellation of the orthonormality error for spherical microphone arrays,” Proc. IEEE ICASSP 2004). Such arrays are seen by some researchers as a means to capture a representation of the sound field in the vicinity of the array, or as a means to digitally beamform sound from different directions using the array with a relatively high order beampattern.
In particular, it is possible to create audio images using the spherical array. This may be done by digitally “steering” the beamformer at many combinations of elevation and azimuth angles (θ,φ) in the spherical coordinate system and representing the beamformer output power as an image. An audio camera using microphone arrays for real-time capture of audio images, and a method for jointly processing the audio images with video images, are described in Patent Application Publication No. 2009/0028347 authored by Duraiswami, et al. The subject matter of that document overlaps with A. E. O'Donovan, et al., “Microphone arrays as generalized cameras for integrated audio visual processing” (Proc. IEEE CVPR 2007), which shows that the images created are, like regular visual camera images, “central-projection” images. In the same paper, a technique for joint calibration of audio and video cameras was suggested, and applications of the joint analysis of audio and visual images were introduced.
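The steering procedure can be sketched for a single frequency bin as follows. This is a simplified illustration under several assumptions that are not taken from the cited documents: a free-field plane-wave model with plain phase-alignment (delay-and-sum) weights rather than a spherical-harmonic beamformer, θ treated as the polar angle measured from the z axis, and hypothetical names `unit_vector` and `audio_image`.

```python
import numpy as np

def unit_vector(theta, phi):
    """Unit direction for polar angle theta and azimuth phi (radians)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def audio_image(x, mic_pos, k, thetas, phis):
    """Beamformer output power on a (theta, phi) grid for one frequency bin.

    x       : complex microphone amplitudes at this frequency, shape (N,)
    mic_pos : microphone coordinates in meters, shape (N, 3)
    k       : wavenumber 2*pi*f/c in rad/m
    """
    image = np.zeros((len(thetas), len(phis)))
    for a, theta in enumerate(thetas):
        for b, phi in enumerate(phis):
            # Plane-wave steering vector for the look direction.
            steer = np.exp(1j * k * (mic_pos @ unit_vector(theta, phi)))
            # Phase-align and average the channels, then take the power.
            image[a, b] = np.abs(np.vdot(steer, x) / len(x)) ** 2
    return image

# Toy example: 8 microphones at random positions, one plane wave at 1 kHz.
rng = np.random.default_rng(0)
mic_pos = 0.1 * rng.standard_normal((8, 3))
k = 2 * np.pi * 1000.0 / 343.0
thetas = np.linspace(0.2, np.pi - 0.2, 7)
phis = np.linspace(0.0, 2 * np.pi, 12, endpoint=False)
x = np.exp(1j * k * (mic_pos @ unit_vector(thetas[3], phis[4])))
image = audio_image(x, mic_pos, k, thetas, phis)
```

Each pixel of `image` corresponds to one look direction, so the pixel at the true direction of arrival reaches the maximum possible normalized power.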
In addition, in A. E. O'Donovan, et al., “Real time capture of Audio images and their use with video,” Proc. IEEE WASPAA, 2007, it was shown how the audio cameras operate at frame-rate by using the parallel nature of the computations, efficient factorizations, and the availability of extreme amounts of processing power in modern graphical processors.
Also, in A. E. O'Donovan, et al., “Imaging Concert Hall acoustics using visual and audio cameras,” Proc. IEEE ICASSP, 2008, it was shown how the audio camera, with its output mapped to a visual camera image, could be used to analyze the reverberant structure of concert halls.
Therefore, it would be desirable to provide a computationally efficient algorithm that computes the room impulse response automatically, and in real time, by processing audio images of signals that are acquired by the spherical audio camera and co-registered with the space of interest.