The present invention relates to audio signal processing and, in particular, to an apparatus and method for identifying a loudspeaker-enclosure-microphone system.
Spatial audio reproduction technologies are becoming increasingly important. Emerging technologies such as wave field synthesis (WFS) (see [1]) or higher-order Ambisonics (HOA) (see [2]) aim at creating or reproducing acoustic wave fields that provide a perfect spatial impression of the desired acoustic scene in an extended listening area. Reproduction technologies like WFS or HOA provide a high-quality spatial impression to the listener by utilizing a large number of reproduction channels. To this end, loudspeaker arrays with dozens to hundreds of elements are typically used. Complementing such reproduction systems with a spatial recording system yields a more immersive user experience, opens up new fields of application such as immersive telepresence and natural acoustic human/machine interaction, and can improve the reproduction quality. The combination of the loudspeaker array, the enclosing room and the microphone array is referred to as a loudspeaker-enclosure-microphone system and, in many application scenarios, is identified by observing the present loudspeaker and microphone signals. As an example, the local acoustic scene in a room is often recorded while another acoustic scene is played back in the same room by a reproduction system.
However, the desired microphone signals of the local acoustic scene cannot be observed without the echo of the loudspeakers in such scenarios. In a teleconference, the resulting signals would annoy the far-end party [3], while a speech recognizer in a voice-based human/machine front end will generally exhibit poor recognition rates [4]. Acoustic echo cancellation (AEC) is commonly used to remove the unwanted loudspeaker echo from the recorded microphone signals while preserving the desired signals of the local acoustic scene without quality degradation. To this end, the loudspeaker-enclosure-microphone system (LEMS) is modeled by an adaptive filter that produces an estimate of the loudspeaker echoes contained in the microphone signals; this estimate is then subtracted from the actual microphone signals. This task comprises an identification of the LEMS, ideally leading to a unique solution. In the following, the term LEMS refers to a MIMO LEMS (Multiple-Input Multiple-Output LEMS).
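The echo-cancellation principle described above can be illustrated with a minimal sketch. It is deliberately reduced to a single loudspeaker and a single microphone, and it uses the normalized least-mean-squares (NLMS) update as a stand-in adaptive algorithm; the cited references do not prescribe this choice. The function name `nlms_echo_canceller`, the impulse response `h_true`, the step size and the signal length are hypothetical values chosen only for illustration.

```python
import random

def nlms_echo_canceller(x, d, num_taps, step=0.5, eps=1e-8):
    """Adapt an FIR model of the echo path and subtract its output from d.

    x: loudspeaker samples, d: microphone samples (echo only in this sketch).
    Returns the echo-compensated signal and the final filter estimate.
    """
    w = [0.0] * num_taps      # adaptive filter coefficients (echo path estimate)
    buf = [0.0] * num_taps    # most recent loudspeaker samples, newest first
    e = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # estimated echo
        err = dn - y                                  # microphone minus echo estimate
        norm = sum(b * b for b in buf) + eps          # input power for normalization
        w = [wi + step * err * bi / norm for wi, bi in zip(w, buf)]
        e.append(err)
    return e, w

# Hypothetical echo path and white-noise loudspeaker signal (assumptions).
random.seed(0)
h_true = [0.6, -0.3, 0.1]
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
d = [sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
     for n in range(len(x))]

e, w = nlms_echo_canceller(x, d, num_taps=len(h_true))
residual_power = sum(v * v for v in e[-100:]) / 100
```

With a single, spectrally white loudspeaker signal the estimate `w` approaches `h_true` and the residual echo vanishes. The difficulties discussed in the following paragraphs arise only once several mutually correlated loudspeaker signals are involved.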
AEC is significantly more challenging in the case of multichannel (MC) reproduction than in the single-channel case, because the nonuniqueness problem [5] will generally occur: Due to the strong cross-correlation between the loudspeaker signals (e.g., those for the left and the right channel in a stereo setup), the identification problem is ill-conditioned and it may not be possible to uniquely identify the impulse responses of the corresponding LEMSs [6]. The system identified instead is only one of infinitely many solutions defined by the correlation properties of the loudspeaker signals. Therefore, the true LEMS is only incompletely identified. The nonuniqueness problem is already known from stereophonic AEC (see, e.g., [6]) and becomes severe for massive multichannel reproduction systems such as wave field synthesis systems.
An incompletely identified system still describes the behavior of the true LEMS for the present loudspeaker signals and may therefore be used for different adaptive filtering applications, although the identified impulse responses may differ from the true impulse responses. In the case of AEC, the obtained impulse responses describe the LEMS sufficiently well to significantly suppress the loudspeaker echo.
However, when the cross-correlation properties of the loudspeaker signals change, this is no longer true and the behavior of systems relying on adaptive filters may in fact become uncontrollable. A breakdown of the echo cancellation performance is the typical consequence of such a change. This lack of robustness constitutes a major obstacle for the application of multichannel AEC (MCAEC). Moreover, other applications, such as listening room equalization (also called listen room equalization) or active noise cancellation (also called active noise control), also rely on a system identification and are strongly affected in a similar way.
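The nonuniqueness problem and the breakdown after a change of the cross-correlation can be reproduced with a small stereo example. The sketch below assumes that both loudspeaker signals are filtered versions of one source signal; all sequences (`s`, `g1`, `g2`, `h1`, `h2`) and the scalar `c` are hypothetical integer values, chosen so that the comparison is exact. Adding `c*g2` to the first impulse response and subtracting `c*g1` from the second yields a different system that produces exactly the same microphone signal.

```python
def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def add(a, b):
    """Element-wise sum, zero-padding the shorter sequence."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [p + q for p, q in zip(a, b)]

def scale(a, c):
    """Multiply every element of a by the scalar c."""
    return [c * v for v in a]

def mic_signal(s, g1, g2, h1, h2):
    """Microphone signal for one source s rendered over two loudspeakers."""
    x1 = convolve(s, g1)      # loudspeaker signal 1
    x2 = convolve(s, g2)      # loudspeaker signal 2
    return add(convolve(x1, h1), convolve(x2, h2))

# Hypothetical source, rendering filters and true impulse responses.
s = [1, -2, 3, 0, 4, -1]
g1, g2 = [1, 2], [3, -1]
h1, h2 = [5, 1, 0], [2, -3, 0]

# A different system that explains the same observations (c is arbitrary).
c = 7
h1_alt = add(h1, scale(g2, c))
h2_alt = add(h2, scale(g1, -c))

d_true = mic_signal(s, g1, g2, h1, h2)
d_alt = mic_signal(s, g1, g2, h1_alt, h2_alt)

# The rendering filter of loudspeaker 1 changes (new cross-correlation).
g1_new = [2, 1]
d_true_new = mic_signal(s, g1_new, g2, h1, h2)
d_alt_new = mic_signal(s, g1_new, g2, h1_alt, h2_alt)
residual = max(abs(p - q) for p, q in zip(d_true_new, d_alt_new))
```

Before the change, `d_true` and `d_alt` are identical, so an adaptive filter has no means of preferring the true pair over the alternative one. After the change, the alternative pair leaves a nonzero residual echo, which corresponds to the breakdown described above.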
To increase robustness under these conditions, a common choice is to alter the loudspeaker signals so that they are decorrelated and the true LEMS can be uniquely identified.
For this purpose, three options are known: adding mutually independent noise signals to the loudspeaker signals [5,7,8], applying a different nonlinear preprocessing to each loudspeaker signal [6,9], or applying a differently time-varying filter to each loudspeaker signal [10,11]. Although perfect solutions are unknown, a time-varying phase modulation has been shown to be applicable even to high-quality audio [11]. While the mentioned techniques should ideally not impair the perceived sound quality, applying them to the mentioned reproduction techniques might not be an optimum choice: As the loudspeaker signals for WFS and HOA are analytically determined, time-varying filtering might significantly distort the reproduced wave field, and when aiming at high-quality audio reproduction, a listener will probably not accept the addition of noise signals or nonlinear preprocessing.
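Two of the three options can be sketched in a few lines. The example assumes the worst case of two identical loudspeaker signals and measures the remaining inter-channel correlation coefficient. The half-wave rectifier nonlinearity used here is one commonly discussed form of nonlinear preprocessing and is an assumption, not necessarily the form used in [6,9]; the function names, the noise level `noise_std` and the parameter `alpha` are hypothetical.

```python
import math
import random

def correlation(a, b):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def add_independent_noise(x, noise_std, rng):
    """Option 1: add a noise signal that is independent for each channel."""
    return [v + rng.gauss(0.0, noise_std) for v in x]

def half_wave(x, alpha, positive):
    """Option 2: add a scaled positive or negative half-wave of the signal."""
    sign = 1.0 if positive else -1.0
    return [v + alpha * (v + sign * abs(v)) / 2.0 for v in x]

rng = random.Random(1)
s = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x1, x2 = list(s), list(s)          # fully correlated loudspeaker signals

corr_before = correlation(x1, x2)
corr_noise = correlation(add_independent_noise(x1, 0.3, rng),
                         add_independent_noise(x2, 0.3, rng))
corr_nonlinear = correlation(half_wave(x1, 0.5, positive=True),
                             half_wave(x2, 0.5, positive=False))
```

Both modifications lower the correlation coefficient below one, but they do so by changing the signals that are actually reproduced, which is the drawback for WFS and HOA noted above.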
There might be scenarios where an alteration of the loudspeaker signals is unwanted or impractical. An example is given by WFS, where the loudspeaker signals are determined according to the underlying theory and a deviation in phase would distort the reproduced wave field. Another example is the extension of reproduction systems, where the loudspeaker signals are observable, but cannot be altered. However, in such cases it is still possible to mitigate the consequences of the nonuniqueness problem by heuristic approaches to improve the system description. Such heuristics can be based on knowledge about the transducer positions and the resulting impulse responses of the LEMS. For a stereophonic AEC in a symmetric array setup this was proposed by Shimauchi et al. [12], assuming that the symmetric array setup results in a symmetry of the impulse responses for the corresponding loudspeaker-to-microphone paths.
Even when no alteration of the loudspeaker signals is allowed, it is still possible to improve the system description when the nonuniqueness problem occurs, although this possibility has barely been investigated in the past. To this end, knowledge of the LEMS geometry can be used to derive additional constraints for choosing an improved solution for the system description in a heuristic sense. One such approach was presented in [12], where the symmetry of a stereophonic array setup was exploited accordingly.
However, in [12] no solution is presented for systems with large numbers of loudspeakers and microphones, such as loudspeaker-enclosure-microphone systems.
Wave-domain adaptive filtering (WDAF) was proposed by Buchner et al. in 2004 for various adaptive filtering tasks in acoustic signal processing, including multichannel acoustic echo cancellation (MCAEC) [13], multichannel listening room equalization [27] and multichannel active noise control [28]. In 2008, Buchner and Spors published a formulation of the generalized frequency-domain adaptive filtering (GFDAF) algorithm [15] with application to MCAEC [14] for use with WDAF. That formulation, however, disregards the nonuniqueness problem [15].