The present invention relates to an apparatus and method for multichannel direct-ambient decomposition for audio signal processing.
Audio signal processing is becoming increasingly important. In this field, the separation of sound signals into direct and ambient sound signals plays an important role.
In general, acoustic sounds consist of a mixture of direct sounds and ambient (or diffuse) sounds. Direct sounds are emitted by sound sources, e.g. a musical instrument, a vocalist or a loudspeaker, and arrive on the shortest possible path at the receiver, e.g. the listener's ear entrance or microphone.
When listening to a direct sound, it is perceived as coming from the direction of the sound source. The relevant auditory cues for the localization and for other spatial sound properties are interaural level difference, interaural time difference and interaural coherence. Direct sound waves evoking identical interaural level difference and interaural time difference are perceived as coming from the same direction. In the absence of diffuse sound, the signals reaching the left and the right ear or any other multitude of sensors are coherent.
Ambient sounds, in contrast, are emitted by many spaced sound sources or sound reflecting boundaries contributing to the same ambient sound. When a sound wave reaches a wall in a room, a portion of it is reflected, and the superposition of all reflections in a room, the reverberation, is a prominent example of ambient sound. Other examples are audience sounds (e.g. applause), environmental sounds (e.g. rain), and other background sounds (e.g. babble noise). Ambient sounds are perceived as being diffuse and not locatable, and evoke an impression of envelopment (of being “immersed in sound”) in the listener. When capturing an ambient sound field using a multitude of spaced sensors, the recorded signals are at least partially incoherent.
Various applications of sound post-production and reproduction benefit from a decomposition of audio signals into direct signal components and ambient signal components. The main challenge for such signal processing is to achieve high separation while maintaining high sound quality for an arbitrary number of input channel signals and for all possible input signal characteristics. Direct-ambient decomposition (DAD), i.e. the decomposition of audio signals into direct signal components and ambient signal components, enables the separate reproduction or modification of the signal components, which is for example desired for the upmixing of audio signals.
The term upmixing refers to the process of creating a signal with P channels given an input signal with N channels where P>N. Its main application is the reproduction of audio signals using surround sound setups having more channels than available in the input signal. Reproducing the content by using advanced signal processing algorithms enables the listener to use all available channels of the multichannel sound reproduction setup. Such processing may decompose the input signal into meaningful signal components (e.g. based on their perceived position in the stereo image, direct sounds versus ambient sounds, single instruments) or into signals where these signal components are attenuated or boosted.
Two concepts of upmixing are widely known.
1. Guided upmix: upmixing with additional information guiding the upmix process. The additional information may be either “encoded” in a specific way in the input signal or may be stored additionally.
2. Unguided upmix: the output signal is obtained from the audio input signal exclusively without any additional information.
Advanced upmixing methods can be further categorized with respect to the positioning of direct and ambient signals. A distinction is made between the “direct/ambient-approach” and the “In-the-band”-approach. The core component of direct/ambience-based techniques is the extraction of an ambient signal which is fed e.g. into the rear channels or the height channels of a multi-channel surround sound setup. The reproduction of ambience using the rear or height channels evokes an impression of envelopment (being “immersed in sound”) in the listener. Additionally, the direct sound sources can be distributed among the front channels according to their perceived position in the stereo panorama. In contrast, the “In-the-band”-approach aims at positioning all sounds (direct sounds as well as ambient sounds) around the listener using all available loudspeakers.
Decomposing an audio signal into direct and ambient signals also enables the separate modification of the ambient sounds or direct sounds, e.g. by scaling or filtering them. One use case is the processing of a recording of a musical performance which has been captured with an excessive amount of ambient sound. Another use case is audio production (e.g. for movie sound or music), where audio signals captured at different locations and therefore having different ambient sound characteristics are combined.
In any case, the requirement for such signal processing is to achieve high separation while maintaining high sound quality for an arbitrary number of input channel signals and for all possible input signal characteristics.
Various approaches in the conventional technology for DAD or for attenuating or boosting either the direct signal components or the ambient signal components have been provided, and are briefly reviewed in the following.
Known concepts relate to the processing of speech signals with the aim of removing undesired background noise from microphone recordings.
A method for attenuating the reverberation from speech recordings having two input channels is described in [1]. The reverberation signal components are reduced by attenuating the uncorrelated (or diffuse) signal components in the input signal. The processing is implemented in the time-frequency domain such that subband signals are processed by means of a spectral weighting method. The real-valued weighting factors are computed using the power spectral densities (PSD)
$$\phi_{xx}(m,k) = E\{X(m,k)\,X^*(m,k)\} \qquad (1)$$
$$\phi_{yy}(m,k) = E\{Y(m,k)\,Y^*(m,k)\} \qquad (2)$$
$$\phi_{xy}(m,k) = E\{X(m,k)\,Y^*(m,k)\} \qquad (3)$$
where X(m,k) and Y(m,k) denote time-frequency domain representations of the time-domain input signals x_t[n] and y_t[n], E{⋅} is the expectation operation and X* is the complex conjugate of X.
The original authors point out that different spectral weighting functions are feasible when proportional to ϕxy(m,k), e.g. when using weights equal to the normalized cross-correlation function (or coherence function)
$$\rho(m,k) = \frac{|\phi_{xy}(m,k)|}{\sqrt{\phi_{xx}(m,k)\,\phi_{yy}(m,k)}}. \qquad (4)$$
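The quantities in Formulas (1) to (4) may be illustrated by the following minimal sketch. It is not taken from [1]. It assumes that the expectation E{⋅} is approximated by first-order recursive averaging over the frame index m, and that the coherence is the magnitude of the cross-PSD normalized by the geometric mean of the two auto-PSDs. The function names, the smoothing constant `alpha` and the regularization constant `eps` are hypothetical choices made for illustration only.

```python
import numpy as np


def smoothed_psd(A, B, alpha=0.9):
    """Approximate E{A(m,k) B*(m,k)} by recursive averaging over frames m.

    A, B: complex arrays of shape (frames, bins).
    alpha: assumed smoothing constant, not specified in the source text.
    """
    phi = np.zeros(A.shape, dtype=complex)
    acc = np.zeros(A.shape[1], dtype=complex)
    for m in range(A.shape[0]):
        acc = alpha * acc + (1.0 - alpha) * A[m] * np.conj(B[m])
        phi[m] = acc
    return phi


def coherence(X, Y, alpha=0.9, eps=1e-12):
    """Normalized cross-correlation per frame m and frequency bin k."""
    phi_xx = smoothed_psd(X, X, alpha).real  # Formula (1)
    phi_yy = smoothed_psd(Y, Y, alpha).real  # Formula (2)
    phi_xy = smoothed_psd(X, Y, alpha)       # Formula (3)
    # Formula (4); eps only guards against division by zero.
    return np.abs(phi_xy) / (np.sqrt(phi_xx * phi_yy) + eps)
```

Under these assumptions, two channels that are scaled copies of each other yield values close to one, whereas independent noise in both channels yields values well below one once the averaging has settled.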
Following a similar rationale, the method described in [2] extracts an ambient signal using spectral weighting with weights derived from the normalized cross-correlation function computed in frequency bands, see Formula (4) (or, in the words of the original authors, the “interchannel short time coherence function”). The difference compared to [1] is that instead of attenuating the diffuse signal components, the direct signal components are attenuated using spectral weights which are a monotonic steady function of (1−ρ(m, k)).
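The weighting principle of this approach may be sketched as follows. The text only states that the weights are a monotonic function of (1−ρ(m, k)). The power-law mapping with exponent `gamma` used here is a hypothetical choice for illustration, as are the function names. It is not necessarily the mapping used in [2].

```python
import numpy as np


def ambience_weights(rho, gamma=1.0):
    """Map coherence values to real-valued spectral weights.

    Assumed mapping: (1 - rho) ** gamma, which decreases monotonically
    as rho goes from 0 (incoherent) to 1 (coherent).
    """
    rho = np.clip(np.asarray(rho, dtype=float), 0.0, 1.0)
    return (1.0 - rho) ** gamma


def extract_ambience(X, rho, gamma=1.0):
    """Apply the weights to a time-frequency representation X(m,k).

    The weights are real-valued, so the phase of X is left unchanged.
    """
    return ambience_weights(rho, gamma) * X
```

With this mapping, time-frequency bins with high interchannel coherence (dominated by direct sound) are attenuated, while bins with low coherence pass largely unchanged.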
The decomposition for the application of upmixing of input signals having two channels using multichannel Wiener filtering has been described in [3]. The processing is done in the time-frequency domain. The input signal is modelled as a mixture of the ambient signal and one active direct source (per frequency band), where the direct signal in one channel is restricted to be a scaled copy of the direct signal component in the second channel, i.e. amplitude panning. The panning coefficient and the powers of the direct signal and the ambient signal are estimated using the normalized cross-correlation and the input signal powers in both channels. The direct output signal and the ambient output signals are derived from linear combinations of the input signals, with real-valued weighting coefficients. Additional postscaling is applied such that the power of the output signals equals the estimated quantities.
The method described in [4] extracts an ambience signal using spectral weighting, based on an estimate of the ambience power. The ambience power is estimated based on the assumptions that the direct signal components in both channels are fully correlated, that the ambient channel signals are uncorrelated with each other and with the direct signals, and that the ambience powers in both channels are equal.
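One way in which the three stated assumptions lead to an ambience power estimate may be sketched as follows. This derivation is an illustration and is not quoted from [4]. Let P_a denote the common ambience power and let ϕxx, ϕyy and ϕxy denote the auto- and cross-PSDs of the two channels. Under the assumptions, ϕxx and ϕyy each equal the respective direct power plus P_a, and |ϕxy|² equals the product of the two direct powers. This gives the quadratic equation (ϕxx − P_a)(ϕyy − P_a) = |ϕxy|², of which the smaller root is taken because P_a cannot exceed either channel power. The function name is hypothetical.

```python
import numpy as np


def ambience_power(phi_xx, phi_yy, phi_xy):
    """Estimate the common ambience power P_a per time-frequency bin.

    Solves (phi_xx - P_a) * (phi_yy - P_a) = |phi_xy|**2 for the smaller
    root, which follows from the assumptions named in the lead-in.
    """
    s = phi_xx + phi_yy
    d = phi_xx - phi_yy
    p_a = 0.5 * (s - np.sqrt(d ** 2 + 4.0 * np.abs(phi_xy) ** 2))
    # Estimation noise can push the root slightly below zero.
    return np.maximum(p_a, 0.0)
```

For example, direct powers of 4 and 1 with an ambience power of 0.5 give ϕxx = 4.5, ϕyy = 1.5 and |ϕxy| = 2, from which the sketch recovers 0.5.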
A method for upmixing of stereo signals based on Directional Audio Coding (DirAC) is described in [5]. DirAC aims at analyzing and reproducing the direction of arrival, the diffuseness and the spectrum of a sound field. For upmixing of stereo input signals, anechoic B-format recordings of the input signals are simulated.
A method for extracting the uncorrelated reverberation from stereo audio signals using an adaptive filter algorithm is described in [6]. The algorithm aims at predicting the direct signal component in one channel signal from the other channel signal by means of a Least Mean Square (LMS) algorithm. Subsequently, the ambient signals are derived by subtracting the estimated direct signals from the input signals. The rationale of this approach is that the prediction only works for correlated signals and the prediction error resembles the uncorrelated signal. Various adaptive filter algorithms based on the LMS principle exist and are feasible, e.g. the LMS or the Normalized LMS (NLMS) algorithm.
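The prediction principle may be illustrated with a minimal time-domain NLMS sketch. It is not the implementation of [6]. The filter order, the step size `mu`, the regularization constant `eps` and the function name are hypothetical choices, and only one prediction direction (channel y predicted from channel x) is shown.

```python
import numpy as np


def nlms_ambience(x, y, order=8, mu=0.1, eps=1e-8):
    """Predict channel y from channel x with an NLMS adaptive filter.

    Returns (direct_estimate, ambience_estimate), where the ambience
    estimate is the prediction error y - direct_estimate.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(order)    # adaptive filter coefficients
    buf = np.zeros(order)  # most recent samples of x, newest first
    direct = np.zeros_like(y)
    ambience = np.zeros_like(y)
    for n in range(len(y)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        direct[n] = w @ buf                # predicted correlated part
        ambience[n] = y[n] - direct[n]     # prediction error
        # Normalized LMS update of the coefficients.
        w = w + mu * ambience[n] * buf / (buf @ buf + eps)
    return direct, ambience
```

When y consists of a scaled copy of x plus a weak uncorrelated component, the prediction error converges towards the uncorrelated component, which is the rationale stated above.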
For the decomposition of input signals with more than two channels, a method is described in [7] where the multichannel signals are first downmixed to obtain a 2-channel stereo signal and subsequently the method for processing stereo input signals presented in [3] is applied.
For the processing of mono signals, the method described in [8] extracts an ambience signal using spectral weighting where the spectral weights are computed using feature extraction and supervised learning.
Another method for extracting an ambience signal from mono recordings for the application of upmixing obtains the time-frequency domain representation of the ambience signal from the difference of the time-frequency domain representation of the input signal and a compressed version of it, advantageously computed using non-negative matrix factorization [9].
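The difference principle may be sketched as follows. This is not the implementation of [9]. The sketch assumes that the time-frequency representation is a non-negative magnitude spectrogram V, that the compressed version is a low-rank approximation W·H obtained with multiplicative updates for the Euclidean cost, and that negative differences are set to zero. The rank, the iteration count and the function names are hypothetical.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Non-negative matrix factorization V ~ W @ H.

    Uses multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def ambience_magnitude(V, rank=4, n_iter=200):
    """Residual between V and its low-rank (compressed) approximation.

    The half-wave rectification is an assumption of this sketch.
    """
    W, H = nmf(V, rank, n_iter)
    return np.maximum(V - W @ H, 0.0)
```

The intuition behind such a sketch is that a low-rank model captures repeating, well-structured spectral patterns, so that the residual tends to contain the less structured signal parts.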
A method for extracting and changing the reverberant signal components in an audio signal based on the estimation of the magnitude transfer function of the reverberant system which has generated the reverberant signal is described in [10]. An estimate of the magnitudes of the frequency domain representation of the signal components is derived by means of recursive filtering and can be modified.