The present invention is related to audio signal processing and, in particular, to downmixing of a plurality of input signals to a downmix signal.
In signal processing, it is often necessary to mix two or more signals into one sum signal. The mixing procedure usually introduces some signal impairments, especially if the two signals to be mixed contain similar but phase-shifted signal parts. If those signals are summed, the resulting signal contains severe comb-filter artifacts. To prevent those artifacts, different methods have been suggested, which are either very costly in terms of computational complexity or based on applying a correction gain or term to the already impaired signal.
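The comb-filter effect can be illustrated with a minimal sketch that assumes the simplest case of a phase-shifted signal part: a signal summed with a copy of itself delayed by a fixed time. The function name `sum_magnitude` and the 1 ms delay are hypothetical and chosen only for illustration.

```python
import cmath
import math


def sum_magnitude(freq_hz, delay_s):
    """Magnitude of x(t) + x(t - delay_s) relative to x(t) for a sinusoid at freq_hz."""
    # A pure delay shifts the phase of a sinusoid by 2*pi*f*delay.
    phase = 2.0 * math.pi * freq_hz * delay_s
    return abs(1.0 + cmath.exp(-1j * phase))


delay_s = 0.001  # assumed 1 ms offset between the two signals (illustrative only)
for freq_hz in (0, 250, 500, 1000, 1500):
    # Notches appear wherever the delay corresponds to an odd multiple of 180 degrees.
    print(freq_hz, sum_magnitude(freq_hz, delay_s))
```

Under this assumption the sum is doubled at 0 Hz and 1000 Hz and almost fully cancelled at 500 Hz and 1500 Hz, which is the regularly spaced notch pattern referred to as comb filtering.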
Converting multi-channel audio signals into a smaller number of channels normally implies mixing several audio channels. The ITU, for instance, recommends using a time-domain, passive mix matrix with static gains for a downward conversion from a certain multi-channel setup to another [1]. In [2] a quite similar approach is proposed.
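A time-domain passive downmix with static gains amounts to a fixed weighted sum per sample. The following sketch shows the idea for one output channel; the function name `passive_downmix`, the channel names and the gain values are illustrative assumptions and are not taken from [1] or [2].

```python
def passive_downmix(channels, gains):
    """Mix equal-length channels sample by sample using fixed (static) gains."""
    num_samples = len(channels[0])
    return [
        sum(gain * channel[i] for gain, channel in zip(gains, channels))
        for i in range(num_samples)
    ]


# Hypothetical three-channel input mixed into a single output channel.
left = [1.0, 0.5, -0.5]
center = [0.2, 0.2, 0.2]
surround = [0.0, -0.5, 0.5]
gains = [1.0, 0.7071, 0.7071]  # assumed static gains, for illustration only

downmix = passive_downmix([left, center, surround], gains)
print(downmix)
```

Because the gains never depend on the signal content, such a mix cannot react to phase relations between the channels.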
To increase dialogue intelligibility, a combined approach of using the ITU-based and a matrix-based downmix is proposed in [3]. Also, audio coders utilize a passive downmix of channels, e.g. in some parametric modules [4, 5, 6].
The approach described in [7] performs a loudness measurement of every input and output channel, i.e. of every single channel before and after the mixing process. By taking the ratio of the sum of the input energies (i.e. energy of the channels supposed to be mixed) and the output energy (i.e. energy of the mixed channels), gains can be derived such that signal energy loss and coloration effects are reduced.
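The energy-ratio idea can be sketched as follows. This is not the method of [7] itself: plain sample energy stands in for the loudness measurement, and taking the square root of the ratio to obtain an amplitude gain is an assumption. The names `energy` and `energy_correction_gain` are hypothetical.

```python
import math


def energy(signal):
    """Sum of squared samples (a stand-in for a loudness measurement)."""
    return sum(s * s for s in signal)


def energy_correction_gain(inputs, mixed, eps=1e-12):
    """Amplitude gain that scales the mix to the summed energy of its inputs."""
    input_energy = sum(energy(channel) for channel in inputs)
    output_energy = energy(mixed)
    # eps guards against division by zero when the mix cancels completely.
    return math.sqrt(input_energy / max(output_energy, eps))


# Two partly cancelling inputs: the passive mix loses energy.
a = [1.0, 0.0, -1.0, 0.0]
b = [-0.5, 0.0, 0.5, 0.0]
mixed = [x + y for x, y in zip(a, b)]

gain = energy_correction_gain([a, b], mixed)
corrected = [gain * m for m in mixed]
print(gain, energy(corrected))
```

In this example the inputs carry a total energy of 2.5, the passive mix only 0.5, and the derived gain restores the energy of the mix to that of the inputs.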
The approach described in [8] performs a passive downmix, which is afterwards transformed into the frequency domain. The downmix is then analyzed by a spatial correction stage which tries to detect and correct any spatial inconsistencies through modifications to the inter-channel level differences and inter-channel phase differences. Then, an equalizer is applied to the signal to ensure that the downmix signal has the same power as the input signal. In the last step, the downmix signal is transformed back into the time domain.
A different approach is disclosed in [9, 10], where two signals to be downmixed are transformed into the frequency domain and a desired/actual value pair is built. The desired value is calculated as the root of the sum of the individual energies, whereas the actual value is computed as the root of the energy of the sum signal. The two values are then compared and, depending on whether the actual value is greater or less than the desired value, a different correction is applied to the actual value.
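The desired/actual value pair can be sketched for a single frequency bin as shown below. Representing each bin as one complex number, and the name `desired_and_actual`, are assumptions made for illustration; the corrections applied in [9, 10] after the comparison are not reproduced here.

```python
import math


def desired_and_actual(a_bin, b_bin):
    """Return (desired, actual) magnitudes for one complex frequency bin."""
    # Desired: root of the sum of the individual energies.
    desired = math.sqrt(abs(a_bin) ** 2 + abs(b_bin) ** 2)
    # Actual: root of the energy of the sum signal.
    actual = abs(a_bin + b_bin)
    return desired, actual


# In-phase components add constructively: actual exceeds desired.
in_phase = desired_and_actual(1.0 + 0.0j, 1.0 + 0.0j)
# Partly opposing components cancel: actual falls below desired.
opposing = desired_and_actual(1.0 + 0.0j, -0.5 + 0.0j)

for desired, actual in (in_phase, opposing):
    print(desired, actual, "greater" if actual > desired else "not greater")
```

The two example bins land on opposite sides of the comparison, which is the condition that selects the correction to be applied.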
Alternatively, there are methods which aim at aligning the signals' phases, such that no signal cancelation effects occur due to phase differences. Such methods were proposed, for instance, for parametric stereo encoders [11, 12, 13].
A passive downmix as done in [1, 2, 3, 4, 5, 6] is the most straightforward approach to mixing signals. However, if no further action is taken, the resulting downmix signals might suffer from severe signal loss and comb-filtering effects.
The approaches described in [7, 8, 9, 10] perform a passive downmix, in the sense of equally mixing both signals, in the first step. Afterwards, some corrections are applied to the downmixed signal. This might help to reduce comb-filter effects but, on the other hand, will introduce modulation artifacts, which are caused by correction gains/terms that change rapidly over time. Furthermore, a phase shift of 180 degrees between the signals to be downmixed still results in a zero-valued downmix and cannot be compensated for by applying, for instance, a correction gain.
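The limitation for a 180 degree phase shift can be made concrete with a small sketch. It assumes two signals of equal amplitude, one being the inverted copy of the other; the function name `mix_and_correct` and the gain value are hypothetical.

```python
import math


def mix_and_correct(a, b, gain):
    """Passively sum two signals, then apply a correction gain to the sum."""
    return [gain * (x + y) for x, y in zip(a, b)]


num_samples = 8
a = [math.sin(2.0 * math.pi * k / num_samples) for k in range(num_samples)]
b = [-s for s in a]  # same signal shifted by 180 degrees (inverted)

# Even a very large correction gain leaves the cancelled mix at zero.
corrected = mix_and_correct(a, b, gain=1000.0)
print(corrected)
```

Since every sample of the sum is zero before the gain is applied, scaling cannot recover any signal content.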
A phase-alignment approach, such as mentioned in [11, 12, 13], may help to avoid unwanted signal cancelation; however, because the phase-aligned signals are still simply added up, comb-filter and cancelation effects may occur if the phases are not estimated properly. Additionally, robustly estimating the phase relations between two signals is not an easy task and is computationally intensive, especially if done for more than two signals.