The present invention relates to audio signal processing and, in particular, to center signal scaling and stereophonic enhancement based on the signal-to-downmix ratio.
Audio signals are in general a mixture of direct sounds and ambient (or diffuse) sounds. Direct sounds are emitted by sound sources, e.g., a musical instrument, a vocalist, or a loudspeaker, and arrive at the receiver, e.g., the listener's ear or a microphone, on the shortest possible path. A direct sound is perceived as coming from the direction of the sound source. The relevant auditory cues for localization and for other spatial sound properties are the interaural level difference (ILD), the interaural time difference (ITD), and the interaural coherence. Direct sound waves evoking identical ILD and ITD are perceived as coming from the same direction. In the absence of ambient sound, the signals reaching the left and the right ear, or any other set of spaced sensors, are coherent.
Ambient sounds, in contrast, are emitted by many spaced sound sources or by sound-reflecting boundaries contributing to the same sound. When a sound wave reaches a wall in a room, a portion of it is reflected, and the superposition of all reflections in a room, the reverberation, is a prominent example of ambient sound. Other examples are applause, babble noise, and wind noise. Ambient sounds are perceived as diffuse and not locatable, and evoke an impression of envelopment (of being “immersed in sound”) in the listener. When an ambient sound field is captured using a set of spaced sensors, the recorded signals are at least partially incoherent.
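The contrast between coherent direct sound and incoherent ambient sound can be illustrated numerically. The following sketch uses synthetic broadband noise and a simple lag-scanned normalized cross-correlation as the coherence measure; the signals, the 0.8 level difference, and the 8-sample delay are illustrative assumptions, not part of the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48000

# "Direct" sound: the same source signal reaches both sensors,
# here with a small delay and a level difference.
source = rng.standard_normal(n)
left_direct = source
right_direct = 0.8 * np.roll(source, 8)

# "Ambient" sound: independent noise at each sensor.
left_amb = rng.standard_normal(n)
right_amb = rng.standard_normal(n)

def coherence(x, y):
    """Maximum of the normalized cross-correlation over small lags."""
    norm = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return max(abs(np.dot(x, np.roll(y, k))) / norm for k in range(-16, 17))

print(coherence(left_direct, right_direct))  # close to 1: coherent
print(coherence(left_amb, right_amb))        # close to 0: incoherent
```

The direct pair reaches a coherence near one because the delayed, scaled copy lines up at one lag, whereas the independent noise pair stays near zero for all lags.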
Related known technology on separation, decomposition, or scaling is either based on panning information, i.e., inter-channel level differences (ICLD) and inter-channel time differences (ICTD), or based on signal characteristics of direct and of ambient sounds. Methods taking advantage of ICLD in two-channel stereophonic recordings are the upmix method described in C. Avendano and J.-M. Jot, “A frequency-domain approach to multi-channel upmix,” J. Audio Eng. Soc., vol. 52, 2004; the Azimuth Discrimination and Resynthesis (ADRess) algorithm described in D. Barry, B. Lawlor, and E. Coyle, “Sound source separation: Azimuth discrimination and resynthesis,” in Proc. Int. Conf. Digital Audio Effects (DAFx), 2004; the upmix from two-channel input signals to three channels proposed by E. Vickers in “Two-to-three channel upmix for center channel derivation and speech enhancement,” in Proc. Audio Eng. Soc. 127th Conv., 2009; and the center signal extraction described in D. Jang, J. Hong, H. Jung, and K. Kang, “Center channel separation based on spatial analysis,” in Proc. Int. Conf. Digital Audio Effects (DAFx), 2008.
The Degenerate Unmixing Estimation Technique (DUET) described in A. Jourjine, S. Rickard, and O. Yilmaz, “Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2000; and O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Trans. on Signal Proc., vol. 52, pp. 1830-1847, 2004, is based on clustering the time-frequency bins into sets with similar ICLD and ICTD. A restriction of the original method is that, due to ambiguities in the ICTD estimation, the maximum frequency which can be processed equals half the speed of sound divided by the maximum microphone spacing; this restriction has been addressed in S. Rickard, “The DUET blind source separation algorithm,” in Blind Speech Separation, S. Makino, T.-W. Lee, and H. Sawada, Eds. Springer, 2007. The performance of the method decreases when sources overlap in the time-frequency domain and when the reverberation increases. Other methods based on ICLD and ICTD are the Modified ADRess algorithm described in N. Cahill, R. Cooney, K. Humphreys, and R. Lawlor, “Speech source enhancement using a modified ADRess algorithm for applications in mobile communications,” in Proc. Audio Eng. Soc. 121st Conv., 2006, which extends the above-mentioned ADRess algorithm to the processing of spaced microphone recordings; the method based on time-frequency correlation (AD-TIFCORR) for time-delayed mixtures described in M. Puigt and Y. Deville, “A time-frequency correlation-based blind source separation method for time-delay mixtures,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2006; the Direction Estimation of Mixing Matrix (DEMIX) method for anechoic mixtures described in S. Arberet, R. Gribonval, and F. Bimbot, “A robust method to count and locate audio sources in a stereophonic linear anechoic mixture,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2007, which includes a confidence measure that only one source is active at a particular time-frequency bin; the Model-based Expectation-Maximization Source Separation and Localization (MESSL) described in M. I. Mandel, R. J. Weiss, and D. P. W. Ellis, “Model-based expectation-maximization source separation and localization,” IEEE Trans. on Audio, Speech and Language Proc., vol. 18, pp. 382-394, 2010; and methods mimicking the binaural human hearing mechanism as in, e.g., H. Viste and G. Evangelista, “On the use of spatial cues to improve binaural source separation,” in Proc. Int. Conf. Digital Audio Effects (DAFx), 2003; and A. Favrot, M. Erne, and C. Faller, “Improved cocktail-party processing,” in Proc. Int. Conf. Digital Audio Effects (DAFx), 2006.
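The DUET frequency restriction can be stated as f_max = c / (2d), where c is the speed of sound and d the maximum microphone spacing. A short numerical sketch follows; the 343 m/s value and the 2 cm spacing are illustrative assumptions.

```python
# Upper frequency limit for unambiguous ICTD estimation in DUET:
# the inter-channel phase difference must stay within (-pi, pi),
# which bounds the usable frequency by f_max = c / (2 * d).
SPEED_OF_SOUND = 343.0  # m/s, assumed value at roughly 20 degrees Celsius

def duet_max_frequency(mic_spacing_m):
    """Highest frequency free of spatial aliasing for a given spacing."""
    return SPEED_OF_SOUND / (2.0 * mic_spacing_m)

print(duet_max_frequency(0.02))  # 2 cm spacing -> 8575.0 Hz
```

Even a closely spaced pair of microphones thus limits unambiguous ICTD estimation to a fraction of the audible range, which motivated the extension cited above.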
Besides the above-mentioned methods for Blind Source Separation (BSS) using spatial cues of direct signal components, the extraction and attenuation of ambient signals are also related to the presented method. Methods based on the inter-channel coherence (ICC) in two-channel signals are described in J. B. Allen, D. A. Berkley, and J. Blauert, “Multimicrophone signal-processing technique to remove room reverberation from speech signals,” J. Acoust. Soc. Am., vol. 62, 1977; C. Avendano and J.-M. Jot, “A frequency-domain approach to multi-channel upmix,” J. Audio Eng. Soc., vol. 52, 2004; and J. Merimaa, M. Goodwin, and J.-M. Jot, “Correlation-based ambience extraction from stereo recordings,” in Proc. Audio Eng. Soc. 123rd Conv., 2007. The application of adaptive filtering has been proposed in J. Usher and J. Benesty, “Enhancement of spatial sound quality: A new reverberation-extraction audio upmixer,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, pp. 2141-2150, 2007, with the rationale that direct signals can be predicted across channels, whereas diffuse sounds are obtained from the prediction error.
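The prediction rationale can be sketched with a normalized LMS filter; this is a toy illustration of the general idea, not the cited upmixer. The synthetic two-channel mixture, the filter length, and the step size are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Synthetic two-channel signal: the direct component is common to both
# channels (the right channel sees it through a short mixing filter),
# while the ambient component is independent noise in the right channel.
direct = rng.standard_normal(n)
left = direct
ambience = 0.5 * rng.standard_normal(n)
right = 0.7 * direct + 0.3 * np.concatenate(([0.0], direct[:-1])) + ambience

# NLMS adaptive filter predicting the right channel from the left:
# the predictable part is taken as direct sound, the prediction
# error as the ambient (diffuse) part.
taps, mu = 8, 0.1
w = np.zeros(taps)
err = np.zeros(n)
for i in range(taps - 1, n):
    x = left[i - taps + 1:i + 1][::-1]   # most recent sample first
    y_hat = w @ x                        # predicted direct part
    err[i] = right[i] - y_hat            # prediction error ~ ambience
    w += mu * err[i] * x / (x @ x + 1e-8)
```

After convergence, the prediction error correlates with the true ambient component rather than with the cross-channel direct signal.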
A method for upmixing two-channel stereophonic signals based on multichannel Wiener filtering, which estimates both the ICLD of the direct sounds and the power spectral densities (PSD) of the direct and ambient signal components, is described in C. Faller, “Multiple-loudspeaker playback of stereo signals,” J. Audio Eng. Soc., vol. 54, 2006.
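Once such PSD estimates are available, a generic per-band Wiener gain weights each band by its direct-to-total power ratio. The following is a minimal sketch of that generic gain, not of Faller's complete method; the PSD values are hypothetical.

```python
import numpy as np

# Hypothetical per-band PSD estimates of the direct and ambient components.
psd_direct = np.array([4.0, 1.0, 0.25])
psd_ambient = np.array([1.0, 1.0, 1.0])

# Wiener gain: each band is scaled by the fraction of its total power
# attributed to the direct component.
gain = psd_direct / (psd_direct + psd_ambient)
print(gain)  # [0.8 0.5 0.2]
```

Bands dominated by direct sound pass nearly unchanged, while bands dominated by ambience are attenuated.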
Approaches to the extraction of ambient signals from single-channel recordings include the use of Non-Negative Matrix Factorization of a time-frequency representation of the input signal, where the ambient signal is obtained from the residual of the factorization-based approximation, as described in C. Uhle, A. Walther, O. Hellmuth, and J. Herre, “Ambience separation from mono recordings using Non-negative Matrix Factorization,” in Proc. Audio Eng. Soc. 30th Int. Conf., 2007; low-level feature extraction and supervised learning as described in C. Uhle and C. Paul, “A supervised learning approach to ambience extraction from mono recordings for blind upmixing,” in Proc. Int. Conf. Digital Audio Effects (DAFx), 2008; and the estimation of the impulse response of a reverberant system and inverse filtering in the frequency domain as described in G. Soulodre, “System for extracting and changing the reverberant content of an audio input signal,” U.S. Pat. No. 8,036,767, October 2011.
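The residual-based idea behind the NMF approach can be sketched on a toy magnitude spectrogram: a low-rank factorization captures the repetitive direct part, and the residual is taken as ambience. This is a simplified illustration using standard multiplicative updates with a Euclidean cost, not the cited system; the synthetic spectrogram, rank, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy magnitude "spectrogram": a low-rank part (direct, repetitive)
# plus a full-rank small noisy part (ambience).
F, T, rank = 32, 100, 2
direct = np.abs(rng.standard_normal((F, rank))) @ np.abs(rng.standard_normal((rank, T)))
ambience = 0.1 * np.abs(rng.standard_normal((F, T)))
V = direct + ambience

# NMF via multiplicative updates (Euclidean distance, Lee-Seung rules).
W = np.abs(rng.standard_normal((F, rank))) + 0.1
H = np.abs(rng.standard_normal((rank, T))) + 0.1
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ (H @ H.T) + 1e-9)

# The residual of the low-rank approximation is taken as the
# ambient part of the spectrogram.
residual = V - W @ H
```

Because the direct part is low-rank and the ambience is not, the factorization absorbs mostly the direct component and leaves the ambience in the residual.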