The present invention relates to audio processing and, in particular, to the decomposition of audio signals into different components, such as perceptually distinct components.
The human auditory system senses sound from all directions. The perceived auditory environment (the adjective auditory denotes what is perceived, while the word sound is used to describe physical phenomena) creates an impression of the acoustic properties of the surrounding space and of the sound events occurring in it. The auditory impression perceived in a specific sound field can (at least partially) be modeled by considering three different types of signals at the ear entrances: the direct sound, early reflections, and diffuse reflections. These signals contribute to the formation of a perceived auditory spatial image.
Direct sound denotes the waves of each sound event that first reach the listener directly from a sound source without disturbances. It is characteristic of the sound source and provides the least-compromised information about the direction of incidence of the sound event. The primary cues for estimating the direction of a sound source in the horizontal plane are differences between the left and right ear input signals, namely interaural time differences (ITDs) and interaural level differences (ILDs). Subsequently, a multitude of reflections of the direct sound arrive at the ears from different directions and with different relative time delays and levels. With increasing time delay relative to the direct sound, the density of the reflections increases until they constitute a statistical clutter.
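The two interaural cues named above can be illustrated with a minimal sketch. It is not taken from any of the cited references; the function names, the broadband energy-ratio definition of the ILD, and the cross-correlation search for the ITD are assumptions made purely for illustration, and both input signals are assumed to be non-silent.

```python
import math

def interaural_level_difference_db(left, right):
    """ILD in dB as the ratio of left to right signal energy.

    Positive values mean the left ear signal is louder.
    Assumes both signals contain non-zero energy.
    """
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    return 10.0 * math.log10(energy_left / energy_right)

def interaural_time_difference(left, right, sample_rate, max_lag):
    """ITD in seconds as the lag that maximizes the cross-correlation.

    Positive values mean the right ear signal arrives later than the left.
    """
    def cross_correlation(lag):
        # Sum over n of left[n] * right[n + lag], skipping out-of-range samples.
        return sum(left[n] * right[n + lag]
                   for n in range(len(left))
                   if 0 <= n + lag < len(right))

    best_lag = max(range(-max_lag, max_lag + 1), key=cross_correlation)
    return best_lag / sample_rate

# Hypothetical test signal: an impulse that reaches the right ear
# three samples later and at half the amplitude.
left_ear = [0.0] * 32
right_ear = [0.0] * 32
left_ear[10] = 1.0
right_ear[13] = 0.5

ild_db = interaural_level_difference_db(left_ear, right_ear)
itd_s = interaural_time_difference(left_ear, right_ear, 48000, 8)
```

In this example the ILD is about 6 dB in favor of the left ear, and the ITD corresponds to a delay of three samples at the assumed sampling rate of 48 kHz.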
The reflected sound contributes to distance perception and to the auditory spatial impression, which is composed of at least two components: apparent source width (ASW), also commonly called auditory spaciousness, and listener envelopment (LEV). ASW is defined as a broadening of the apparent width of a sound source and is primarily determined by early lateral reflections. LEV refers to the listener's sense of being enveloped by sound and is determined primarily by late-arriving reflections. The goal of electroacoustic stereophonic sound reproduction is to evoke the perception of a pleasing auditory spatial image. This image can have a natural or architectural reference (e.g. the recording of a concert in a hall), or it may be a sound field that does not exist in reality (e.g. electroacoustic music).
From the field of concert hall acoustics, it is well known that a strong sense of auditory spatial impression, with LEV being an integral part, is important for obtaining a subjectively pleasing sound field. The ability of loudspeaker setups to reproduce an enveloping sound field by means of reproducing a diffuse sound field is therefore of interest. In a synthetic sound field it is not possible to reproduce all naturally occurring reflections using dedicated transducers. That is especially true for diffuse late reflections. The timing and level properties of diffuse reflections can be simulated by using “reverberated” signals as loudspeaker feeds. If those signals are sufficiently uncorrelated, the number and location of the loudspeakers used for playback determine whether the sound field is perceived as being diffuse. The goal is to evoke the perception of a continuous, diffuse sound field using only a discrete number of transducers, that is, to create sound fields in which no direction of sound arrival can be estimated and, in particular, no single transducer can be localized. The subjective diffuseness of synthetic sound fields can be evaluated in subjective tests.
Stereophonic sound reproduction aims at evoking the perception of a continuous sound field using only a discrete number of transducers. The most desired features are directional stability of localized sources and realistic rendering of the surrounding auditory environment. The majority of formats used today to store or transport stereophonic recordings are channel-based. Each channel conveys a signal that is intended to be played back over an associated loudspeaker at a specific position. A specific auditory image is designed during the recording or mixing process. This image is accurately recreated if the loudspeaker setup used for reproduction resembles the target setup that the recording was designed for.
The number of feasible transmission and playback channels constantly grows, and with every emerging audio reproduction format comes the desire to render legacy format content over the actual playback system. Upmix algorithms are a solution to this desire, computing a signal with more channels from a legacy signal. A number of stereo upmix algorithms have been proposed in the literature, e.g. Carlos Avendano and Jean-Marc Jot, “A frequency-domain approach to multichannel upmix”, Journal of the Audio Engineering Society, vol. 52, no. 7/8, pp. 740-749, 2004; Christof Faller, “Multiple-loudspeaker playback of stereo signals,” Journal of the Audio Engineering Society, vol. 54, no. 11, pp. 1051-1064, November 2006; John Usher and Jacob Benesty, “Enhancement of spatial sound quality: A new reverberation-extraction audio upmixer,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2141-2150, September 2007. Most of these algorithms are based on a direct/ambient signal decomposition followed by rendering adapted to the target loudspeaker setup.
The described direct/ambient signal decompositions are not readily applicable to multi-channel surround signals. It is not easy to formulate a signal model and filtering to obtain from N audio channels the corresponding N direct sound and N ambient sound channels. The simple signal model used in the stereo case, see e.g. Christof Faller, “Multiple-loudspeaker playback of stereo signals,” Journal of the Audio Engineering Society, vol. 54, no. 11, pp. 1051-1064, November 2006, assuming direct sound to be correlated amongst all channels, does not capture the diversity of channel relations that can exist between surround signal channels.
The general goal of stereophonic sound reproduction is to evoke the perception of a continuous sound field using only a limited number of transmission channels and transducers. Two loudspeakers are the minimum requirement for spatial sound reproduction. Modern consumer systems often offer a larger number of reproduction channels. Basically, stereophonic signals (independent of the number of channels) are recorded or mixed such that, for each source, the direct sound enters a number of channels coherently (i.e. dependently) with specific directional cues, while reflected, independent sounds enter a number of channels and determine the cues for apparent source width and listener envelopment. Correct perception of the intended auditory image is usually only possible at the ideal point of observation in the playback setup the recording was intended for. Adding more loudspeakers to a given loudspeaker setup usually enables a more realistic reconstruction or simulation of a natural sound field. To take full advantage of an extended loudspeaker setup when the input signals are given in another format, or to manipulate the perceptually distinct parts of the input signal, those parts have to be separately accessible. This specification describes a method for separating the dependent and independent components of stereophonic recordings comprising an arbitrary number of input channels.
A decomposition of audio signals into perceptually distinct components is needed for high-quality signal modification, enhancement, adaptive playback, and perceptual coding. A number of methods have recently been proposed that allow the manipulation and/or extraction of perceptually distinct signal components from two-channel input signals. Since input signals with more than two channels are becoming more and more common, the described manipulations are also desirable for multichannel input signals. However, most of the concepts described for two-channel input cannot easily be extended to work with input signals having an arbitrary number of channels.
If one were to perform a signal analysis into direct and ambience parts with, for example, a 5.1 channel surround signal having a left channel, a center channel, a right channel, a left surround channel, a right surround channel and a low-frequency enhancement (subwoofer) channel, it is not straightforward how a direct/ambience signal analysis should be applied. One might think of comparing each pair of the six channels, resulting in a hierarchical processing which has, in the end, up to 15 different comparison operations. Then, when all of these 15 comparison operations have been done, so that each channel has been compared to every other channel, one would have to determine how the 15 results should be evaluated. This is time-consuming, the results are hard to interpret, and, due to the considerable amount of processing resources involved, the approach is not usable for e.g. real-time applications of direct/ambience separation or, generally, of signal decompositions which may be used, for example, in the context of upmix or any other audio processing operations.
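The figure of 15 comparisons follows from the number of unordered channel pairs, N(N-1)/2 with N = 6. The following minimal sketch enumerates these pairs; the channel labels and the function name are chosen here for illustration only.

```python
from itertools import combinations

def channel_pairs(channels):
    """Return every unordered pair of distinct channels."""
    return list(combinations(channels, 2))

# Illustrative labels for the six channels of a 5.1 surround signal.
surround_5_1 = ["L", "C", "R", "Ls", "Rs", "LFE"]
pairs_5_1 = channel_pairs(surround_5_1)

print(len(pairs_5_1))  # 15
```

Since the number of pairs grows quadratically with the number of channels, a format with more channels increases the cost of such a pairwise analysis correspondingly.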
In M. M. Goodwin and J. M. Jot, “Primary-ambient signal decomposition and vector-based localization for spatial audio coding and enhancement,” in Proc. of ICASSP 2007, 2007, a principal component analysis is applied to the input channel signals to perform the primary (=direct) and ambient signal decomposition.
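The general idea of a decomposition based on principal component analysis can be sketched for the two-channel case as follows. This is not the method of the cited reference, which is not reproduced here; it is a simplified broadband, time-domain illustration in which the signals are assumed to be zero-mean, the principal direction of the 2x2 channel covariance is obtained in closed form, and all function and variable names are hypothetical.

```python
import math

def pca_primary_ambient(left, right):
    """Split a two-channel signal into primary and ambient parts.

    The primary part is the projection of each sample pair onto the
    principal axis of the channel covariance; the ambient part is the
    residual. Assumes zero-mean input signals.
    """
    var_left = sum(l * l for l in left)
    var_right = sum(r * r for r in right)
    covariance = sum(l * r for l, r in zip(left, right))

    # Closed-form angle of the principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * covariance, var_left - var_right)
    axis_left, axis_right = math.cos(theta), math.sin(theta)

    primary, ambient = [], []
    for l, r in zip(left, right):
        projection = l * axis_left + r * axis_right
        p_left, p_right = projection * axis_left, projection * axis_right
        primary.append((p_left, p_right))
        ambient.append((l - p_left, r - p_right))
    return primary, ambient

# Hypothetical example: one source panned so that the right channel
# carries the same signal at half the amplitude of the left channel.
source = [1.0, -2.0, 3.0, -1.0]
primary, ambient = pca_primary_ambient(source, [0.5 * s for s in source])
```

For such a fully coherent input, the whole signal lies along the principal axis, so the ambient residual vanishes up to rounding error.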
The models used in Christof Faller, “Multiple-loudspeaker playback of stereo signals,” Journal of the Audio Engineering Society, vol. 54, no. 11, pp. 1051-1064, November 2006 and C. Faller, “A highly directive 2-capsule based microphone system,” in Preprint 123rd Conv. Aud. Eng. Soc., October 2007 assume de-correlated or partially correlated diffuse sound in stereo and microphone signals, respectively. Given this assumption, filters for extracting the diffuse/ambient signal are derived. These approaches are limited to single-channel and two-channel audio signals.
A further reference is C. Avendano and J.-M. Jot, “A frequency-domain approach to multichannel upmix”, Journal of the Audio Engineering Society, vol. 52, no. 7/8, pp. 740-749, 2004. The reference M. M. Goodwin and J. M. Jot, “Primary-ambient signal decomposition and vector-based localization for spatial audio coding and enhancement,” in Proc. of ICASSP 2007, 2007, comments on the Avendano and Jot reference as follows. The reference provides an approach which involves creating a time-frequency mask to extract the ambience from a stereo input signal. The mask is based on the cross-correlation between the left-channel and right-channel signals, however, so this approach is not immediately applicable to the problem of extracting ambience from an arbitrary multichannel input. To use any such correlation-based method in this higher-order case would call for a hierarchical pairwise correlation analysis, which would entail a significant computational cost, or some alternate measure of multichannel correlation.
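How a cross-correlation between two channels can be turned into an ambience weight may be illustrated by the following minimal sketch. It is not the time-frequency mask of the cited reference; it operates on time-domain blocks rather than on time-frequency tiles, the mapping from correlation to weight (one minus the magnitude of the normalized correlation) is an assumption made for illustration, and all names are hypothetical.

```python
import math

def normalized_correlation(left, right):
    """Normalized zero-lag cross-correlation, between -1 and 1.

    Returns 0.0 if either block is silent.
    """
    cross = sum(l * r for l, r in zip(left, right))
    energy = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return cross / energy if energy > 0.0 else 0.0

def ambience_weights(left, right, block_size):
    """One weight per block: near 0 for coherent, near 1 for uncorrelated."""
    length = min(len(left), len(right))
    weights = []
    for start in range(0, length - block_size + 1, block_size):
        rho = normalized_correlation(left[start:start + block_size],
                                     right[start:start + block_size])
        weights.append(1.0 - abs(rho))
    return weights

# Hypothetical input: a coherent first block (right = 2 * left)
# followed by a second block whose channels are orthogonal.
left_channel = [1.0, 2.0, 3.0, 4.0, 1.0, 0.0, -1.0, 0.0]
right_channel = [2.0, 4.0, 6.0, 8.0, 0.0, 1.0, 0.0, -1.0]
weights = ambience_weights(left_channel, right_channel, 4)
```

Here the first block receives a weight close to zero and the second a weight close to one. The measure is defined for exactly two channels, which is why a multichannel input would call for the pairwise analysis mentioned above or for a different measure.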
Spatial Impulse Response Rendering (SIRR) (Juha Merimaa and Ville Pulkki, “Spatial impulse response rendering”, in Proc. of the 7th Int. Conf. on Digital Audio Effects (DAFx '04), 2004) estimates the direct sound, together with its direction, and the diffuse sound in B-Format impulse responses. Very similar to SIRR, Directional Audio Coding (DirAC) (Ville Pulkki, “Spatial sound reproduction with directional audio coding,” Journal of the Audio Engineering Society, vol. 55, no. 6, pp. 503-516, June 2007) applies a similar direct and diffuse sound analysis to continuous B-Format audio signals.
The approach presented in Julia Jakka, Binaural to Multichannel Audio Upmix, Master's thesis, Helsinki University of Technology, 2005 describes an upmix using binaural signals as input.
The reference Boaz Rafaely, “Spatially Optimal Wiener Filtering in a Reverberant Sound Field,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2001, October 21 to 24, 2001, New Paltz, N.Y., describes the derivation of Wiener filters which are spatially optimal for reverberant sound fields. An application to two-microphone noise cancellation in reverberant rooms is given. The optimal filters, which are derived from the spatial correlation of diffuse sound fields, capture the local behavior of the sound fields and are therefore of lower order and potentially more spatially robust than conventional adaptive noise cancellation filters in reverberant rooms. Formulations for unconstrained and causally constrained optimal filters are presented, and an example application to two-microphone speech enhancement is demonstrated using a computer simulation.