In modern digital audio systems, it is a major trend to allow for audio-object related modifications of the transmitted content on the receiver side. These modifications include gain modifications of selected parts of the audio signal and/or spatial re-positioning of dedicated audio objects in case of multi-channel playback via spatially distributed speakers. This may be achieved by individually delivering different parts of the audio content to the different speakers.
In other words, in the art of audio processing, audio transmission, and audio storage, there is an increasing desire to allow for user interaction on object-oriented audio content playback and also a demand to utilize the extended possibilities of multi-channel playback to individually render audio contents or parts thereof in order to improve the hearing impression. By this, the usage of multi-channel audio content brings along significant improvements for the user. For example, a three-dimensional hearing impression can be obtained, which brings along an improved user satisfaction in entertainment applications. However, multi-channel audio content is also useful in professional environments, for example in telephone conferencing applications, because the talker intelligibility can be improved by using a multi-channel audio playback. Another possible application is to offer to a listener of a musical piece to individually adjust playback level and/or spatial position of different parts (also termed as “audio objects”) or tracks, such as a vocal part or different instruments. The user may perform such an adjustment for reasons of personal taste, for easier transcribing one or more part(s) from the musical piece, educational purposes, karaoke, rehearsal, etc.
The straightforward discrete transmission of all digital multi-channel or multi-object audio content, e.g., in the form of pulse code modulation (PCM) data or even compressed audio formats, demands very high bitrates. However, it is also desirable to transmit and store audio data in a bitrate efficient way. Therefore, one is willing to accept a reasonable tradeoff between audio quality and bitrate requirements in order to avoid an excessive resource load caused by multi-channel/multi-object applications.
Recently, in the field of audio coding, parametric techniques for the bitrate-efficient transmission/storage of multi-channel/multi-object audio signals have been introduced by, e.g., the Moving Picture Experts Group (MPEG) and others. One example is MPEG Surround [ISO/IEC 23003-1:2007, MPEG-D (MPEG audio technologies), Part 1: MPEG Surround, 2007] as a channel oriented approach [ISO/IEC 23003-1:2007, MPEG-D (MPEG audio technologies), Part 1: MPEG Surround, 2007, and C. Faller and F. Baumgarte, “Binaural Cue Coding—Part II: Schemes and applications,” IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, November 2003], or MPEG Spatial Audio Object Coding (SAOC) as an object oriented approach [C. Faller, “Parametric Joint-Coding of Audio Sources,” 120th AES Convention, Paris, 2006; ISO/IEC, “MPEG audio technologies—Part 2: Spatial Audio Object Coding (SAOC)”, ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard 23003-2; J. Herre, S. Disch, J. Hilpert, O. Hellmuth: “From SAC To SAOC—Recent Developments in Parametric Coding of Spatial Audio,” 22nd Regional UK AES Conference, Cambridge, UK, April 2007; J. Engdegård, B. Resch, C. Falch, O. Hellmuth, J. Hilpert, A. Holzer, L. Terentiev, J. Breebaart, J. Koppens, E. Schuijers and W. Oomen: “Spatial Audio Object Coding (SAOC)—The Upcoming MPEG Standard on Parametric Object Based Audio Coding,” 124th AES Convention, Amsterdam 2008]. Another object-oriented approach is termed as “informed source separation” [M. Parvaix and L. Girin: “Informed Source Separation of underdetermined instantaneous Stereo Mixtures using Source Index Embedding,” IEEE ICASSP, 2010; M. Parvaix, L. Girin, J.-M. Brassier: “A watermarking-based method for informed source separation of audio signals with a single sensor,” IEEE Transactions on Audio, Speech and Language Processing, 2010; A. Liutkus and J. Pinel and R. Badeau and L. Girin and G. Richard: “Informed source separation through spectrogram coding and data embedding,” Signal Processing Journal, 2011; A. Ozerov, A. Liutkus, R. Badeau, G. Richard: “Informed source separation: source coding meets source separation,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011; Shuhua Zhang and Laurent Girin: “An Informed Source Separation System for Speech Signals,” INTERSPEECH, 2011; L. Girin and J. Pinel: “Informed Audio Source Separation from Compressed Linear Stereo Mixtures,” AES 42nd International Conference: Semantic Audio, 2011]. These techniques aim at reconstructing a desired output audio scene or a desired audio source object on the basis of a downmix of channels/objects and additional side information describing the transmitted/stored audio scene and/or the audio source objects in the audio scene.
The estimation and the application of channel/object related side information in such systems is done in a time-frequency selective manner. Therefore, such systems employ time-frequency transforms such as the Discrete Fourier Transform (DFT), the Short Time Fourier Transform (STFT) or filter banks like Quadrature Mirror Filter (QMF) banks, etc. The basic principle of such systems is depicted in FIG. 1, using the example of MPEG SAOC.
In case of the STFT, the temporal dimension is represented by the time-block number and the spectral dimension is captured by the spectral coefficient (“bin”) number. In case of QMF, the temporal dimension is represented by the time-slot number and the spectral dimension is captured by the sub-band number. If the spectral resolution of the QMF is improved by subsequent application of a second filter stage, the entire filter bank is termed hybrid QMF and the fine resolution sub-bands are termed hybrid sub-bands.
As already mentioned above, in SAOC the general processing is carried out in a time-frequency selective way and can be described as follows within each frequency band:                N input audio object signals s1 . . . sN are mixed down to P channels x1 . . . xP as part of the encoder processing using a downmix matrix consisting of the elements d1,1 . . . dN,P. In addition, the encoder extracts side information describing the characteristics of the input audio objects (Side Information Estimator (SIE) module). For MPEG SAOC, the relations of the object powers w.r.t. each other are the most basic form of such a side information.        Downmix signal(s) and side information are transmitted/stored. To this end, the downmix audio signal(s) may be compressed, e.g., using well-known perceptual audio coders such MPEG-1/2 Layer II or III (aka .mp3), MPEG-2/4 Advanced Audio Coding (AAC) etc.        On the receiving end, the decoder conceptually tries to restore the original object signals (“object separation”) from the (decoded) downmix signals using the transmitted side information. These approximated object signals ŝ1 . . . ŝN are then mixed into a target scene represented by M audio output channels ŷ1 . . . ŷM using a rendering matrix described by the coefficients r1,1 . . . rN,M in FIG. 1. The desired target scene may be, in the extreme case, the rendering of only one source signal out of the mixture (source separation scenario), but also any other arbitrary acoustic scene consisting of the objects transmitted.        
Time-frequency based systems may utilize a time-frequency (t/f) transform with static temporal and frequency resolution. Choosing a certain fixed t/f-resolution grid typically involves a trade-off between time and frequency resolution.
The effect of a fixed t/f-resolution can be demonstrated on the example of typical object signals in an audio signal mixture. For example, the spectra of tonal sounds exhibit a harmonically related structure with a fundamental frequency and several overtones. The energy of such signals is concentrated at certain frequency regions. For such signals, a high frequency resolution of the utilized t/f-representation is beneficial for separating the narrowband tonal spectral regions from a signal mixture. In the contrary, transient signals, like drum sounds, often have a distinct temporal structure: substantial energy is only present for short periods of time and is spread over a wide range of frequencies. For these signals, a high temporal resolution of the utilized t/f-representation is advantageous for separating the transient signal portion from the signal mixture.
It would be desirable to take into account the different needs of different types of audio objects regarding their representation in the time-frequency domain when generating and/or evaluating object-specific side information at the encoder side or at the decoder side, respectively.