The task of separating a mixture of superimposed sound sources into its constituent components has gained importance in digital audio signal processing. In speech processing, these components are usually the utterances of target speakers that are interfered with by noise or by other persons speaking simultaneously. In music, these components can be individual instrumental or vocal melodies, percussive instruments, or even individual note events. Relevant topics are signal reconstruction, transient preservation, and score-informed audio decomposition (i.e., source separation).
Music source separation aims at decomposing a polyphonic, multitimbral music recording into component signals such as singing voice, instrumental melodies, percussive instruments, or individual note events occurring in a mixture signal. Besides being an important step in many music analysis and retrieval tasks, music source separation is also a fundamental prerequisite for applications such as music restoration, upmixing, and remixing. For these purposes, high fidelity in terms of perceptual quality of the separated components is desirable. The majority of existing separation techniques work on a time-frequency (TF) representation of the mixture signal, often the Short-Time Fourier Transform (STFT). The target component signals are usually reconstructed using a suitable inverse transform, which in turn can introduce audible artifacts such as musical noise, smeared transients, phase interference, or pre-echos. These artifacts are often quite disturbing for the human listener.
There are a number of recent papers on music source separation. In most approaches, the separation is carried out in the time-frequency (TF) domain by modifying the magnitude spectrogram. The corresponding time-domain signals of the separated components are derived by using the original phase information and applying suitable inverse transforms. When striving for good perceptual quality of the separated solo signals, many authors resort to score-informed decomposition techniques. This has the advantage that the separation can be guided by information on the approximate location of component signals in time (onset, offset) and frequency (pitch, timbre). Fewer publications deal with source separation of transient signals such as drums. Others have focused on the separation of harmonic vs. percussive components [5].
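The magnitude-modification scheme described above can be illustrated with a minimal sketch. It is not taken from any of the cited works. The helper functions `stft`, `istft`, and `separate_with_mixture_phase` are hypothetical names, the window and hop sizes are arbitrary choices, and the target magnitudes are assumed to be known (in a score-informed system they would instead come from a model guided by the score).

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """STFT with a periodic Hann window; returns shape (frames, bins)."""
    win = np.hanning(n_fft + 1)[:-1]
    xp = np.pad(x, n_fft)  # zero-pad both ends so the edges are fully covered
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, length, n_fft=1024, hop=256):
    """Weighted overlap-add inverse of stft(); returns `length` samples."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(X, n=n_fft, axis=1) * win
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-12)
    return out[n_fft:n_fft + length]  # drop the padding added in stft()

def separate_with_mixture_phase(mix, target_mag, other_mag, n_fft=1024, hop=256):
    """Modify the mixture magnitude with a soft mask and reuse the mixture phase.

    target_mag / other_mag are magnitude estimates for the wanted component and
    for everything else (assumed given here, purely for illustration).
    """
    X = stft(mix, n_fft, hop)
    mask = target_mag ** 2 / (target_mag ** 2 + other_mag ** 2 + 1e-12)
    est_mag = mask * np.abs(X)                   # modified magnitude spectrogram
    X_est = est_mag * np.exp(1j * np.angle(X))   # original (mixture) phase
    return istft(X_est, len(mix), n_fft, hop)
```

For two components that barely overlap in frequency, such as two sinusoids far apart, this kind of masking recovers each component closely. For overlapping or transient components, the reused mixture phase no longer matches the modified magnitude, and this mismatch is one way the artifacts mentioned above can arise.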
Moreover, the problem of pre-echos has been addressed in the field of perceptual audio coding, where pre-echos are typically caused by the use of relatively long analysis and synthesis windows in conjunction with intermediate manipulation of TF bins such as quantization of spectral magnitudes according to a psycho-acoustic model. It can be considered state-of-the-art to use block-switching in the vicinity of transient events [6]. An interesting approach was proposed in [13] where spectral coefficients are encoded by linear prediction along the frequency axis, automatically reducing pre-echos. Later works proposed to decompose the signal into transient and residual components and use optimized coding parameters for each stream [3]. Transient preservation has also been investigated in the context of time-scale modification methods based on the phase-vocoder. In addition to optimized treatment of the transient components, several authors follow the principle of phase-locking or re-initialization of phase in transient frames [8].
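The link between window length and pre-echo can be made concrete with a small experiment. The sketch below is only an illustration under stated assumptions: it uses a crude uniform quantizer in place of a psycho-acoustic model, a synthetic noise burst as the transient, and hypothetical helper names (`stft`, `istft`, `quantize_and_resynthesize`). It does not implement block-switching or any of the cited coding schemes.

```python
import numpy as np

def stft(x, n_fft, hop):
    """STFT with a periodic Hann window; returns shape (frames, bins)."""
    win = np.hanning(n_fft + 1)[:-1]
    xp = np.pad(x, n_fft)  # zero-pad both ends
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, length, n_fft, hop):
    """Weighted overlap-add inverse of stft(); returns `length` samples."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(X, n=n_fft, axis=1) * win
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-12)
    return out[n_fft:n_fft + length]

def quantize_and_resynthesize(x, n_fft, step=0.05):
    """Coarsely quantize STFT magnitudes, keep the phase, and resynthesize."""
    hop = n_fft // 4
    X = stft(x, n_fft, hop)
    mag = np.abs(X)
    q = step * mag.max()                 # quantizer step (illustrative choice)
    mag_q = np.round(mag / q) * q
    return istft(mag_q * np.exp(1j * np.angle(X)), len(x), n_fft, hop)

# Silence followed by a decaying noise burst starting at sample `onset`.
rng = np.random.default_rng(0)
onset = 2048
x = np.zeros(4096)
x[onset:] = rng.standard_normal(2048) * np.exp(-np.arange(2048) / 300)

y_long = quantize_and_resynthesize(x, n_fft=2048)   # long window
y_short = quantize_and_resynthesize(x, n_fft=128)   # short window

# Energy that appears more than 128 samples before the onset.
pre = slice(0, onset - 128)
pre_echo_long = np.sum(y_long[pre] ** 2)
pre_echo_short = np.sum(y_short[pre] ** 2)
```

In this setup, every short-window frame that touches the region more than 128 samples before the onset contains only silence, so `pre_echo_short` stays at zero, while the long-window frames spread the quantization error far ahead of the onset. Restricting the spread of such errors in time is the motivation for switching to shorter blocks near transients.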
The problem of signal reconstruction, also known as magnitude spectrogram inversion or phase estimation, is a well-researched topic. In their classic paper [1], Griffin and Lim proposed the so-called LSEE-MSTFTM algorithm for iterative, blind signal reconstruction from modified STFT magnitude (MSTFTM) spectrograms. In [2], Le Roux et al. developed a different view on this method by describing it using a TF consistency criterion. By keeping the operations entirely in the TF domain, several simplifications and approximations could be introduced that lower the computational load compared to the original procedure. Since the phase estimates obtained using LSEE-MSTFTM can only converge to local optima, several publications were concerned with finding a good initial estimate for the phase information [3, 4]. Sturmel and Daudet [5] provided an in-depth review of signal reconstruction methods and pointed out unsolved problems. An extension of LSEE-MSTFTM with respect to convergence speed was proposed in [6]. Other authors tried to formulate the phase estimation problem as a convex optimization problem and arrived at promising results hampered by high computational complexity [7]. Another work [8] was concerned with applying the spectrogram consistency framework to signal reconstruction from wavelet-based magnitude spectrograms.
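The iterative reconstruction idea behind LSEE-MSTFTM can be sketched as follows. This is a minimal illustration rather than a faithful reproduction of [1]: the helper names (`stft`, `istft`, `griffin_lim`), the random phase initialization, the window and hop sizes, and the relative error measure are all assumptions made for the example.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """STFT with a periodic Hann window; returns shape (frames, bins)."""
    win = np.hanning(n_fft + 1)[:-1]
    xp = np.pad(x, n_fft)  # zero-pad both ends
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, length, n_fft=1024, hop=256):
    """Least-squares style inverse: windowed overlap-add divided by sum of win^2."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(X, n=n_fft, axis=1) * win
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-12)
    return out[n_fft:n_fft + length]

def griffin_lim(mag, length, n_iter=50, n_fft=1024, hop=256, seed=0):
    """Estimate a time signal whose STFT magnitude approximates `mag`.

    Returns the signal and the relative magnitude mismatch per iteration.
    """
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=mag.shape)  # blind (random) start
    errors = []
    for _ in range(n_iter):
        # Combine the target magnitude with the current phase and go to time domain.
        x = istft(mag * np.exp(1j * phase), length, n_fft, hop)
        # Re-analyze; the result is a consistent STFT of an actual signal.
        C = stft(x, n_fft, hop)
        errors.append(np.linalg.norm(np.abs(C) - mag) / np.linalg.norm(mag))
        # Keep only the phase and impose the target magnitude again next round.
        phase = np.angle(C)
    return x, errors
```

A typical call would be `x, errors = griffin_lim(np.abs(stft(signal)), len(signal))`. The mismatch in `errors` should shrink over the iterations, but the procedure only settles in a local optimum that depends on the starting phase, which is why the initial phase estimate matters.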
However, the described approaches for signal reconstruction share a common issue: where the audio signal changes rapidly, as is typical for transients, the reconstructed signal may suffer from the artifacts described earlier, such as pre-echos.
Therefore, there is a need for an improved approach.