The soundtrack of a song is composed of a vocal component (the lyrics sung by one or more singers) and a musical component (the musical accompaniment or background played by one or more instruments). The soundtrack of a film has a vocal component (dialogue between actors) superimposed on a musical component (sound effects and/or background music). In certain instances, the vocal component needs to be separated from the musical component of a soundtrack. For example, in a film, one may need to isolate the background component from the vocal component in order to combine it with dialogue dubbed in a different language and thereby produce a new soundtrack.
Several algorithms that aim at separating the vocal component from the musical component exist in the literature. For example, the article by Jean-Louis Durrieu et al., “An Iterative Approach to Monaural Musical Mixture De-Soloing,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 105-108, discloses a source separation algorithm for under-determined conditions, based on a Non-negative Matrix Factorization (NMF) framework, that specifically allows the vocal contribution to be separated from the music background contribution. However, known separation algorithms do not explicitly and properly deal with the reverberation effects that affect the components of the mixture.
In the particular case of a vocal component, the reverberated voice results from the superposition of the dry voice, corresponding to the recording of the sound produced by the singer that propagates directly to the microphone, and the reverberation, corresponding to the recording of the sound produced by the singer that arrives at the microphone indirectly, i.e. after one or more reflections on the walls of the recording room. The reverberation, composed of echoes of the dry voice emitted at earlier instants, spreads over a time interval that may be significant (e.g. three seconds). Stated otherwise, at a given instant, the vocal component results from the superposition of the dry voice at this instant and the various echoes of the dry voice at preceding instants.
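By way of illustration only, the superposition described above can be sketched as a sum of the dry signal and delayed, attenuated copies of itself. The function name `reverberate`, the parameters `echo_delays` and `echo_gains`, and the use of a small number of discrete echoes are assumptions made for this sketch; they are not taken from the cited articles, and a real room response contains a dense series of reflections.

```python
import numpy as np


def reverberate(dry, echo_delays, echo_gains):
    """Superpose a dry signal with delayed, attenuated echoes of itself.

    dry         : 1-D array of samples of the dry signal (hypothetical input).
    echo_delays : delay of each echo, in samples (each must be >= 1).
    echo_gains  : attenuation factor applied to each echo.
    """
    dry = np.asarray(dry, dtype=float)
    wet = dry.copy()  # direct path: the dry signal at the current instant
    for delay, gain in zip(echo_delays, echo_gains):
        if delay < 1:
            raise ValueError("each echo delay must be at least one sample")
        # Sample n of the output receives the dry sample emitted at n - delay.
        wet[delay:] += gain * dry[:-delay]
    return wet


# A single impulse followed by two echoes, 2 and 4 samples later.
dry = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
wet = reverberate(dry, echo_delays=[2, 4], echo_gains=[0.5, 0.25])
print(wet)
```

Under these assumptions the operation is equivalent to convolving the dry signal with an impulse response whose first coefficient is 1 (the direct path) and whose later coefficients are the echo gains, which is why an echo spread over several seconds gives each output sample a long memory of past dry samples.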
Existing separation algorithms do not take into account the long-term effects of reverberation affecting a component of the mixture of acoustic signals. The article by Ngoc Q. K. Duong, Emmanuel Vincent, and Rémi Gribonval, “Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, no. 7, pp. 1830-1840, September 2010, focuses on the instantaneous effects of reverberation related to spatial diffusion, but does not model memory effects, i.e. the delay between the recording of a dry sound and the recording of the echoes associated with that dry sound. Moreover, the type of algorithm proposed by the authors of the article applies only to multi-channel signals and does not allow for a correct extraction of the reverberation effects that are common in music. Consequently, the reverberation that affects a specific component, for example the vocal component, is distributed among the various components obtained after the separation. As a result, the separated vocal component loses its richness and the separated musical accompaniment component is of poor quality.