A soundtrack of a movie or a TV show consists of dialogue superimposed with special audio effects and/or music. For an old movie, the soundtrack is a mixture of at least two of these components. Thus, if one wishes to broadcast the movie in a version other than the original one, one may need to separate the dialogue component from the background component in the original soundtrack. Doing so makes it possible to add, onto an isolated background component, a dubbed dialogue in a different language in order to produce a new soundtrack.
In some situations, the producers of a movie may have a license to broadcast a piece of music only in a particular country or region, or only for a limited period of time. It may be illegal to broadcast a movie whose soundtrack does not conform to the contract terms. To broadcast the movie, it may then be necessary to separate the dialogue component of the soundtrack from the background component, so that the isolated original dialogue can be added to a new piece of music to produce a new soundtrack.
In the general field of audio signal processing, source separation has been an important topic during the past decade. In the prior art, audio source separation was first addressed in a blind context. Non-negative matrix factorization (NMF) has been widely used in this context. For instance, the document by T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1066-1074, March 2007, discloses an NMF method for source separation. However, one of the main drawbacks of this technique is the difficulty of clustering the factorized elements and associating them with a particular source.
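As an illustration of the blind NMF setting described above, the following is a minimal sketch, not the method of the cited document. It assumes the standard multiplicative updates for the Euclidean cost and a synthetic magnitude spectrogram; the function name `nmf`, the rank and the toy data are all hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (frequency x time) as W @ H.

    Illustrative multiplicative updates for the Euclidean cost;
    the temporal-continuity and sparseness criteria are not modeled.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps    # spectral templates (columns)
    H = rng.random((rank, n_frames)) + eps  # activations over time (rows)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram" built from two hypothetical sources with fixed spectra.
rng = np.random.default_rng(1)
true_W = rng.random((64, 2))
true_H = rng.random((2, 100))
V = true_W @ true_H

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The factorization returns unlabeled columns of `W` in an arbitrary order. Nothing in the decomposition indicates which column belongs to which source, which is the clustering difficulty noted above.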
More recently, numerous works have proposed adding extra information to NMF methods to improve results. In the particular field of musical source separation (i.e. separation of an instrument from a band or orchestra), an algorithm was proposed in which the different spectral shapes of each source are learned on isolated sounds and then used to decompose the mixture. In another work, a MIDI file is used to guide the separation of instruments in music pieces.
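The idea of learning spectral shapes on isolated sounds and then using them to decompose the mixture can be sketched as follows. This is only an assumed, simplified form of such an approach: the templates `W_a` and `W_b` are taken as already learned (here they are random toy matrices), only the activations are estimated, and the sources are recovered with a soft mask. All function and variable names are hypothetical.

```python
import numpy as np

def fit_activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Estimate non-negative activations H with W held fixed (Euclidean cost)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def separate(V, W_a, W_b, eps=1e-9):
    """Split mixture V into two estimates using pre-learned templates."""
    W = np.hstack([W_a, W_b])
    H = fit_activations(V, W)
    k = W_a.shape[1]
    V_a = W_a @ H[:k]                  # model of source A
    V_b = W_b @ H[k:]                  # model of source B
    mask_a = V_a / (V_a + V_b + eps)   # soft mask with values in [0, 1]
    return mask_a * V, (1.0 - mask_a) * V

# Toy data: templates assumed to have been learned on isolated sounds.
rng = np.random.default_rng(2)
W_a = rng.random((64, 3))
W_b = rng.random((64, 3))
S_a = W_a @ rng.random((3, 80))        # magnitude spectrogram of source A
S_b = W_b @ rng.random((3, 80))        # magnitude spectrogram of source B
V = S_a + S_b                          # observed mixture

est_a, est_b = separate(V, W_a, W_b)
```

Because each template is tied to a known source in advance, the assignment problem of the blind case does not arise; the cost is that isolated training sounds for every source must be available.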
In the particular field of separating speech from background noise, one proposal has been to use a guide sound signal that mimics the dialogue component of the mixture signal in order to guide the separation process. More particularly, the guide signal corresponds to a recording of the voice of a speaker dubbing the target dialogue component that is to be separated. The document by P. Smaragdis and G. Mysore, “Separation by Humming: User-Guided Sound Extraction from Monophonic Mixture,” in Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y., USA, October 2009, proposed such an approach. In this document, the authors use a process based on Probabilistic Latent Component Analysis (PLCA). The guide signal, which mimics the dialogue component to be extracted from the audio mixture signal, is supplied as an input to the PLCA.
The document by L. Le Magoarou et al., “Text-Informed Audio Source Separation Using Nonnegative Matrix Partial Co-Factorization,” in IEEE International Workshop on Machine Learning for Signal Processing, Southampton, UK, September 2013, discloses an algorithm based on a source-filter model for vocal production, applied both to the dialogue contribution of the mixture signal and to the guide signal. The algorithm models time misalignments and equalization differences, but not pitch differences, between the guide signal and the dialogue contribution of the mixture signal.