Signal mixing consists in summing a plurality of signals, referred to as source signals, in order to obtain one or more composite signals, referred to as mixed signals. In audio applications in particular, mixing may consist merely in a step of adding source signals together, or it may also include steps of filtering signals before and/or after adding them together. Furthermore, for certain applications such as compact disc (CD) audio, the source signals may be mixed in different manners in order to form two mixed signals corresponding to the two (left and right) channels or paths of a stereo signal.
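By way of illustration only, the simplest case described above (adding weighted source signals together, without any filtering, to form two stereo channels) can be sketched as follows. The source signals, the gain values, and the names `sources` and `mixing` are assumptions chosen for the example and are not taken from any particular mixing system.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Three hypothetical source signals, one per row (random noise stands in for audio).
sources = rng.standard_normal((3, n_samples))

# Hypothetical 2x3 mixing matrix: row 0 holds the left-channel gains,
# row 1 holds the right-channel gains, one column per source.
mixing = np.array([[0.9, 0.5, 0.1],
                   [0.1, 0.5, 0.9]])

# Each mixed signal is a weighted sum of the source signals.
mixed = mixing @ sources          # shape (2, n_samples)
left, right = mixed
```

In this sketch each source is placed in the stereo image solely by its pair of gains; a mix that includes filtering before or after the addition would replace the scalar gains with filters.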
Separating sources consists in estimating the source signals from an observation of a certain number of different mixed signals made from those source signals. The purpose is generally to enhance one or more target source signals, or indeed, if possible, to extract them completely. Source separation is particularly difficult in situations that are said to be “underdetermined”, in which the number of mixed signals available is less than the number of source signals present in the mixed signals. Extraction is then very difficult or indeed impossible because of the small amount of information available in the mixed signals compared with that present in the source signals. A particularly representative example is constituted by CD audio music signals, since there are only two stereo channels available (i.e. a left mixed signal and a right mixed signal), which two signals are generally highly redundant, and which are formed from a number of source signals that is potentially large.
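The loss of information in the underdetermined case can be illustrated with a short numerical sketch. It assumes a purely linear instantaneous mix of three sources into two channels (the matrix `A` and all signals are hypothetical) and shows that two different sets of source signals can give exactly the same mixed signals, so the mixed signals alone cannot identify the sources.

```python
import numpy as np

# Hypothetical mixing matrix: 2 mixed signals, 3 source signals.
A = np.array([[0.9, 0.5, 0.1],
              [0.1, 0.5, 0.9]])

rng = np.random.default_rng(1)
s = rng.standard_normal((3, 500))        # one set of source signals

# A has rank 2 but 3 columns, so it has a one-dimensional null space.
# With the full SVD, the last row of vh spans that null space.
_, _, vh = np.linalg.svd(A)
null_dir = vh[-1]

# Build a different set of sources by adding an arbitrary signal
# along the null-space direction.
perturbation = rng.standard_normal(500)
s_alt = s + np.outer(null_dir, perturbation)

x = A @ s          # mixed signals from the first set of sources
x_alt = A @ s_alt  # mixed signals from the second set of sources
```

Here `x` and `x_alt` coincide although `s` and `s_alt` differ, which is why additional assumptions or additional information about the sources are needed.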
There exist several types of approach for separating source signals: these include blind separation; computational auditory scene analysis; and separation based on models. Blind separation is the most general form, in which no information is known a priori about the source signals or about the nature of the mixed signals. A certain number of assumptions are then made about the source signals and the mixed signals (e.g. that the source signals are statistically independent), and the parameters of a separation system are estimated by maximizing a criterion based on those assumptions (e.g. by maximizing the independence of the signals obtained by the separator device). Nevertheless, that method is generally used when numerous mixed signals are available (at least as many as there are source signals), and it is therefore not applicable to underdetermined situations in which the number of mixed signals is less than the number of source signals.
Computational auditory scene analysis generally consists in modeling source signals as partials, but the mixed signal is not explicitly decomposed. This method is based on the mechanisms of the human auditory system for separating source signals in the same manner as is done by our ears. Mention may be made in particular of: D. P. W. Ellis, Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis, and its application to speech/non-speech mixture (Speech Communication, 27(3), pp. 281-298, 1999); D. Godsmark and G. J. Brown, A blackboard architecture for computational auditory scene analysis (Speech Communication, 27(3), pp. 351-366, 1999); and also T. Kinoshita, S. Sakai, and H. Tanaka, Musical source signal identification based on frequency component adaptation (In Proc. IJCAI Workshop on CASA, pp. 18-24, 1999). Nevertheless, at present computational auditory scene analysis gives rise to results that are insufficient in terms of the quality of the separated source signals.
Another form of separation relies on decomposition of the mixture on the basis of adaptive functions. There exist two major categories: parsimonious (i.e. sparse) time decomposition and parsimonious frequency decomposition.
For parsimonious time decomposition, the waveform of the mixture is decomposed, whereas for parsimonious frequency decomposition, it is its spectral representation that is decomposed, thereby obtaining a sum of elementary functions referred to as “atoms” constituting elements of a dictionary. Various algorithms can be used for selecting the type of dictionary and the most likely corresponding decomposition. For the time domain, mention may be made in particular of: L. Benaroya, Représentations parcimonieuses pour la séparation de sources avec un seul capteur [Parsimonious representations for separating sources with a single sensor] (Proc. GRETSI, 2001); or P. J. Wolfe and S. J. Godsill, A Gabor regression scheme for audio signal analysis (Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 103-106, 2003). In the method proposed by Gribonval (R. Gribonval and E. Bacry, Harmonic decomposition of audio signals with matching pursuit, IEEE Trans. Signal Proc., 51(1) pp. 101-112, 2003), the decomposition atoms are classified into independent subspaces, thereby enabling groups of harmonic partials to be extracted. One of the restrictions of that method is that generic dictionaries of atoms, such as Gabor atoms for example, that are not adapted to the signals, do not give good results. Furthermore, in order for those decompositions to be effective, it is necessary for the dictionary to contain all of the translated forms of the waveforms of each type of instrument. The decomposition dictionaries then need to be extremely voluminous in order for the projection, and thus the separation, to be effective.
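As a rough illustration of decomposing a signal into a sum of atoms taken from a dictionary, the following sketch implements plain greedy matching pursuit. It is not the harmonic method of Gribonval and Bacry nor any of the other cited algorithms; the random dictionary, the test signal, and the function name `matching_pursuit` are assumptions made for the example.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter):
    """Greedy decomposition of `signal` over unit-norm atoms (columns of `dictionary`)."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        # Correlate the residual with every atom and keep the best match.
        correlations = dictionary.T @ residual
        k = int(np.argmax(np.abs(correlations)))
        # Accumulate the coefficient and remove the atom's contribution.
        coeffs[k] += correlations[k]
        residual -= correlations[k] * dictionary[:, k]
    return coeffs, residual

rng = np.random.default_rng(2)

# Hypothetical overcomplete dictionary: 256 random atoms of length 64, unit norm.
dictionary = rng.standard_normal((64, 256))
dictionary /= np.linalg.norm(dictionary, axis=0)

# Test signal built from two atoms of the dictionary.
signal = 2.0 * dictionary[:, 3] - 1.5 * dictionary[:, 7]

coeffs, residual = matching_pursuit(signal, dictionary, n_iter=20)
```

A generic dictionary such as this one is not translation-invariant: a delayed copy of an atom is in general not itself an atom, which is the reason given above for requiring all translated forms of each waveform in the time domain.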
In order to mitigate that problem of invariance under translation that appears in the time domain, there exist approaches for parsimonious frequency decomposition. Mention may be made in particular of M. A. Casey and A. Westner, Separation of mixed audio sources by independent subspace analysis, Proc. Int. Computer Music Conf., 2000, which introduces independent subspace analysis (ISA). Such analysis consists in decomposing the short-term amplitude spectrum of the mixed signal (calculated by a short-term Fourier transform (STFT)) on the basis of atoms, and then in grouping the atoms together in independent subspaces, each subspace being specific to a source, in order subsequently to resynthesize the sources separately. Nevertheless, that approach is generally limited by several factors: the resolution of STFT spectral analysis; the superposition of sources in the spectral domain; and spectral separation being restricted to amplitude (the phase of the resynthesized signals being that of the mixed signal). It is thus generally difficult to represent the mixed signal as being a sum of independent subspaces because of the complexity of the sound scene in the spectral domain (considerable overlap of the various components) and because of the way the contribution of each component in the mixed signal varies as a function of time. Methods are often evaluated on the basis of “simplified” mixed signals that are well controlled (the source signals are MIDI instruments or are instruments that are relatively easy to separate, and few in number).
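The restriction to amplitude mentioned above can be made concrete with a simplified sketch. It does not implement ISA: the amplitude spectrum is merely split into two hypothetical frequency bands, and each part is rebuilt in the time domain using the phase of the mixed signal. The sampling rate, the frame length, the 1500 Hz split, and the use of non-overlapping rectangular frames in place of a windowed short-term Fourier transform are all assumptions made to keep the example short.

```python
import numpy as np

fs = 16000                      # assumed sampling rate in Hz
frame = 512                     # assumed frame length in samples
t = np.arange(fs) / fs

# Hypothetical mixed signal: a low tone plus a quieter high tone.
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

# Simplified short-term analysis: non-overlapping frames, no window.
n_frames = len(mixture) // frame
frames = mixture[:n_frames * frame].reshape(n_frames, frame)
spec = np.fft.rfft(frames, axis=1)
magnitude = np.abs(spec)
phase = np.angle(spec)          # phase of the mixed signal

# Separation acts on the amplitude only (here: a crude band split).
freqs = np.fft.rfftfreq(frame, d=1 / fs)
mask_low = (freqs < 1500).astype(float)
mag_low = magnitude * mask_low
mag_high = magnitude * (1.0 - mask_low)

# Each estimate is rebuilt with the mixture phase, frame by frame.
est_low = np.fft.irfft(mag_low * np.exp(1j * phase), n=frame, axis=1).ravel()
est_high = np.fft.irfft(mag_high * np.exp(1j * phase), n=frame, axis=1).ravel()
```

Because both estimates share the phase of the mixed signal, they add back up to the mixed signal exactly; when sources overlap in the same frequency bins, however, that shared phase is not the true phase of either source.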
Another method of separating sources is “informed” source separation: information about one or more source signals is transmitted to the decoder together with the mixed signal. On the basis of algorithms and of said information, the decoder is then capable of separating at least one source signal from the mixed signal, at least in part. An example of informed source separation is described by M. Parvaix and L. Girin, Informed source separation of linear instantaneous underdetermined audio mixtures by source index embedding, IEEE Trans. Audio Speech Lang. Process., Vol. 19, pp. 1721-1733, August 2011. The information transmitted to the decoder specifies in particular the two predominant source signals in the mixed signal, for various frequency ranges. Nevertheless, such a method is not always appropriate when more than two source signals exist that are contributing simultaneously in a common frequency range of the mixed signal: under such circumstances, at least one source signal becomes neglected, thereby creating a “spectral hole” in the reconstruction of said source signal.
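The “spectral hole” effect can be illustrated with the following sketch. It is only loosely modeled on the idea of transmitting the indices of the two predominant sources per frequency bin, and is not the method of Parvaix and Girin: the mixing matrix `A`, the random spectra, and the assumption that the decoder knows `A` and inverts a 2x2 system in each bin are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_src, n_bins = 3, 8

# Hypothetical stereo mixing matrix, assumed known to the decoder.
A = np.array([[0.9, 0.5, 0.1],
              [0.1, 0.5, 0.9]])

# Hypothetical complex spectra of three sources, all active in every bin.
S = rng.standard_normal((n_src, n_bins)) + 1j * rng.standard_normal((n_src, n_bins))
X = A @ S                                   # spectra of the two mixed signals

# Encoder side: indices of the two strongest sources in each bin.
top2 = np.argsort(-np.abs(S), axis=0)[:2]   # shape (2, n_bins)

# Decoder side: in each bin, assume only the two indexed sources are
# present and invert the corresponding 2x2 sub-matrix of A.
S_hat = np.zeros_like(S)
for f in range(n_bins):
    idx = top2[:, f]
    S_hat[idx, f] = np.linalg.solve(A[:, idx], X[:, f])
```

In every bin the third source is estimated as zero, which is the hole in its reconstruction, and its energy is absorbed into the estimates of the two indexed sources, even though `A @ S_hat` still reproduces the mixed spectra.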
It is also known, in particular in the field of telecommunications, to filter signals that have been picked up using a plurality of sensors as a function of the positions of said signals in three-dimensional space relative to said sensors. That constitutes spatial filtering (also known as “beamforming”), which serves to give precedence to the signal coming from a given spatial direction, while filtering out signals coming from other directions. Linearly constrained minimum variance (LCMV) spatial filters are one example of such filters. An example of such a filter is disclosed in particular in Document EP 1 633 121.
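For reference, the textbook closed-form LCMV solution minimizes the output power w^H R w subject to linear constraints C^H w = f, giving w = R^-1 C (C^H R^-1 C)^-1 f. The sketch below applies that formula to a hypothetical narrowband uniform linear array; the array geometry, the angles, the covariance matrix `R`, and the function names are assumptions for illustration and are not taken from Document EP 1 633 121.

```python
import numpy as np

def steering_vector(n_sensors, angle_rad, spacing_wavelengths=0.5):
    """Narrowband steering vector of a uniform linear array (assumed geometry)."""
    k = np.arange(n_sensors)
    return np.exp(-2j * np.pi * spacing_wavelengths * k * np.sin(angle_rad))

def lcmv_weights(R, C, f):
    """Closed-form LCMV weights: w = R^-1 C (C^H R^-1 C)^-1 f."""
    Rinv_C = np.linalg.solve(R, C)
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)

n_sensors = 6
a_target = steering_vector(n_sensors, np.deg2rad(0.0))    # look direction
a_interf = steering_vector(n_sensors, np.deg2rad(40.0))   # direction to reject

# Hypothetical spatial covariance: one strong interferer plus sensor noise.
R = 10.0 * np.outer(a_interf, a_interf.conj()) + 0.1 * np.eye(n_sensors)

# Constraints: unit gain toward the target, zero gain toward the interferer.
C = np.column_stack([a_target, a_interf])
f = np.array([1.0, 0.0], dtype=complex)

w = lcmv_weights(R, C, f)

# Beamformer response in each constrained direction (w^H a).
gain_target = np.vdot(w, a_target)
gain_interf = np.vdot(w, a_interf)
```

With these constraints the filter passes the look direction with unit gain and places a null on the interfering direction, while the remaining degrees of freedom are used to minimize the output power.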