Video surveillance is currently widespread in applications such as crime prevention, security of private and public areas, abnormal event detection, traffic monitoring, customer behaviour analysis and general data gathering.
Most camera applications rely on video only. However, there is an interest in recording other complementary data as well, in particular audio data.
Most current cameras are equipped with microphones. However, the audio stream is rarely used today. This is essentially due to the particular constraints of video surveillance environments. First, video surveillance is typically used in complex audio environments, including noisy environments with many simultaneous sound sources. Secondly, it is not possible to focus on specific sources of interest.
For example, conversations of interest between people may be drowned out by ambient noise, so that the audio stream is generally not usable.
Therefore, solutions that make it possible to focus on specific sources and separate them from ambient noise are desirable. Sound source separation techniques are therefore of interest in the context of video surveillance.
Several sound source separation methods have been developed in the last decades. However, none addresses the video surveillance context.
“Classical” signal processing methods such as Binary Masking (BM) and Independent Component Analysis (ICA) were used first. However, their efficiency is very limited, and none is usable in the typical noisy environments of video surveillance.
Statistical signal processing methods have been developed more recently. The most advanced ones are known as “variance-based” methods. They are much more efficient and are more robust to noise, as compared to the classical methods. However, they need a particular initialization known as “training” of the sources. It consists in learning the audio signature of each source of interest to be separated. The training necessitates an individual recording of each target source alone, which is not possible in the video surveillance context.
FIGS. 1a-1b illustrate a “variance-based” method. Such a method is Bayesian. The general principle is to use statistics to infer causes from all available information, i.e., in the case of sound separation, to retrieve sources from a sound “mixture” (comprising the plurality of sound signals to be separated and the ambient noise) and additional information. Variance-based methods operate on the variance of the signals rather than on the signals themselves, since the variance is easier to manipulate and separate than the source signals.
FIG. 1a illustrates a sound source separation context. A plurality of sources “source 1”, . . . , “source n” emit respective sound signals represented, for each time-frequency bin (f,n), by their respective power spectra $v_1, \ldots, v_n$. The sound signal of the ambient noise is represented by a power spectrum $v_{noise}$. The set of power spectra $v_1, \ldots, v_n$ and $v_{noise}$ is the “source model”. The propagation of the sound signals, for example in a room, is represented by spatial covariance matrices $R_1, \ldots, R_n$, $R_{noise}$, one for each of the power spectra $v_1, \ldots, v_n$ and $v_{noise}$. The set of spatial covariances $R_1, \ldots, R_n$, $R_{noise}$ is the “source propagation model”. The source model and the source propagation model represent the source parameters. The sound signals emitted by the sources and the ambient noise are mixed and the mixture $R_{noise} v_{noise} + \sum_{j=1}^{J} R_j v_j$ is captured by a microphone array.
Sound separation systems usually comprise a plurality of microphones configured for capturing the audio mixture. The microphones of the plurality are disposed according to a spatial configuration referred to as an “array”. The microphone array records the mixture of audio signals from the audio sources. Each audio signal is predominant relative to the others at some time-frequency bins when the corresponding source “speaks”. Also, the microphones are situated at different positions. In addition, the audio signals received by the microphones are characterized by the respective spectral contents of the audio signals emitted.
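By way of illustration only, the mixture model of FIG. 1a may be sketched numerically for a single time-frequency bin. The function name `mixture_covariance`, the array shapes and the numerical values below are assumptions made for this example and do not appear in the figures; the ambient noise is simply treated as one more source.

```python
import numpy as np

def mixture_covariance(R, v):
    """Return the sum of R_j * v_j over all sources at one time-frequency bin.

    R: (J, I, I) spatial covariance matrices, one per source, for I microphones.
    v: (J,) non-negative power spectra, one per source.
    The ambient noise is counted as the last source (an assumption of this sketch).
    """
    R = np.asarray(R)
    v = np.asarray(v, dtype=float)
    # Weight each spatial covariance by its power and sum over the sources.
    return np.einsum("j,jab->ab", v, R)

# Hypothetical values: two sources plus noise, two microphones.
R = np.stack([
    np.array([[1.0, 0.5], [0.5, 1.0]]),    # source 1
    np.array([[1.0, -0.3], [-0.3, 1.0]]),  # source 2
    np.eye(2),                             # spatially white ambient noise
])
v = np.array([2.0, 1.0, 0.1])
sigma_x = mixture_covariance(R, v)
```

In this sketch the off-diagonal terms of `sigma_x` carry the inter-microphone correlation contributed by each source, which is the kind of spatial information the source propagation model is meant to capture.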
The mixture is represented by a sum of elementary signals. The sound source separation aims at recovering the elementary signals.
In the present case of sound source separation, the “causes” are the signal of each sound source, and the available information is:
- the sound mixture (which can be measured by the microphone array), and
- other additional cues about sound source characteristics (e.g. the location of the sources, the spectrum, which can be learned through training, etc.).
This available information can be used to retrieve source signals only if there are means for associating the causes with consequences. A model is needed between source signals and the corresponding mixture, i.e. the sound propagation model.
Therefore, in addition to the mixture, the variance-based methods require (as explained hereinafter with reference to FIG. 1b):
- a sound propagation model (a),
- some additional cues, which are used to initialize the model (b), and
- some additional optimization steps to refine the corresponding results (c), because initialization is usually not perfect.
In the case of variance-based methods, the sound source propagation model is usually a robust variance representation.
FIG. 1b is a general flowchart of steps for separating said signals in the context of FIG. 1a. 
In an initialization step 100, the model (comprising the source model and the source propagation model) is initialized. The initialization step may be seen as a “first guess”. The aim is to start from source signals which are not too far from the real source signals that are to be separated from the mixture. This first guess is obtained from the cues. In order to obtain the cues, a training step 101 is performed. Training methods usually consist in recording each source individually, thereby extracting the “signature” (the spectrum) of each source.
During a step 102, the signals are first separated based on the model as initialized. Next, an iterative optimization process 103 takes place.
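The text does not specify how the separated signals are computed from the model. A minimal sketch, assuming a multichannel Wiener filter (a common choice with variance-based models, adopted here purely as an assumption), is given below for a single time-frequency bin. The function name `wiener_separate` and all numerical values are hypothetical.

```python
import numpy as np

def wiener_separate(x, R, v):
    """Estimate each source signal at one time-frequency bin.

    x: (I,) complex mixture observed on I microphones.
    R: (J, I, I) spatial covariance matrices (noise counted as a source).
    v: (J,) positive power spectra at this bin.
    Returns an array of shape (J, I), one estimate per source.
    """
    sigma_j = v[:, None, None] * R      # per-source covariances R_j v_j
    sigma_x = sigma_j.sum(axis=0)       # modelled mixture covariance
    # Solve sigma_x z = x rather than forming an explicit inverse.
    z = np.linalg.solve(sigma_x, x)
    return sigma_j @ z                  # (J, I, I) @ (I,) -> (J, I)

# Hypothetical example: two sources plus noise, two microphones.
R = np.stack([
    np.array([[1.0, 0.8], [0.8, 1.0]]),
    np.array([[1.0, -0.6], [-0.6, 1.0]]),
    np.eye(2),
])
v = np.array([4.0, 1.0, 0.05])
x = np.array([1.0 + 0.5j, 0.7 - 0.2j])
y_hat = wiener_separate(x, R, v)
```

With this filter the estimates add up to the observed mixture, so the quality of the separation depends entirely on how well `R` and `v` describe the real sources, which is why the initialization and the subsequent optimization matter.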
The optimization is needed because the “first guess” initialization does not lead to the source signals to be obtained (separated). Also, the initialization leads to an estimated mixture (comprising the first separated signals) which is different from the real recorded mixture. Optimization techniques are used for modifying this first guess so as to obtain an estimated mixture which is closer to the real mixture. By doing so, separated signals are obtained which are closer to the real signals measured.
The mixture is measured by the microphone array and optimized source parameters are computed during a step 104. The optimized source parameters are then fed to the model and so on during the iterative process.
When the model has “converged” to an acceptable model, the final source parameters are post-processed during a step 105 in order to obtain the final separated sound source signals.
A popular optimization method for (c) is the “expectation-maximization” (“EM”) method, which is well known to the skilled person. It is an iterative mathematical optimization method which modifies the signals so as to obtain more probable signals at each step, until it converges. This optimization method leads to a realistic separation only if the initialization is not too far from the real signal. Otherwise, it converges to an irrelevant mixture. This means that the quality of the initialization step is crucial for the quality of the separation, and that the initialization requires robust cues.
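For illustration, one EM iteration at a single frequency may be sketched as follows. The update rules are those commonly published for full-rank spatial covariance models and are an assumption of this sketch, not a description of the method of FIG. 1b; the function name `em_step`, the regularization constant `eps` and the random test data are likewise hypothetical.

```python
import numpy as np

def em_step(X, R, v, eps=1e-9):
    """One EM iteration at a single frequency.

    X: (N, I) complex mixture frames on I microphones.
    R: (J, I, I) spatial covariances; v: (J, N) power spectra.
    Returns updated (R, v) and the source estimates (J, N, I) of the E-step.
    """
    I = X.shape[1]
    eye = np.eye(I)
    # Per-source, per-frame covariances R_j v_jn and the modelled mixture covariance.
    sigma_j = v[:, :, None, None] * R[:, None, :, :]       # (J, N, I, I)
    sigma_x = sigma_j.sum(axis=0) + eps * eye              # (N, I, I)
    # E-step: Wiener gains, source estimates and posterior second moments.
    gain = sigma_j @ np.linalg.inv(sigma_x)[None]          # (J, N, I, I)
    Y = np.einsum("jnab,nb->jna", gain, X)                 # (J, N, I)
    R_hat = (np.einsum("jna,jnb->jnab", Y, Y.conj())
             + (eye - gain) @ sigma_j)                     # (J, N, I, I)
    # M-step: refresh the power spectra, then the spatial covariances.
    R_inv = np.linalg.inv(R + eps * eye)                   # (J, I, I)
    v_new = np.einsum("jab,jnba->jn", R_inv, R_hat).real / I
    v_new = np.maximum(v_new, eps)                         # keep powers positive
    R_new = (R_hat / v_new[:, :, None, None]).mean(axis=1)
    return R_new, v_new, Y

# Hypothetical data: 50 frames, 2 microphones, 2 sources, random first guess.
rng = np.random.default_rng(0)
N, I, J = 50, 2, 2
X = rng.standard_normal((N, I)) + 1j * rng.standard_normal((N, I))
R = np.stack([np.eye(I)] * J).astype(complex)
v = rng.uniform(0.5, 1.5, size=(J, N))
for _ in range(10):
    R, v, Y = em_step(X, R, v)
```

Because each iteration only refines the current parameters, a first guess far from the real sources can steer the loop towards a solution that explains the mixture without matching the actual sources.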
The cues are obtained through training, which leads to the extraction of some source information. This helps to initialize some parts of the signal but, due to the nature of the signal itself, only a part of the signal can be initialized through training.
The mixture and the sources are represented as a set of elementary time-frequency elements, also known as time-frequency bins (f,n), wherein f represents the frequency and n the time. When dealing with time-frequency bins, the notation (f,n) is used. The notation fn may also be used as an abbreviation. Each individual source j to be separated is represented by a signal $y_{j,fn}$ for the frequency f and the time n. The mixture $x_{fn}$ is represented for frequency f and time n by the sum of all sources (the noise is considered as a source):
$$x_{fn} = \sum_{j=1}^{J} y_{j,fn}.$$
Each signal variance $y_{j,fn} y_{j,fn}^H$ can be split into two parts (or matrices in the mathematical representation), one time-independent part $R_{j,f}$ and one time-dependent part $v_{j,fn}$:
$$y_{j,fn} y_{j,fn}^H = R_{j,f} v_{j,fn},$$
where $R_{j,f}$ is the spatial covariance matrix and $v_{j,fn}$ is the power spectrum.
The time-dependent part $v_{j,fn}$ can be further split into a physically meaningful representation, i.e. into three different parts:
$$y_{j,fn} y_{j,fn}^H = R_{j,f} v_{j,fn} = R_{j,f} \odot \text{Global spectrum} \odot \text{Instantaneous spectrum} \odot \text{Activity} = R_{j,f} F_{j,f} W_{j,fn} T_{j,n},$$
where ⊙ corresponds to the entry-wise matrix multiplication.
Thus, the signal variance can be defined through four elements:
- the $R_{j,f}$ part, which corresponds to the sound propagation effect and depends only on the source position relative to the microphone position. It does not depend on time.
- the “Global spectrum” part $F_{j,f}$, which corresponds to the intrinsic spectrum of the source, i.e. the source signature. It depends only on the source identity, not on the content of the signal itself. It does not depend on time.
- the “Instantaneous spectrum” part $W_{j,fn}$, which corresponds to the instantaneous spectrum related to the content of the signal itself. It changes continuously while a source is active. It depends on time.
- the “Activity” part $T_{j,n}$, which corresponds to the instantaneous energy of the signal emitted by a source. It depends on time.
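The four-part decomposition above can be sketched with array broadcasting for one source j. The function names `power_spectrum` and `source_variance`, the array shapes and the numerical values are assumptions chosen for this example only.

```python
import numpy as np

def power_spectrum(F, W, T):
    """Entry-wise product of the three spectral parts for one source.

    F: (n_freq,) global spectrum, W: (n_freq, n_frames) instantaneous
    spectrum, T: (n_frames,) activity. Returns v of shape (n_freq, n_frames).
    """
    return F[:, None] * W * T[None, :]

def source_variance(R, F, W, T):
    """Variance R_{j,f} v_{j,fn} for every bin; R has shape (n_freq, I, I)."""
    v = power_spectrum(F, W, T)                     # (n_freq, n_frames)
    return R[:, None, :, :] * v[:, :, None, None]   # (n_freq, n_frames, I, I)

# Hypothetical source: 2 frequencies, 3 frames, 2 microphones.
F = np.array([1.0, 2.0])
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
T = np.array([1.0, 0.0, 2.0])      # the source is silent in the second frame
R = np.stack([np.eye(2), np.eye(2)])
v = power_spectrum(F, W, T)
cov = source_variance(R, F, W, T)
```

In this example the zero activity in the second frame cancels the variance of the source at every frequency of that frame, regardless of the other three parts.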
A classical training consists in recording each source individually. However, the content of a signal of a source during training is different from the content of the signal of the same source in the mixture to be processed for separation. This means that, for each source j, the variance $R_{j,f} v_{j,fn}^{\text{training}}$ during training is different from the variance $R_{j,f} v_{j,fn}^{\text{mixture}}$ in the mixture. Since $R_{j,f}$ is constant, this also means that the power spectrum $v_{j,fn}^{\text{training}}$ during training is different from the power spectrum $v_{j,fn}^{\text{mixture}}$ in the mixture. Only the parts which do not depend on time can be determined through training, i.e. the spatial part $R_{j,f}$ and the global spectrum part $F_{j,f}$. The time-dependent parts of the power spectrum $v_{j,fn}$ are initialized randomly.
Variance-based methods require training. However, the training-based initialization approach suffers from several drawbacks.
The initialization is incomplete since only two of the four parts of the signal variance are initialized based on real cues (the two parts $R_{j,f}$ and $F_{j,f}$, which are independent of time). The other two parts are randomly initialized.
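The incomplete initialization described above may be sketched as follows for one source: the two time-independent parts come from training, while the two time-dependent parts are drawn at random. The function name `initialize_source`, the uniform distribution and its bounds, and the trained values are all assumptions of this sketch.

```python
import numpy as np

def initialize_source(R_trained, F_trained, n_frames, rng):
    """Build a first guess for one source.

    R_trained: (n_freq, I, I) spatial covariances obtained through training.
    F_trained: (n_freq,) global spectrum obtained through training.
    The instantaneous spectrum W and the activity T cannot be learned from
    training and are therefore drawn at random (distribution is an assumption).
    """
    n_freq = F_trained.shape[0]
    W0 = rng.uniform(0.1, 1.0, size=(n_freq, n_frames))
    T0 = rng.uniform(0.1, 1.0, size=n_frames)
    v0 = F_trained[:, None] * W0 * T0[None, :]   # initial power spectrum
    return {"R": R_trained, "F": F_trained, "W": W0, "T": T0, "v": v0}

# Hypothetical trained cues: 3 frequencies, 2 microphones, 4 frames to separate.
rng = np.random.default_rng(0)
R_trained = np.stack([np.eye(2)] * 3)
F_trained = np.array([1.0, 0.5, 0.25])
params = initialize_source(R_trained, F_trained, n_frames=4, rng=rng)
```

Only `R` and `F` in the returned dictionary reflect the real source; `W` and `T` carry no information about it until the optimization has refined them.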
Also, training is not practical since the sources need to be recorded individually, without any other source or noise. In many cases, including video surveillance, this may not be possible at all.
Thus, there is a need for enhanced sound source separation techniques, in particular in the context of video surveillance.
The present invention lies within this context.