Blind source separation refers to the process of separating a composite signal into its original component signals without prior knowledge of their characteristics. This process is useful in speech recognition, multipath channel identification and equalization, improvement of the signal to interference ratio (SIR) of acoustic recordings, in surveillance applications and in the operation of hearing aids.
Blind source separation of broad band signals in a multipath environment remains a difficult problem which has a number of ambiguities. Increasing the number of sensors allows improved performance but leads to ambiguities in the choice of the separating filters. There are in theory multiple filters that invert the responses in a room because there are multiple projections from the space containing microphone signals into the smaller space of signal sources. These multiple filters represent remaining degrees of freedom in terms of a sensor array response.
The consistent assignment of signal contributions to different source channels across different frequencies creates a frequency permutation problem. This problem is inherent to all source separation algorithms, including time domain algorithms, unless the algorithm simultaneously considers different frequency bands. Considering frequency bands jointly requires estimating polyspectral properties, which is particularly difficult for nonstationary signals such as speech, and the resulting algorithms are computationally expensive.
The basic source separation problem is described by assuming the existence of M uncorrelated, time-varying source signals $s(t) \in \mathbb{R}^M$, where $\mathbb{R}$ denotes the set of real numbers and the sources s(t) originate from different spatial locations. A number of sensors N (where $N \geq M$) detect time-varying signals $x(t) \in \mathbb{R}^N$. In a multipath environment each source j couples with sensor i through a linear transfer function $A_{ij}(\tau)$, representing the impulse response of the corresponding source-to-sensor path, such that:
$$x_i(t) = \sum_{j=1}^{M} \sum_{\tau=0}^{P-1} A_{ij}(\tau)\, s_j(t-\tau)$$
where P is the length of the impulse response of the environment, measured in samples. This equation can be rewritten using matrix notation (denoting the convolutions by *):
$$x(t) = A(t) * s(t).$$
After applying the discrete time Fourier transform, this equation may be rewritten as:
$$x(\omega) = A(\omega)\, s(\omega).$$
The goal of convolutive source separation is to find finite impulse response (FIR) filters $W_{ij}(\tau)$ that invert the effect of the convolutive mixing $A(\tau)$. This is equivalent to producing model sources
$$y(\omega) = W(\omega)\, x(\omega)$$
that correspond to the original sources s(t).
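The mixing model above can be sketched numerically. The following is a minimal illustration, assuming NumPy and randomly drawn impulse responses; the function name `convolutive_mix` and all dimensions are hypothetical choices for the example, not part of the referenced methods. It also checks that the time-domain convolution matches the per-frequency product x(ω)=A(ω)s(ω) when the transform length covers the full convolution.

```python
import numpy as np

def convolutive_mix(A, s):
    """Mix M sources into N sensor signals.

    A: shape (N, M, P); A[i, j, tau] is the impulse response from source j to sensor i.
    s: shape (M, T).
    Returns x of shape (N, T + P - 1) with
    x_i(t) = sum_j sum_tau A[i, j, tau] * s_j(t - tau), taking s_j as zero outside 0..T-1.
    """
    N, M, P = A.shape
    T = s.shape[1]
    x = np.zeros((N, T + P - 1))
    for i in range(N):
        for j in range(M):
            x[i] += np.convolve(A[i, j], s[j])  # full linear convolution
    return x

rng = np.random.default_rng(0)
M, N, P, T = 2, 3, 4, 64          # illustrative sizes (assumed)
A = rng.standard_normal((N, M, P))
s = rng.standard_normal((M, T))
x = convolutive_mix(A, s)

# Frequency-domain form: one N x M matrix product per frequency bin.
L = T + P - 1                      # long enough that circular == linear convolution
A_f = np.fft.fft(A, n=L, axis=-1)  # shape (N, M, L)
s_f = np.fft.fft(s, n=L, axis=-1)  # shape (M, L)
x_f = np.einsum('ijk,jk->ik', A_f, s_f)
```

Zero-padding to length T+P-1 is what makes the per-bin product exact here; with shorter analysis windows the product only approximates the convolution.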
Different criteria for convolutive separation have been proposed, for example as discussed by H. -L. N. Thi and C. Jutten in “BLIND SOURCE SEPARATION FOR CONVOLUTIVE MIXTURES”, published in Signal Processing, vol. 45, no. 2, pp. 209-229 (1995). A two channel example is disclosed in U.S. Pat. No. 5,208,786 entitled MULTI-CHANNEL SIGNAL SEPARATION, issued to Weinstein et al. The '786 patent models each channel as a multi-input-multi-output (MIMO) linear time invariant system. The input source signals are separated and recovered by requiring that the reconstructed signals be statistically uncorrelated. However, the decorrelation condition is insufficient to uniquely solve the problem unless one assumes that the unknown channel is a 2×2 MIMO finite impulse response filter.
All convolutive separation criteria can be derived from the assumption of statistical independence of the unknown signals, usually limited to pairwise independence of the source signals. Pairwise independence implies that all cross-moments can be factored, thereby creating a number of necessary conditions for the model signal sources:
$$\forall t, n, m, \tau,\ i \neq j:\quad E[y_i^n(t)\, y_j^m(t+\tau)] = E[y_i^n(t)]\, E[y_j^m(t+\tau)]. \qquad \text{(Equation I)}$$
Convolutive separation requires that these conditions be satisfied for multiple delays τ, which correspond to the delays of the filter taps of W(τ). For stationary signals, higher-order criteria (multiple n, m) are required. For non-stationary signals such as speech, multiple t can be used and multiple decorrelation (n=m=1) is sufficient.
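As a rough illustration of the second-order case (n=m=1) of Equation I, the sketch below estimates the difference between the two sides of the condition at several delays. It assumes NumPy and synthetic white-noise signals; the helper name `cross_covariances` is hypothetical. It only checks the conditions and is not a separation algorithm.

```python
import numpy as np

def cross_covariances(yi, yj, max_lag):
    """Estimate E[yi(t) yj(t+tau)] - E[yi(t)] E[yj(t+tau)] for tau = 0..max_lag."""
    T = len(yi)
    out = []
    for tau in range(max_lag + 1):
        a = yi[:T - tau]   # yi(t)
        b = yj[tau:]       # yj(t + tau)
        out.append(np.mean(a * b) - np.mean(a) * np.mean(b))
    return np.array(out)

rng = np.random.default_rng(1)
T = 20000
y1 = rng.standard_normal(T)
y2 = rng.standard_normal(T)

# Independent channels: all estimated values should be close to zero.
separated = cross_covariances(y1, y2, max_lag=3)

# A delayed leak of y1 into the second channel violates the condition at tau = 2.
leaky = y2.copy()
leaky[2:] += 0.5 * y1[:-2]
mixed = cross_covariances(y1, leaky, max_lag=3)
```

The leak appears only at the delay that matches it, which is why the conditions have to be checked over the whole range of filter tap delays rather than at zero lag alone.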
When using an independence criterion there remains both a permutation and a scaling ambiguity. In the convolutive case the scaling ambiguity applies to each frequency group or bin, resulting in a convolutive ambiguity for each source signal in the time domain: any delayed or convolved versions of independent signals remain independent. For the independence criterion in the frequency domain,
$$E[y_i^n(\omega)\, y_j^m(\omega)] = E[y_i^n(\omega)]\, E[y_j^m(\omega)],$$
there is a permutation ambiguity at each frequency for all orders n and m. For each frequency, the frequency-domain independence criterion is therefore also satisfied by arbitrary scaling and assignment of indices i, j to the model sources:
$$W(\omega)\, A(\omega) = P(\omega)\, S(\omega), \qquad \text{(Equation II)}$$
where P(ω) represents an arbitrary permutation matrix and S(ω) an arbitrary diagonal scaling matrix for each frequency. This creates the problem that contributions of a given signal source may not be assigned consistently to a single model source for different frequency bins. A given model source will therefore have contributions from different actual sources. The problem becomes more severe as the number of channels grows, because the number of possible permutations increases.
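The ambiguity in Equation II can be made concrete with a small numerical sketch. It assumes NumPy, random square mixing matrices, and uncorrelated sources with a diagonal cross-power matrix in each bin; all names and sizes are illustrative. Every bin receives its own arbitrary permutation and scaling, yet the output cross-power stays diagonal, so a second-order criterion evaluated per bin cannot tell these solutions apart.

```python
import numpy as np

rng = np.random.default_rng(2)
M, K = 2, 4   # number of sources and frequency bins (assumed)

def random_complex(shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

A = random_complex((K, M, M))                  # A(omega_k): one mixing matrix per bin
source_power = rng.uniform(0.5, 2.0, (K, M))   # diagonal source cross-power per bin

W = np.empty_like(A)
assignment = []   # assignment[k][i] = index of the true source carried by output i in bin k
for k in range(K):
    perm = rng.permutation(M)
    P = np.eye(M)[perm]                    # arbitrary permutation matrix P(omega_k)
    S = np.diag(random_complex(M))         # arbitrary diagonal scaling S(omega_k)
    W[k] = P @ S @ np.linalg.inv(A[k])     # satisfies W A = P S
    assignment.append(perm.tolist())

def max_offdiag(k):
    """Largest off-diagonal magnitude of the output cross-power in bin k."""
    G = W[k] @ A[k]
    Ryy = G @ np.diag(source_power[k]) @ G.conj().T
    return np.abs(Ryy - np.diag(np.diag(Ryy))).max()

offdiag = [max_offdiag(k) for k in range(K)]
```

Inspecting `assignment` shows which true source each output channel carries in each bin; whenever the entries differ across bins, a single model source mixes contributions from different actual sources.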
This problem has often been considered an artifact of the frequency domain formulation of the separation criteria, since the separation task is decoupled into independent separation tasks per frequency bin. For n=m=1, however, this ambiguity also applies to the time domain independence criteria set forth in Equation I. Even for higher orders, the time domain criteria do not guarantee correct permutations.
Some source separation work in the past simply ignored the problem. Others have proposed a number of solutions, such as exploiting continuity in the spectra of the model sources or the fact that the different frequency bins are often co-modulated. A rigorous way of capturing these statistical properties of multiple frequency contributions is to use polyspectra. However, in practice it is difficult to obtain robust statistics at multiple frequencies, in particular for non-stationary signals such as speech. In addition, algorithms that consider combinations of frequencies are inherently computationally demanding. Smoothness constraints on the filter coefficients in the frequency domain have also been proposed, as for example in U.S. Pat. No. 6,167,417 entitled CONVOLUTIVE BLIND SOURCE SEPARATION USING A MULTIPLE DECORRELATION METHOD, issued to Parra et al. This is equivalent to constraining the length of the filter relative to the size of the analysis window. However, this limitation on the filter size may not always be reasonable, as rather long filters are required in strongly reverberant environments.
In theory only N sensors are needed to separate M=N sources. In practice, however, one may want to use more microphones (N>M) to improve the performance of a real system. Ignoring the permutation and scaling ambiguities, Equation II reads W(ω)A(ω)=I, where I represents the identity matrix. For a given A(ω) there is an (N−M)-dimensional linear space of solutions W(ω), indicating that there are additional degrees of freedom when shaping a beam pattern represented by the filters W(ω).
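The extra degrees of freedom for N>M can be illustrated at a single frequency bin. This is a sketch assuming NumPy and a random N×M mixing matrix; the variable names are hypothetical. Any combination of row vectors z with zA=0 can be added to a left inverse of A without changing W(ω)A(ω)=I, and such row vectors span an (N−M)-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 2, 4   # two sources, four sensors (assumed)
A = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))  # A(omega) at one bin

W0 = np.linalg.pinv(A)            # one particular left inverse, shape (M, N)

# Row vectors z with z @ A = 0, taken from the full SVD of A.
U, _, _ = np.linalg.svd(A)        # U is an N x N unitary matrix
Z = U[:, M:].conj().T             # shape (N - M, N); each row annihilates A

C = rng.standard_normal((M, N - M))   # arbitrary coefficients: the free degrees of freedom
W1 = W0 + C @ Z                       # a different filter matrix that still inverts A
```

Both `W0` and `W1` separate the sources perfectly in this idealized setting but correspond to different beam patterns, which is the freedom that geometric constraints can be used to resolve.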
In conventional geometric and adaptive beamforming, information such as microphone position and source location is often utilized. Geometric assumptions can be incorporated and implemented as linear constraints on the filter coefficients. In a multiple sidelobe canceler, for example, the response of one of the channels (channel i) is kept constant, which can be expressed as $w(\omega)\, e_i = \text{constant}$. The elements of the row vector $w(\omega) \in \mathbb{C}^N$ are the filter elements to be applied to each microphone, and $e_i$ is the ith column of the identity matrix. This is similar to the normalization condition imposed on the diagonal terms of W that is conventionally applied in blind separation algorithms. Rather than constraining a channel, one can also constrain the response of a beamformer for a particular orientation.
If the locations and response characteristics of each microphone are known, one can compute the free-field response of a set of microphones and associated beamforming filters w(ω). For a position q, the phase and magnitude response is given by
$$r(\omega, q) = w(\omega)\, d(\omega, q),$$
where $d(\omega, q) \in \mathbb{C}^N$ represents the phase and magnitude response of the N microphones for a source located at q. For a linear array with omnidirectional microphones and a far-field source (at a distance much greater than the array aperture squared divided by the wavelength of interest), the microphone response depends approximately only on the angle θ=θ(q) between the source and the linear array; the ith element of d is
$$d_i(\omega, q) = d_i(\omega, \theta) = e^{-j\omega (p_i/c) \sin\theta},$$
where $p_i$ is the position of the ith microphone on the linear array and c is the wave propagation speed.
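A minimal sketch of the far-field response for a linear array follows, assuming NumPy, a sound speed of 343 m/s, and a four-microphone array with 5 cm spacing; the function names and all numeric values are illustrative assumptions. The weights are a plain delay-and-sum beam steered to 30 degrees, chosen only because it gives unit response in the look direction.

```python
import numpy as np

def steering_vector(omega, theta, positions, c=343.0):
    """d(omega, theta): far-field response of omnidirectional microphones on a line."""
    return np.exp(-1j * omega * (positions / c) * np.sin(theta))

def beam_response(w, omega, theta, positions, c=343.0):
    """r(omega, theta) = w(omega) d(omega, theta) for a row vector of weights w."""
    return w @ steering_vector(omega, theta, positions, c)

positions = np.arange(4) * 0.05     # microphone positions p_i in metres (assumed)
omega = 2 * np.pi * 2000.0          # angular frequency for 2 kHz
theta0 = np.deg2rad(30.0)           # look direction

# Delay-and-sum weights: conjugate of the steering vector, normalized by N.
w = steering_vector(omega, theta0, positions).conj() / len(positions)

r_look = beam_response(w, omega, theta0, positions)
r_other = beam_response(w, omega, np.deg2rad(-40.0), positions)
```

Evaluating `beam_response` over a grid of angles gives the beam pattern of any weight vector, including one row of a separating matrix W(ω).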
Constraining the response to a particular orientation is simply expressed by the linear constraint on w(ω) such that r(ω,θ)=w(ω)d(ω,θ)=constant. This concept is used in the linearly constrained minimum variance (LCMV) algorithm and is also the underlying idea for generalized sidelobe cancelling. In order to obtain a robust beam it has also been suggested to require a smooth response around a desired orientation. In summary, all of these conditions or a combination of them can be expressed as linear constraints on w(ω).
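The single-constraint case of this idea can be sketched as follows, assuming NumPy, the same illustrative array geometry as a far-field linear array with 5 cm spacing, and a synthetic covariance made of one interferer plus white noise. The closed form used, w = dᴴR⁻¹ / (dᴴR⁻¹d), is the standard minimum-variance solution under the constraint w d = 1; the function name and all values are hypothetical.

```python
import numpy as np

def min_variance_weights(R, d):
    """Row vector w minimizing w R w^H subject to the linear constraint w d = 1."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d.conj() / np.real(d.conj() @ Rinv_d)

def output_power(w, R):
    """Output power w R w^H for sensor covariance R."""
    return np.real(w @ R @ w.conj())

positions = np.arange(4) * 0.05     # assumed geometry, metres
omega, c = 2 * np.pi * 2000.0, 343.0

def steer(theta):
    return np.exp(-1j * omega * (positions / c) * np.sin(theta))

d = steer(np.deg2rad(30.0))         # constrained (look) direction
v = steer(np.deg2rad(-40.0))        # interferer direction
R = 0.01 * np.eye(4) + np.outer(v, v.conj())   # white noise plus one interferer

w_mv = min_variance_weights(R, d)
w_ds = d.conj() / 4                 # delay-and-sum weights, also satisfy w d = 1

p_mv = output_power(w_mv, R)
p_ds = output_power(w_ds, R)
```

Both weight vectors meet the same constraint, so the difference in output power comes entirely from how each one treats the interferer direction.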
Most adaptive beamforming algorithms consider power as their main optimization criterion. Sometimes power is minimized, as in noise or sidelobe cancelling, in order to adaptively minimize the response at the orientation of the interfering signals. Sometimes power is maximized, as in matched filter approaches, to maximize the response of interest. As a result, these algorithms often perform suboptimally when there is cross talk from other sources.
In second order source separation methods, rather than considering the power of an individual beam $w(\omega) \in \mathbb{C}^{1 \times N}$ and an individual channel $y(t) \in \mathbb{R}^{1}$, one can consider powers and cross powers of multiple beams $W(\omega) \in \mathbb{C}^{M \times N}$ and their corresponding outputs $y(t) \in \mathbb{R}^{M}$. In the frequency domain these multiple beams and outputs correspond to the cross power spectra $R_{yy}(t, \omega)$.
Second order blind source separation of nonstationary signals minimizes cross powers across multiple times. In second order separation, the off-diagonal elements of the matrix $R_{yy}(t, \omega)$ are minimized, rather than the diagonal terms as in conventional adaptive beamforming. Strict single-channel power criteria have a serious cross talk or leakage problem when multiple sources are simultaneously active, especially in reverberant environments.
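The off-diagonal criterion can be sketched for a single frequency bin, assuming NumPy, a random square mixing matrix, and nonstationary sources modeled as a diagonal power matrix that changes from one time segment to the next. The function name `offdiag_cost` and the setup are illustrative assumptions, not the specific cost function of any cited method.

```python
import numpy as np

def offdiag_cost(W, Rxx):
    """Sum over time segments of squared off-diagonal magnitudes of W Rxx(t) W^H."""
    cost = 0.0
    for R in Rxx:
        Ryy = W @ R @ W.conj().T                  # output cross-power for this segment
        off = Ryy - np.diag(np.diag(Ryy))         # keep only the cross powers
        cost += np.sum(np.abs(off) ** 2)
    return cost

rng = np.random.default_rng(4)
M, n_segments = 2, 5
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
powers = rng.uniform(0.1, 2.0, (n_segments, M))   # source powers per segment

# Sensor cross-power per segment: Rxx(t) = A diag(powers_t) A^H.
Rxx = [A @ np.diag(p) @ A.conj().T for p in powers]

cost_unmixed = offdiag_cost(np.linalg.inv(A), Rxx)   # ideal separating matrix
cost_identity = offdiag_cost(np.eye(M), Rxx)         # no separation at all
```

Using several segments with different source powers is what makes the criterion informative: a single segment can be diagonalized by many matrices that do not separate the sources.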