Non-negative matrix factorization (NMF) is commonly used for challenging single-channel audio source separation tasks, such as speech enhancement in the presence of non-stationary noises. In this context, the idea is to represent features of the source signals as sets of basis functions and their activation coefficients, one set per source signal. Mixtures of the source signals are then analyzed using the concatenated sets of basis functions, and each source signal is reconstructed using its corresponding activation coefficients and the set of basis functions.
NMF operates on a matrix of $F$-dimensional non-negative spectral features, usually a power or magnitude spectrogram of the mixture, $M = [m_1 \ldots m_T]$, where $T$ is the number of frames and $m_t \in \mathbb{R}_+^F$, $t = 1, \ldots, T$, are obtained by short-time Fourier analysis of the time-domain signal.
For the general case of separating $S$ sources, a set of $R_l$ non-negative basis vectors $w_1^l, \ldots, w_{R_l}^l$ is assumed for each source $l \in \{1, \ldots, S\}$, and concatenated into matrices $W^l = [w_1^l \ldots w_{R_l}^l]$. From this, a factorization
$$M \approx WH = [W^1 \ldots W^S][H^1; \ldots; H^S] \qquad (1)$$
is obtained, where we use the notation $[a; b] = [a^{\mathsf T}\ b^{\mathsf T}]^{\mathsf T}$ and $a^{\mathsf T}$ denotes the transpose of $a$.
An approach related to Wiener filtering is typically used to reconstruct each source, while ensuring that the source estimates sum to the mixture:
$$\hat{S}^l = \frac{W^l H^l}{\sum_{l} W^l H^l} \otimes M, \qquad (2)$$
where $\otimes$ denotes element-wise multiplication and the quotient line element-wise division. $W^l$ can be determined in advance from training data, and at run time only the activation matrices $H^l = [h_1^l \ldots h_T^l]$, where $h_t^l \in \mathbb{R}_+^{R_l}$, are estimated. This is called supervised NMF. In the supervised case, the activations for each frame are independent of the other frames ($m_t \approx \sum_l W^l h_t^l$). Thus, source separation can be performed on-line, with latency corresponding to the window length plus the computation time to obtain the activations for one frame.
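By way of illustration only, the reconstruction of equation (2) can be sketched in Python with NumPy as follows. The function name `reconstruct_sources`, the small constant `eps`, and the random toy matrices are assumptions made for this sketch and are not part of the described method.

```python
import numpy as np

def reconstruct_sources(M, W_list, H_list, eps=1e-12):
    """Equation (2): scale the mixture M by each source's share of the model."""
    # Per-source model spectrograms W^l H^l
    models = [W @ H for W, H in zip(W_list, H_list)]
    # Sum over all sources; eps (an assumption) guards against division by zero
    total = sum(models) + eps
    # Element-wise division and multiplication
    return [model / total * M for model in models]

# Toy data: two sources, F = 6 features, T = 4 frames, R = 3 bases each
rng = np.random.default_rng(0)
F, T, R = 6, 4, 3
W_list = [rng.random((F, R)) for _ in range(2)]
H_list = [rng.random((R, T)) for _ in range(2)]
M = rng.random((F, T))

S_hat = reconstruct_sources(M, W_list, H_list)
```

Because the per-source ratios sum to one in every time-frequency bin, the estimates in `S_hat` add up to the mixture `M`, up to the effect of `eps`.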
Another operating mode for NMF-based source separation is semi-supervised NMF, in which the $W^l$ for only some of the sources $l \in L_{\text{trained}}$ are determined in advance from training data, and at run time both the basis matrices $W^l$ for the other sources $l \notin L_{\text{trained}}$ and the activation matrices $H^l$ for all sources ($l \in L_{\text{trained}}$ and $l \notin L_{\text{trained}}$) are estimated.
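A minimal sketch of the semi-supervised mode is given below. The text does not specify how the untrained bases are estimated, so this sketch assumes the generalized KL divergence, no sparsity penalty, and standard alternating multiplicative updates. The function name `semi_supervised_nmf` and all parameter names are hypothetical.

```python
import numpy as np

def semi_supervised_nmf(M, W_trained, R_free, n_iter=100, seed=0, eps=1e-12):
    """Keep W_trained fixed; estimate free bases and all activations (KL, assumed)."""
    rng = np.random.default_rng(seed)
    F, T = M.shape
    R_tr = W_trained.shape[1]
    W_free = rng.random((F, R_free)) + eps      # bases for the untrained sources
    H = rng.random((R_tr + R_free, T)) + eps    # activations for all sources
    for _ in range(n_iter):
        W = np.hstack([W_trained, W_free])
        # Update activations of all sources
        Lam = W @ H + eps
        H *= (W.T @ (M / Lam)) / (W.sum(axis=0)[:, None] + eps)
        # Update only the free bases; the trained bases are never modified
        Lam = W @ H + eps
        H_free = H[R_tr:]
        W_free *= ((M / Lam) @ H_free.T) / (H_free.sum(axis=1)[None, :] + eps)
    return W_free, H

# Toy data: 3 pre-trained bases for one source, 2 free bases for another
rng = np.random.default_rng(2)
W_speech = rng.random((8, 3))
M = rng.random((8, 12)) + 0.1
W_noise, H = semi_supervised_nmf(M, W_speech, R_free=2, n_iter=100)
```

In this sketch the first three rows of `H` belong to the pre-trained source and the remaining rows to the source whose bases `W_noise` are learned from the mixture itself.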
At test time, supervised NMF determines the optimal activations H such that
$$\hat{H} = [\hat{H}^1; \ldots; \hat{H}^S] = \operatorname*{argmin}_{H} D(M \mid WH) + \mu |H|_1, \qquad (3)$$
where $D$ is a cost function that is minimized when $M = WH$. Typical choices for $D$ include the β-divergence $D_\beta$, which yields the generalized Kullback-Leibler (KL) divergence for $\beta = 1$ and the Euclidean distance for $\beta = 2$. An L1 sparsity constraint with weight $\mu$ is added to favor solutions where few basis vectors are active at a time. A convenient procedure for minimizing equation (3) that preserves non-negativity of $H$ by multiplicative updates is given by iterating
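$$H^{(q+1)} = H^{(q)} \otimes \frac{W^{\mathsf T}\left(M \otimes (\Lambda^{(q)})^{\beta-2}\right)}{W^{\mathsf T}(\Lambda^{(q)})^{\beta-1} + \mu}, \qquad 0 \le q < Q,$$
until convergence, with $\Lambda^{(q)} := WH^{(q)}$, where the superscripts $(q)$ and $(q+1)$ indicate iterates, exponents are applied element-wise, and $Q \ge 1$ gives the maximum number of iterations. $H^{(0)}$ is initialized randomly.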
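The multiplicative update for the activations can be sketched as follows, purely as an illustration. The function names `beta_divergence` and `infer_activations`, the constant `eps`, the fixed iteration count in place of a convergence test, and the toy data are assumptions of this sketch. The divergence helper covers only β = 1 and values of β other than 0 and 1.

```python
import numpy as np

def beta_divergence(M, L, beta, eps=1e-12):
    """Summed element-wise beta-divergence D_beta(M | L)."""
    if beta == 1:  # generalized Kullback-Leibler divergence
        return np.sum(M * np.log((M + eps) / (L + eps)) - M + L)
    if beta == 0:
        raise ValueError("beta = 0 is not handled in this sketch")
    # General case; beta = 2 gives half the squared Euclidean distance
    return np.sum((M**beta + (beta - 1) * L**beta - beta * M * L**(beta - 1))
                  / (beta * (beta - 1)))

def infer_activations(M, W, beta=1.0, mu=0.0, n_iter=100, seed=0, eps=1e-12):
    """Iterate the multiplicative update for H with fixed bases W."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], M.shape[1])) + eps  # random non-negative start
    for _ in range(n_iter):
        Lam = W @ H + eps                            # current model of the mixture
        num = W.T @ (M * Lam ** (beta - 2))
        den = W.T @ (Lam ** (beta - 1)) + mu
        H *= num / (den + eps)                       # keeps H non-negative
    return H

# Toy data: a mixture that the bases can represent exactly
rng = np.random.default_rng(1)
W = rng.random((8, 4))
M = W @ rng.random((4, 10))
mu = 0.1

def objective(H):
    # Cost of equation (3) for beta = 1
    return beta_divergence(M, W @ H, 1.0) + mu * H.sum()

H_1 = infer_activations(M, W, beta=1.0, mu=mu, n_iter=1)
H_200 = infer_activations(M, W, beta=1.0, mu=mu, n_iter=200)
```

Since every factor in the update is non-negative, `H` stays non-negative without an explicit projection, and `objective(H_200)` should not exceed `objective(H_1)` for this choice of β.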
Because sources often have similar characteristics in the short-term observations (such as unvoiced phonemes and broadband noise, or voiced phonemes and music), it is beneficial to use information from multiple time frames. This can be performed by stacking features: the observation $m'_t$ at time $t$ corresponds to the observations $[m_{t-T_L}; \ldots; m_t; \ldots; m_{t+T_R}]$, where $T_L$ and $T_R$ are the left and right context sizes. Analogously, each basis element ${w'}_k^l$ models a sequence of spectra, stacked into a column vector. For readability, the $'$ is subsequently dropped.
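The stacking of context frames can be sketched as below. The text does not say how frames near the signal boundaries are treated, so repeating the edge frame is an assumption of this sketch, as is the function name `stack_context`.

```python
import numpy as np

def stack_context(M, T_L, T_R):
    """Stack each frame with T_L left and T_R right context frames."""
    F, T = M.shape
    # Assumption: pad by repeating the first and last frames
    padded = np.pad(M, ((0, 0), (T_L, T_R)), mode="edge")
    # Block k holds frame t - T_L + k for every time index t
    blocks = [padded[:, k:k + T] for k in range(T_L + T_R + 1)]
    return np.vstack(blocks)

# Toy spectrogram with F = 3 features and T = 4 frames
M = np.arange(12, dtype=float).reshape(3, 4)
stacked = stack_context(M, T_L=1, T_R=2)
```

With one left and two right context frames, each stacked observation has (1 + 1 + 2) x 3 = 12 dimensions, and the number of frames is unchanged.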
Obtaining NMF Bases
A common approach to obtaining bases $W^l$ that fit an NMF model $W^l H^l$ to the spectrograms of source signals $S^l$ is to minimize a reconstruction objective $D_\beta(S^l \mid W^l H^l)$. To learn overcomplete bases, one can use sparse NMF (SNMF), performing the minimization
$$\bar{W}^l, \bar{H}^l \leftarrow \operatorname*{argmin}_{W^l, H^l} D_\beta(S^l \mid \tilde{W}^l H^l) + \mu |H^l|_1 \qquad (4)$$
for each $l$, where
$$\tilde{W}^l = \left[\frac{w_1^l}{\|w_1^l\|} \;\ldots\; \frac{w_{R_l}^l}{\|w_{R_l}^l\|}\right]$$
is the column-wise normalized version of $W^l$.
Because the L1 sparsity constraint on $H$ is not scale-invariant, the constraint by itself can be minimized by scaling the factors: multiplying $W^l$ by a constant and dividing $H^l$ by the same constant leaves the product unchanged while shrinking the L1 term. Including the normalization on $W^l$ in the cost function avoids this scale indeterminacy. This is not the same as performing conventional NMF optimization and scaling one of the factors to unit norm after each iteration, which is often the way sparsity is implemented in NMF, and which is denoted hereafter by NMF+S. A multiplicative update procedure can be used to optimize equation (4) for arbitrary $\beta \ge 0$.
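The scale indeterminacy can be illustrated numerically with the following sketch. The L2 norm is assumed for the column normalization, since the norm is not specified in the text, and the helper name `normalize_columns` and the scale factor `c` are hypothetical.

```python
import numpy as np

def normalize_columns(W, eps=1e-12):
    """Column-wise normalization of W (L2 norm assumed for this sketch)."""
    return W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
W = rng.random((5, 3))
H = rng.random((3, 4))

# Without normalization: rescaling the factors leaves W H unchanged
# but shrinks the L1 penalty on H by the factor c
c = 10.0
W_scaled, H_scaled = W * c, H / c
same_product = np.allclose(W_scaled @ H_scaled, W @ H)
l1_before, l1_after = H.sum(), H_scaled.sum()

# With normalization inside the model: rescaling W has no effect,
# so the penalty can no longer be reduced for free
same_normalized = np.allclose(normalize_columns(W_scaled), normalize_columns(W))
```

Here `same_product` and `same_normalized` are both true, while `l1_after` is one tenth of `l1_before`, which is the degenerate behavior that the normalized model rules out.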
As an alternative to sparse NMF training, exemplar-based approaches, where every basis function corresponds to an observation of the source l in the training data, are often used in practice for large-scale factorizations of audio signals.
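An exemplar-based basis matrix can be sketched as follows. Uniform random sampling of training frames and L2 normalization of the selected columns are assumptions of this sketch, because the text does not state how exemplars are selected or scaled. The function name `sample_exemplar_bases` is hypothetical.

```python
import numpy as np

def sample_exemplar_bases(S_train, n_bases, seed=0, eps=1e-12):
    """Use randomly chosen training frames of one source as basis vectors."""
    rng = np.random.default_rng(seed)
    # Assumption: draw distinct frame indices uniformly at random
    idx = rng.choice(S_train.shape[1], size=n_bases, replace=False)
    W = S_train[:, idx]
    # Assumption: scale each exemplar to unit L2 norm
    W = W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)
    return W, idx

# Toy training spectrogram for one source: F = 6 features, 50 frames
rng = np.random.default_rng(3)
S_train = rng.random((6, 50)) + 0.1
W, idx = sample_exemplar_bases(S_train, n_bases=10)
```

No optimization over the bases is needed in this sketch, which is one reason such approaches scale to large factorizations.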
Discriminative Approach to NMF
In the model underlying the above source separation process, separately trained source models are concatenated to yield a model of the mixture. This comes with the benefit of modularity. Models of different sources can be substituted for one another without having to retrain the whole system. However, this type of model also has a fundamental flaw. The objectives in equations (3) and (4), used at test and training time respectively, are considerably different. The test-time inference objective (3) operates on a mixture, while the training objective (4) operates on separated sources.
If there is spectral overlap in the bases of the different sources, which cannot be avoided in the general case, such as for speech/noise separation, the activations obtained using equation (3) differ from those obtained using equation (4). Equation (4) cannot be used at test time, because $S^l$ is unknown. Hence, a discriminative approach can take into account the objective function from equation (3) at training time.
This requires mixtures $M$, along with their ground-truth separated sources $S^l$, to be available for parallel training. However, supervised NMF also assumes the availability of separated training signals for all sources, and assumes simple linear mixing of the sources at test time. Generating the mixtures from the training signals for parallel training therefore requires no additional assumptions.
The bases are trained discriminatively by solving the following optimization problem, referred to as discriminative NMF (DNMF):
$$\hat{W} = \operatorname*{argmin}_{W} \sum_l \gamma_l D_\beta\left(S^l \mid W^l \hat{H}^l(M, W)\right), \qquad (5)$$
where
$$\hat{H}(M, W) = \operatorname*{argmin}_{H} D_\beta(M \mid \tilde{W} H) + \mu |H|_1, \qquad (6)$$
and $\gamma_l$ are weights accounting for the application-dependent importance of the source $l$. For example, in speech denoising, the focus is on reconstructing the speech signal, and the weight corresponding to the speech source can be set to 1 while the other weights are set to 0.
Equation (5) minimizes the reconstruction error given $\hat{H}$. Equation (6) ensures that $\hat{H}$ are the activations that arise from the test-time inference objective. Note that, in equation (5), $W$ does not need normalization. Given the bases $W$, the activations $\hat{H}(M, W)$ are uniquely determined, due to the convexity of equation (6).
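To make the two-level structure of equations (5) and (6) concrete, the following sketch evaluates the DNMF objective for a given set of bases. It does not perform the optimization over the bases. The generalized KL divergence (β = 1), the L2 column normalization, the all-ones initialization, the fixed iteration count, and all function and variable names are assumptions of this sketch.

```python
import numpy as np

def kl_div(S, L, eps=1e-12):
    """Generalized KL divergence, summed over all entries."""
    return np.sum(S * np.log((S + eps) / (L + eps)) - S + L)

def infer_H(M, W_tilde, mu, n_iter=200, eps=1e-12):
    """Equation (6) for beta = 1: activations on the mixture, normalized bases."""
    H = np.ones((W_tilde.shape[1], M.shape[1]))  # deterministic start (assumption)
    for _ in range(n_iter):
        Lam = W_tilde @ H + eps
        H *= (W_tilde.T @ (M / Lam)) / (W_tilde.sum(axis=0)[:, None] + mu)
    return H

def dnmf_objective(M, S_list, W_list, gammas, mu=0.1):
    """Equation (5): weighted per-source reconstruction error given H_hat(M, W)."""
    W = np.hstack(W_list)
    W_tilde = W / np.linalg.norm(W, axis=0, keepdims=True)  # used in (6) only
    H_hat = infer_H(M, W_tilde, mu)
    cost, start = 0.0, 0
    for S, W_l, gamma in zip(S_list, W_list, gammas):
        R_l = W_l.shape[1]
        H_l = H_hat[start:start + R_l]           # activations of source l
        cost += gamma * kl_div(S, W_l @ H_l)     # unnormalized W^l, as in (5)
        start += R_l
    return cost

# Toy data: two sources mixed linearly
rng = np.random.default_rng(4)
S_list = [rng.random((8, 10)) + 0.1 for _ in range(2)]
M = S_list[0] + S_list[1]
W_list = [rng.random((8, 3)) + 0.1 for _ in range(2)]

speech_only = dnmf_objective(M, S_list, W_list, gammas=[1.0, 0.0])
noise_only = dnmf_objective(M, S_list, W_list, gammas=[0.0, 1.0])
both = dnmf_objective(M, S_list, W_list, gammas=[1.0, 1.0])
```

Setting the weights to `[1.0, 0.0]` corresponds to the speech denoising example above, in which only the reconstruction of the first source contributes to the training cost.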