Non-negative matrix factorization (NMF) has been described as a positive matrix factorization, see Paatero, “Least Squares Formulation of Robust Non-Negative Factor Analysis,” Chemometrics and Intelligent Laboratory Systems 37, pp. 23-35, 1997. Since its inception, NMF has been applied successfully in a variety of applications, despite a less than rigorous statistical underpinning.
Lee, et al, in “Learning the parts of objects by non-negative matrix factorization,” Nature, Volume 401, pp. 788-791, 1999, describe NMF as an alternative technique for dimensionality reduction. There, non-negativity constraints are enforced during matrix construction in order to determine parts of human faces from a single image.
However, that system is restricted within the spatial confines of a single image. That is, the signal is strictly stationary. It is desired to extend NMF for time series data streams. Then, it would be possible to apply NMF to the problem of source separation for single channel inputs.
Non-Negative Matrix Factorization
The conventional formulation of NMF is defined as follows. Starting with a complex non-negative M×N matrix Vε≧0,M×N, the goal is to approximate the matrix V as a product of two simple non-negative matrices Wε≧0,M×R and Hε≧0,M×N, where R≦M, and an error is minimized when the matrix V is reconstructed approximately by W·H.
The error of the reconstruction can be measured using a variety of cost functions. Lee et al., use a cost function:
                              D          =                                                                                    V                  ⊗                                      ln                    ⁡                                          (                                              V                                                  W                          ·                          H                                                                    )                                                                      -                V                +                                  W                  ·                  H                                                                    F                          ,                            (        1        )            where ∥·∥F is the Frobenius norm, and {circle around (×)} is the Hadamard product, i.e., an element-wise multiplication. The division is also element-wise.
Lee et al., in “Algorithms for Non-Negative Matrix Factorization,” Neural Information Processing Systems 2000, pp. 556-562, 2000, describe an efficient multiplicative update process for optimizing the cost function without a need for constraints to enforce non-negativity:
                              H          =                      H            ⊗                                                            W                  ⊤                                ·                                  V                                      W                    ·                    H                                                                                                W                  ⊤                                ·                1                                                    ,                  W          =                      W            ⊗                                                            V                                      W                    ·                    H                                                  ·                                  H                  ⊤                                                            1                ·                                  H                  ⊤                                                                    ,                            (        2        )            where 1 is an M×N matrix with all its elements set to unity, and the divisions are again element-wise. The variable R corresponds to the number of basis functions to extract. The variable R is usually set to a small number so that the NMF results into a low-rank approximation.
NMF for Sound Object Extraction
It has been shown that sequentially applying principle component analysis (PCA) and independent component analysis (ICA) on magnitude short-time spectra results in decompositions that enable the extraction of multiple sounds from single-channel inputs, see Casey et al., “Separation of Mixed Audio Sources by Independent Subspace Analysis,” Proceedings of the International Computer Music Conference, August, 2000, and Smaragdis, “Redundancy Reduction for Computational Audition, a Unifying Approach,” Doctoral Dissertation, MAS Dept., Massachusetts Institute of Technology, Cambridge Mass., USA, 2001.
It is desired to provide a similar formulation using NMF.
Consider a sound scene s(t), and its short-time Fourier transform arranged into an M×N matrix:
                              F          =                      DFT            ⁡                          [                                                                                          s                      ⁡                                              (                                                  t                          1                                                )                                                                                                                        s                      ⁡                                              (                                                  t                          2                                                )                                                                                                                                                                                                                          s                      ⁡                                              (                                                  t                          N                                                )                                                                                                                                  ⋮                                                        ⋮                                                        ⋯                                                        ⋮                                                                                                              s                      ⁡                                              (                                                                              t                            1                                                    +                          M                          -                          1                                                )                                                                                                                        s                      ⁡                                              (                                                                              t                            2                                                    +                          M                          -                          1                                                )                                                                                                                                                                                                                          s                      ⁡                                              (                                                                              t                            N                                                    +                          M                          -                          1                                                )                                                                                                        ]                                      ,                            (        3        )            where M is a size of the discrete Fourier transform (DFT), and N is a total number of frames processed. Ideally, some window function is applied to the input sound signal to improve the spectral estimation. However, because the window function is not a crucial addition, it is omitted for notational simplicity.
From the matrix FεM×R, the magnitude of the transform V=|F|, i.e., Vε≧0,M×R can be extracted, and then, the NMF can be applied.
To better understand this operation, consider the plots 100 of a spectrogram 101, spectral bases 102 and corresponding time weights 103 in FIG. 1. The plot 101 on the lower right is the input magnitude spectrogram. The plot 101 represents two sinusoidal signals with randomly gated amplitudes. Note, that the signals are from a single source, or monophonic signal.
The two columns of the matrix W 102, interpreted as spectral bases, are shown in the lower left. The rows of H 103, depicted in the top, are the time weights corresponding to the two spectral bases of the matrix W. There is one row of weights for each column of bases.
It can be seen that this spectrogram defines an acoustic scene that is composed of sinusoids of two frequencies ‘beeping’ in and out in some random manner. By applying a two-component NMF to this signal, the two factors W and H can be obtained as shown in FIG. 1.
The two columns of W, shown in the lower left plot 102, only have energy at the two frequencies that are present in the input spectrogram 101. These two columns can be interpreted as basis functions for the spectra contained in the spectrogram.
Likewise the rows of H, shown in the top plot 103, only have energy at the time points where the two sinusoids have energy. The rows of H can be interpreted as the weights of the spectral bases at each time instance. The bases and the weights have a one-to-one correspondence. The first basis describes the spectrum of one of the sinusoids, and the first weight vector describes the time envelope of the spectrum. Likewise, the second sinusoid is described in both time and frequency by the second bases and second weight vector.
In effect, the spectrogram of FIG. 1 provides a rudimentary description of the input sound scene. Although the example in FIG. 1 is simplistic, the general method is powerful enough to dissect even a piece of complex piano music to a set of weights and spectral bases describing each note played and its position in time for that note, effectively performing musical transcription, see Smaragdis et al., “Non-Negative Matrix Factorization for Polyphonic Music Transcription,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2003, and U.S. patent application Ser. No. 10/626,456, filed on Jul. 23, 2003, titled “Method and System for Detecting and Temporally Relating Components in Non-Stationary Signals,” incorporated herein by reference.
The above described method works well for many audio tasks. However, that method does not take into account relative positions of each spectrum, thereby discarding temporal information. Therefore, it is desired to extend the conventional NMF so that it can be applied to multiple time series data streams so that source separation is possible from single channel input signals.