Decomposing multi-way, non-negative data is currently a field of much interest and study because such data have an intrinsically rich structure and appear naturally in many real-world datasets. In document clustering, the data can be represented as a three-way dataset of author×term×time. In email communications, the data can be represented as sender×receiver×time. In web page personalization, the data can be represented as user×query word×web page. In high-order web link analysis, the data can be represented as a three-way dataset of web page×web page×anchor text. A traditional matrix decomposition unwraps the tensor into multiple two-dimensional matrices and thereby assumes only pair-wise relationships between two dimensions; tensor decomposition methods instead consider the more complex relationships that exist among all of the dimensions.
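By way of illustration only, the following minimal sketch shows how a sender×receiver×time dataset might be held as a three-way count tensor and then unwrapped into a two-dimensional matrix. The records, the names `records`, `tensor`, and `unfold`, and the use of NumPy are assumptions made for this example and are not drawn from any particular system.

```python
import numpy as np

# Hypothetical (sender, receiver, time-slot) email records; illustrative only.
records = [
    ("alice", "bob", 0),
    ("alice", "bob", 1),
    ("bob", "carol", 1),
    ("carol", "alice", 2),
    ("alice", "bob", 1),
]

# Map each person to an integer index along the sender and receiver modes.
people = sorted({p for s, r, _ in records for p in (s, r)})
index = {name: i for i, name in enumerate(people)}
n_slots = 1 + max(t for _, _, t in records)

# Three-way count tensor: sender x receiver x time.
tensor = np.zeros((len(people), len(people), n_slots))
for s, r, t in records:
    tensor[index[s], index[r], t] += 1


def unfold(x, mode):
    """Unwrap x along `mode`: rows index that mode, columns index all other modes."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


# Sender-by-(receiver, time) matrix; the receiver and time modes are merged
# into a single column index, so their separate structure is no longer explicit.
sender_view = unfold(tensor, 0)
```

In the unwrapped matrix each column stands for one (receiver, time) pair, which is why a matrix decomposition applied to it can capture only the relationship between senders and those merged pairs.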
Non-negative Matrix Factorization (NMF) techniques, developed for applications in linear algebra, are mainly used in pattern recognition and dimensionality reduction. NMF performs a low-rank matrix factorization, comparable to singular value decomposition, but with non-negativity constraints on the factors. The NMF fitting algorithm minimizes the Euclidean distance (the least-squares error) or the KL-divergence (I-divergence) between the original matrix and the reconstructed matrix, using multiplicative update rules to ensure non-negativity. Probabilistic Latent Semantic Analysis (PLSA), developed in statistics to decompose non-negative data, uses latent class models or aspect models to perform a probabilistic mixture decomposition. PLSA is often used in natural language processing, information retrieval, text mining, and related areas. NMF and PLSA can be naturally extended to multi-way non-negative data; the extensions are called Non-negative Tensor Factorization (NTF) and Tensorial Probabilistic Latent Semantic Analysis (T-PLSA), respectively. Both are multi-dimensional factorization techniques that can be applied to tensor decomposition, and each has different advantages and costs. Designers of multi-dimensional cluster identification processing systems and methods often have to choose one analysis technique over the other and accept the inherent tradeoffs.
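The multiplicative update rules mentioned above can be sketched as follows for the Euclidean (least-squares) objective. This is a minimal illustration in the style of the widely known Lee and Seung updates, not the fitting procedure of any particular system; the function name `nmf`, its parameters, the small constant `eps`, and the example matrix `V` are assumptions introduced for this example. The divergence-based objective uses different update rules and is not shown.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (n x m) as W @ H with W (n x rank), H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive random start; multiplicative updates cannot leave zero.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled elementwise by a ratio of non-negative terms,
        # so W and H stay non-negative without any explicit projection step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical non-negative matrix with two obvious groups of rows and columns.
V = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 6.0],
])

W, H = nmf(V, rank=2)
error = np.linalg.norm(V - W @ H)  # Euclidean distance to the reconstruction
```

Because every update multiplies the current factor by a non-negative ratio, non-negativity is preserved by construction, which is the property the multiplicative form is chosen for.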
Accordingly, what is needed in this art are increasingly sophisticated systems and methods for identifying clusters within data sets based upon multi-dimensional relationships and for analyzing the probabilistic relationships between documents and document content.