The present invention generally relates to document analysis, and more specifically, to inferring topic evolution and emergence in streaming documents.
Learning a dictionary of basis elements with the objective of building compact data representations is a problem of fundamental importance in statistics, machine learning and signal processing. In many settings, data points appear as a stream of high-dimensional feature vectors. Streaming datasets add new requirements to the problem. On the one hand, basis elements need to be dynamically adapted to the statistics of incoming data points; on the other hand, many applications require early detection of rising new trends. The analysis of social media streams formed by tweets and blog posts is a prime example of such a setting, where topics of social discussion need to be continuously tracked and newly emerging themes need to be rapidly detected.
Consider the problem of building compact, dynamic representations of streaming datasets such as those that arise in social media. By constructing such representations, “signal” can be separated from “noise” and essential data characteristics can be continuously summarized in terms of a small number of human-interpretable components. In the context of social media applications, this maps to the discovery of unknown “topics” from a streaming document collection. Each new batch of documents arriving at a timepoint is completely unorganized and may do any combination of the following: contribute to ongoing unknown topics of discussion (potentially causing underlying topics to drift over time), initiate new themes that may or may not become significant going forward, or simply inject irrelevant “noise”.
While the dominant body of previous work in dictionary learning and topic modeling has focused on solving batch learning problems, a real deployment scenario in social media applications requires some form of online learning. The user of such a system is less interested in a one-time analysis of topics in a document archive, and more interested in following ongoing, evolving discussions and staying alert to any emerging themes that might require immediate action. Several papers have proposed dynamic topic models and online dictionary learning models (see [D. Blei and J. Lafferty, Dynamic topic models, in ICML, 2006; Tzu-Chuan Chou and Meng Chang Chen, Using Incremental PLSI for Threshold-Resilient Online Event Analysis, IEEE Transactions on Knowledge and Data Engineering, 2008; A. Gohr, A. Hinneburg, R. Schult, and M. Spiliopoulou, Topic evolution in a stream of documents, in SDM, 2009; and J. Mairal, F. Bach, J. Ponce and G. Sapiro, Online learning for matrix factorization and sparse coding, JMLR, 2010] and references therein). These models either exploit the temporal order of documents in offline batch mode or are limited to handling a fixed bandwidth of topics, with no explicit algorithmic constructs for detecting emerging themes early.