The management of large and growing collections of textual information is a challenging problem. Repositories of knowledge-rich text have become widely accessible, leaving users with an overwhelming amount of data to organize, explore, and analyze. As the number of documents increases, learning the meaning of a text corpus becomes cognitively costly and time-consuming.
The automated summarization of large text corpora has been a primary area of interest for researchers in the Natural Language Processing (NLP) domain. To summarize a text corpus, researchers have developed techniques such as Latent Semantic Analysis (LSA) for extracting and representing the contextual-usage meaning of words. LSA produces a concept space that can be used for document classification and clustering. More recently, probabilistic topic models have emerged as a powerful technique for finding semantically meaningful topics in an unstructured text collection. To further provide a visual summary of text corpora, researchers from the knowledge discovery and visualization community have developed tools and techniques that support the visualization and exploration of large text corpora based on both LSA and probabilistic topic models.
Although probabilistic topic models have demonstrated advantages in interpretability and semantic association, few interactive visualization systems have leveraged such models to support the exploration and analysis of text corpora. The exemplar-based visualization and probabilistic latent semantic visualization approaches project documents onto static two-dimensional (2D) plots while estimating the topics of a text corpus. Although the resulting clusters of documents conform well to the chosen labels, these approaches offer little opportunity for the interactive exploration and analysis of the document clusters. One exception is the time-based visualization system TIARA, which applies the ThemeRiver metaphor to visually summarize a text collection based on its topic content. Through analysis with the TIARA system, users can answer questions such as "What are the major topics in the document corpus?" and "How have the topics evolved over time?"
However, when analyzing large text corpora, there are many other real-world questions that current text analysis visualization systems have difficulty answering. In particular, questions pertaining to the relationships between topics and documents are difficult to answer with existing tools. Such questions include "What are the characteristics of the documents based on their topical distribution?" and "Which documents contain multiple topics at once, and what are those topics?" In the field of science policy, for example, documents with multiple topics could indicate publications that are interdisciplinary (i.e., that cover more than one body of knowledge). Similarly, in the context of social media analysis, a document with multiple topics may signify a unique news article that is relevant to several different hot topics.
To overcome the shortcomings associated with prior methods and systems, and to help users make sense of large text corpora more effectively, the present invention provides a novel visual analytics system that integrates a state-of-the-art probabilistic topic model, Latent Dirichlet Allocation (LDA), with interactive visualization. To describe a corpus of documents, the methods and systems of the present invention first extract a set of semantically meaningful topics using LDA. Unlike most traditional clustering techniques, in which a document is assigned to a single specific cluster, the LDA model accounts for the different topical aspects of each individual document by representing the document as a probability distribution over topics. This permits the effective full-text analysis of larger documents that may contain multiple topics. To highlight this property of the model, the methods and systems of the present invention utilize the parallel coordinate metaphor to present the probabilistic distribution of a document across topics, with each topic serving as one axis. Such a representation allows users to distinguish single-topic documents from multi-topic documents and to see the relative importance of each topic to a document of interest. In addition, since most text corpora are inherently temporal, the systems and methods of the present invention also depict the topic evolution over time.
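By way of illustration only, the following minimal Python sketch shows how per-document topic distributions of the kind produced by LDA might be inspected to separate single-topic from multi-topic documents, laid out as parallel coordinate polylines, and averaged by year to indicate topic evolution. It does not fit an LDA model and is not the implementation of the present invention. The function names, the 0.25 probability threshold, the "diffuse" category, and the toy distributions and years are all assumptions introduced for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-document topic distributions, as a fitted LDA model might
# emit. Each row sums to 1; the values are illustrative, not measured results.
DOC_TOPICS: Dict[str, List[float]] = {
    "doc_a": [0.90, 0.05, 0.05],
    "doc_b": [0.45, 0.50, 0.05],
    "doc_c": [0.10, 0.10, 0.80],
}

# Hypothetical publication years, used only for the temporal summary.
DOC_YEARS: Dict[str, int] = {"doc_a": 2008, "doc_b": 2008, "doc_c": 2009}


def dominant_topics(theta: List[float], threshold: float = 0.25) -> List[int]:
    """Return the indices of topics whose probability meets the threshold.

    The threshold is an assumed tuning parameter, not a value from the source.
    """
    return [k for k, p in enumerate(theta) if p >= threshold]


def classify_document(theta: List[float], threshold: float = 0.25) -> str:
    """Label a document by how many topics are dominant in its distribution."""
    count = len(dominant_topics(theta, threshold))
    if count == 1:
        return "single-topic"
    if count > 1:
        return "multi-topic"
    return "diffuse"  # no topic reaches the threshold


def polyline(theta: List[float]) -> List[Tuple[int, float]]:
    """Return one vertex per topic axis for a parallel coordinate plot.

    Each vertex is (axis index, probability on that axis).
    """
    return list(enumerate(theta))


def topic_trend(
    doc_topics: Dict[str, List[float]], doc_years: Dict[str, int]
) -> Dict[int, List[float]]:
    """Average the topic distributions of all documents sharing a year."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for doc_id, theta in doc_topics.items():
        year = doc_years[doc_id]
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        sums[year] = [s + p for s, p in zip(sums[year], theta)]
        counts[year] += 1
    return {
        year: [s / counts[year] for s in total]
        for year, total in sorted(sums.items())
    }


if __name__ == "__main__":
    for doc_id, theta in DOC_TOPICS.items():
        print(doc_id, classify_document(theta), polyline(theta))
    print(topic_trend(DOC_TOPICS, DOC_YEARS))
```

In this sketch, a document whose polyline peaks on a single axis corresponds to a single-topic document, while a polyline with several peaks corresponds to a multi-topic document. A fixed threshold is only one possible criterion; a measure such as the entropy of the distribution could be used instead.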