Several methods exist for discovering topics within a corpus of documents. As an example, one could imagine applying such methods to all of the newspaper articles written in the United States during the 1960s and early 1970s. In this example, the articles serve as the documents and, collectively, they form the corpus. One would not be surprised to see such methods discover the Vietnam War, the Watergate scandal, the civil rights movement, etc., as the pertinent topics for such a corpus.
The problem with conventional methods of automatic topic discovery is that they are too slow to be of use for near real-time applications, such as analyzing social media posts to determine “hot” topics on the fly. The processing time required depends on the number of words in the lexicon, the number of documents in the corpus, and the number of desired topics. In particular, the dimensionality of the computational problem involved in automatic topic discovery is proportional to the size of the lexicon, which tends to be quite large (e.g., thousands of words). Processing times of hours, days, or even weeks to automatically discover topics are not uncommon.
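To make the dimensionality point concrete, the following minimal sketch assumes a simple bag-of-words representation, which the text above does not specify. The function name `build_doc_term_matrix` and the three toy documents are hypothetical and serve only as an illustration. Each document becomes a row of word counts, so the row length equals the lexicon size.

```python
from collections import Counter


def build_doc_term_matrix(documents):
    """Return (lexicon, matrix), where matrix[i][j] counts word j in document i.

    Assumes whitespace tokenization and lowercasing; real systems typically
    use more careful tokenization.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # The lexicon is the set of distinct words across the whole corpus.
    lexicon = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per lexicon word, so each row has len(lexicon) entries.
        matrix.append([counts[word] for word in lexicon])
    return lexicon, matrix


# Hypothetical toy corpus, for illustration only.
docs = [
    "troops deployed in vietnam",
    "civil rights march in washington",
    "vietnam protest march",
]
lexicon, matrix = build_doc_term_matrix(docs)
print(len(matrix), "documents x", len(lexicon), "lexicon words")
```

Even in this toy corpus every document is represented by one value per lexicon word. With a lexicon of thousands of words, each document becomes a vector with thousands of dimensions, which is one way to see why processing time grows with lexicon size.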