Millions of digital documents are created and stored each day, ranging from essays to financial transaction logs to personal health histories to patent applications, and covering thousands of other subjects besides. Many organizations have access to large quantities of documents created and stored for a variety of purposes. Unfortunately, these documents are not always categorized in a useful and sensible manner. Finding documents related to a particular topic may be difficult when searching a data store of thousands of documents that may not be indexed, categorized, or summarized.
Topic mining is the extraction of topics from an unstructured data artifact such as a document. Because a document is typically a loosely structured sequence of words and other symbols, the problem is non-trivial. Many traditional topic mining systems may be based on coarse-grained techniques that must operate on a large number of documents in order to group them into clusters, where each cluster represents a particular latent topic. This process is expensive; moreover, traditional systems may not assign human-readable topic names to the clusters. Accordingly, the instant disclosure identifies and addresses a need for additional and improved systems and methods for determining topics of data artifacts.