One of the fundamental problems in natural language processing (NLP) is learning meaning at the word, phrase, sentence, or discourse level. Often, one would like to learn meaning, or semantics, in a data-driven fashion, possibly in an unsupervised manner. Deriving meaning from linguistic units is of immense benefit in tasks such as information retrieval, concept-based machine translation, and deeper analysis of texts for various business-related decision making or troubleshooting. More recently, semantics has become important for gleaning meaning from Big Data such as customer reviews, tweets, and user comments.
A popular way to infer semantics in an unsupervised manner is to model a document as a mixture of latent topics. Several latent semantic analysis schemes have been used successfully to infer the high-level meaning of documents through a set of representative words (topics). However, the notion of a document has changed immensely over the last decade. Users have embraced new communication and information media such as short messaging service (SMS), Twitter®, Facebook® posts, and user comments on news pages and blogs in place of emails and conventional news websites. Document sizes have shrunk from a few hundred words to a few hundred characters, while the amount of data has increased exponentially.
There is therefore a need in the art for a technique to create an unsupervised topic model for short texts. There is furthermore a need in the art for a reliable topic model for large numbers of short texts.
There is additionally a need in the art for a technique to reliably identify latent topics in a topic model for large numbers of short texts. The need extends to a technique that is language-agnostic.
There is furthermore a need in the art for an unsupervised phrase induction scheme that uses the minimum description length principle to automatically learn phrases.