This specification is directed generally to text mining. Text mining involves the processing of text data to make the text data suitable for one or more applications such as, for example, text classification, text clustering, information retrieval, and/or key phrase detection. Text mining attempts to represent unstructured text data in a structured format that can be processed meaningfully for one or more text mining applications.
One text representation is called the “Bag-of-Words” model or the “Vector Space” model. In the Bag-of-Words model and other models, it may be desirable to identify phrases in one or more text mining applications, in addition to or as an alternative to identifying individual words. One approach to identifying phrases uses n-gram based language models, which compute raw counts of all n-grams of sizes n = 1, 2, ..., N and convert those raw counts into probabilities.
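The n-gram counting described above can be illustrated with a minimal sketch. The function names `ngram_counts` and `ngram_probabilities`, the whitespace tokenization, and the sample sentence are assumptions made for illustration only. The text does not state how raw counts are converted into probabilities; the sketch assumes a simple relative frequency among n-grams of the same size, whereas a practical language model would typically estimate conditional probabilities and apply smoothing.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of sizes 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def ngram_probabilities(counts):
    """Convert raw counts to relative frequencies within each n-gram size.

    Assumption: probability = count / total count of n-grams of that size.
    """
    probs = {}
    for n, counter in counts.items():
        total = sum(counter.values())
        if total:
            probs[n] = {gram: c / total for gram, c in counter.items()}
        else:
            probs[n] = {}
    return probs


# Hypothetical input: a single sentence split on whitespace.
tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, max_n=2)
probs = ngram_probabilities(counts)
```

Here `counts[1]` holds the unigram counts (equivalent to a Bag-of-Words representation of the sentence) and `counts[2]` holds the bigram counts, so the same pass yields both individual words and candidate two-word phrases.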