This specification is directed generally to text mining. Text mining involves processing text data to make it suitable for one or more applications such as text classification, text clustering, information retrieval, and/or key phrase detection. Text mining attempts to represent unstructured text data in a structured format that can be processed meaningfully for one or more text mining applications.
One text representation is called the “Bag-of-Words” model or the “Vector Space” model. In the Bag-of-Words model, each unique word in a document is given a weight that is based on one or more of three factors: (a) the number of times the word occurred in the various fields of the document, (b) the importance of the field(s) in which the word occurred in the document, and (c) the document frequency of the word in the corpus (e.g., the number or fraction of documents in the corpus that contain the word at least once). For example, a word in a document will receive a higher relative weight in the Bag-of-Words model if it occurs more often and in more important fields of the document (the document centric weight component) and if it occurs rarely in the corpus (the corpus centric weight component). Term Frequency Inverse Document Frequency (TFIDF) is an example of such a representation that combines the document centric weight component with the corpus centric weight component for each unique word in the document.
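The weighting described above can be illustrated with a minimal sketch. It is not a definitive implementation: the field names and field weights, the whitespace tokenizer, the toy corpus, and the choice of log(N / df) as the inverse document frequency are all assumptions made for illustration, and other TFIDF variants exist.

```python
import math
from collections import Counter

# Hypothetical field importances (assumption): a word in the title
# counts three times as much as the same word in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Naive lowercase whitespace tokenizer (assumption for illustration)."""
    return text.lower().split()


def document_frequencies(corpus):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for doc in corpus:
        seen = set()
        for text in doc.values():
            seen.update(tokenize(text))
        df.update(seen)
    return df


def bag_of_words(doc, df, n_docs):
    """Return a {word: weight} mapping for one document of the corpus.

    Document centric component: field-weighted occurrence count.
    Corpus centric component: log(n_docs / df), one common IDF variant.
    Assumes `doc` was included when `df` was computed.
    """
    tf = Counter()
    for field, text in doc.items():
        field_weight = FIELD_WEIGHTS.get(field, 1.0)
        for word in tokenize(text):
            tf[word] += field_weight
    return {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}


# Toy corpus (fabricated purely for illustration).
corpus = [
    {"title": "text mining", "body": "mining text data for structure"},
    {"title": "data storage", "body": "storing data on disk"},
    {"title": "image search", "body": "retrieval of image data"},
]
df = document_frequencies(corpus)
weights = bag_of_words(corpus[0], df, len(corpus))
```

In this toy corpus, "data" appears in every document, so its corpus centric component is zero, while "text" appears in both the title and the body of only the first document and therefore receives a higher weight than a word such as "structure" that appears once in the body.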