Data mining broadly seeks to expose patterns and trends in data, and most data mining techniques are sophisticated methods for analyzing relationships among highly formatted data, i.e., numerical data or data with a relatively small, fixed number of possible values. However, much of the knowledge associated with an enterprise consists of textually expressed information, including databases, reports, memos, e-mail, web sites, and external news articles used by managers, market analysts, and researchers.
In contrast to data mining, text data analysis (also sometimes called text mining or text analysis) refers to the analysis of text and may involve such functions as text summarization, information visualization, document classification and clustering (e.g., routing and filtering), and document cross-referencing. Text data analysis may help a knowledge worker find relationships between individual unstructured or semi-structured text documents and semantic patterns across large collections of such documents. For example, U.S. Pat. Nos. 6,611,825 and 6,701,305 describe particular text data analysis methods. Text data analysis is sometimes a supporting aspect of data mining, but the two can be used independently as separate information retrieval methods, or together, such as to provide a data mining application that incorporates the ability to analyze text data.
Once a suitable set of documents and terms has been defined for a document text collection, various document retrieval techniques can be applied to the collection, such as keyword search methods, natural language understanding methods, probabilistic methods, and vector space methods.
Results of document retrieval techniques are typically presented as lists of documents related to the search terms. Often the list is sorted by relevancy to the search terms and provides linked URL references that let the knowledge worker explore the full documents. Such lists are also often supplemented by extracts of the documents containing “hits” of the search terms (text summarization), which help the knowledge worker identify the context and uses of the search terms in the documents. However, a knowledge worker presented with a list of related documents from a document retrieval application has only the relevancy ordering to go on when deciding which documents to explore. If the document retrieval application is supplemented with text data analysis, the knowledge worker gains only the additional benefit of techniques such as content ordering, text summarization, or classification applied to the document list and the extracts. Unfortunately, this typical searching process provides a knowledge worker with limited, and sometimes misleading, information. For example, documents that include one or more high-frequency terms may receive a misleadingly good relevancy score and be elevated in the result list even though those documents include few, if any, of the other terms of the query. Many variations on this general searching process have been proposed or developed, such as weighting various terms and reducing the impact of high-frequency terms. Regardless of improvements to the algorithms or presentation features, the knowledge worker remains limited by the underlying algorithms and, particularly, by the presentation of the document results, which is typically a list of URL references with exemplary document extracts.
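The effect of high-frequency terms on a relevancy score, and the effect of weighting that counters it, can be illustrated with a minimal sketch. The three-document corpus, the query, and all function names below are hypothetical and chosen only for illustration. The weighting shown is a common textbook form of inverse document frequency with log-damped term counts, and is not asserted to be the method of any system or patent cited above.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document A repeats one query term many times
# but contains none of the other query terms.
docs = {
    "A": "data data data data data data report on data",
    "B": "pipeline latency measured for the data pipeline",
    "C": "quarterly data summary",
}
query = ["data", "pipeline", "latency"]

# Whitespace tokenization is an assumption made to keep the sketch short.
tokens = {name: text.lower().split() for name, text in docs.items()}


def raw_score(query_terms, doc_tokens):
    """Relevancy as the total number of query-term occurrences."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)


def idf_weights(all_doc_tokens):
    """Inverse document frequency: a term found in every document gets weight 0."""
    n_docs = len(all_doc_tokens)
    doc_freq = Counter()
    for doc_tokens in all_doc_tokens:
        doc_freq.update(set(doc_tokens))  # count each term once per document
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}


def weighted_score(query_terms, doc_tokens, idf):
    """Relevancy with log-damped term counts scaled by IDF."""
    counts = Counter(doc_tokens)
    return sum(
        (1.0 + math.log(counts[t])) * idf.get(t, 0.0)
        for t in query_terms
        if counts[t] > 0
    )


def rank(score_fn):
    """Return document names ordered from highest to lowest score."""
    return sorted(tokens, key=lambda name: score_fn(tokens[name]), reverse=True)


idf = idf_weights(list(tokens.values()))
raw_ranking = rank(lambda d: raw_score(query, d))
weighted_ranking = rank(lambda d: weighted_score(query, d, idf))

print("raw counts:", raw_ranking)
print("weighted:  ", weighted_ranking)
```

In this toy corpus, raw counting places document A first solely because it repeats "data", while the weighted score places document B first because it matches the rarer query terms. Either way the output is still only an ordered list, which is the presentation limitation noted above.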
Similarly, results of text data analysis are often presented simply through version control that identifies edits and other differences between two documents, which does little to help a knowledge worker analyze the content of one or more documents.