The present embodiments relate to text mining a collection of documents. More specifically, the embodiments relate to integrating dictionary management with an associated text mining system.
Text mining is a technology utilized for understanding a large amount of non-structured text data without necessarily reading the entire content of associated documents. More specifically, text mining is a process of analyzing collections of textual materials in order to capture key concepts and themes and uncover hidden relationships and trends without requiring knowledge of precise words or terms used by associated authors to express those concepts. Text mining identifies concepts, patterns, topics, keywords, and other attributes in the data.
Text mining extracts linguistic facets, which are sets of words and phrases representing features of documents. Facets correspond to properties of information elements and capture significant aspects of documents. Facets are derived either from metadata that is already structured, such as pre-existing fields in a database for author, descriptor, language, and format, or from concepts that are extracted from textual content by analyzing the text of an item using entity extraction techniques. For example, facets may include people, places, organizations, sentiment, etc. In a content analytics collection, facets are selected to explore analyzed content and discover patterns, trends, and deviations in data over time. Determining which facets are displayed and what contributes to each facet is a critical design task for successful content mining.
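By way of illustration only, the two sources of facets described above, structured metadata and concepts extracted from text, may be sketched as follows. The sketch assumes a simple dictionary lookup in place of a full entity extraction technique, and the names `FACET_DICTIONARY`, `METADATA_FACETS`, `extract_facets`, and `count_facet_values` are hypothetical and are not drawn from any particular text mining system.

```python
import re
from collections import defaultdict

# Hypothetical facet dictionary: facet name -> surface forms to match in text.
FACET_DICTIONARY = {
    "person": ["Ada Lovelace", "Alan Turing"],
    "place": ["London", "Manchester"],
}

# Hypothetical structured fields that are exposed directly as facets.
METADATA_FACETS = ("author", "language")


def extract_facets(document):
    """Return a mapping of facet name -> sorted facet values for one document."""
    facets = defaultdict(set)

    # Facets derived from metadata that is already structured.
    for field in METADATA_FACETS:
        value = document.get(field)
        if value:
            facets[field].add(value)

    # Facets derived from textual content (dictionary match on word boundaries,
    # standing in for an entity extraction technique).
    text = document.get("text", "")
    for facet, terms in FACET_DICTIONARY.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                facets[facet].add(term)

    return {name: sorted(values) for name, values in facets.items()}


def count_facet_values(documents):
    """Count, per facet value, the number of documents in which it occurs."""
    counts = defaultdict(lambda: defaultdict(int))
    for document in documents:
        for facet, values in extract_facets(document).items():
            for value in values:
                counts[facet][value] += 1
    return {facet: dict(values) for facet, values in counts.items()}
```

In this sketch, the per-facet counts returned by `count_facet_values` correspond to the kind of aggregate that a content analytics collection may present when facets are selected to explore patterns and trends across documents.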
Conventional techniques for text mining utilize an external editor to manage facets and the application of facets to a dictionary associated with the text mining. These techniques are limited in that adding a word found during the text mining process requires rebuilding an associated index in order to check whether the added word functions well with the text mining.