The present disclosure is generally directed to data processing and more particularly to suggesting patterns in data. Still more particularly, the present disclosure is directed to techniques for suggesting patterns in unstructured documents.
Unstructured data (unstructured information) usually refers to information that either does not have a predefined data model or is not organized in a predefined manner. Unstructured information is typically heavy on text and may include other data, e.g., dates and numbers. The wide variations in unstructured information make it difficult to interpret using traditional computer programs, as compared to data stored in field form in databases or data that is annotated (e.g., semantically tagged) in documents. It has been estimated that between eighty and ninety percent of all potentially usable business information originates in unstructured form and that unstructured information accounts for seventy to eighty percent of all organizational data.
Techniques such as data mining, natural language processing (NLP), and text analytics have been employed to locate patterns in unstructured information. A common technique for structuring text has involved manually tagging unstructured information with metadata. Unstructured Information Management Architecture (UIMA) provides a common framework for processing unstructured information to extract meaning and create structured data about the unstructured information. Software that creates machine-processable structure usually exploits linguistic structure that is inherent in all forms of human communication. Algorithms can infer inherent structure from text, for example, by examining word morphology, sentence syntax, and other small-scale and large-scale patterns. Unstructured information can be tagged to address ambiguities, and relevancy-based techniques may then be used to facilitate search and discovery. Examples of unstructured data include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text, e.g., the body of an email message, a Web page, or a word processing document.
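As a purely illustrative sketch of how small-scale patterns might be used to impose structure on unstructured text, the following Python example extracts dates, numbers, and words sharing a common suffix from a free-text string. The function name `extract_structure`, the three regular expressions, and the assumed ISO-style date format are hypothetical choices made for this illustration and are not drawn from the disclosure or from UIMA; practical systems would generally rely on far richer linguistic analysis.

```python
import re

# Hypothetical patterns, chosen only for illustration.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")    # assumes ISO-style dates
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\b")      # integers and simple decimals
WORD_PATTERN = re.compile(r"[A-Za-z]+")


def extract_structure(text):
    """Return a small structured record inferred from free text."""
    dates = DATE_PATTERN.findall(text)
    # Blank out dates first so their digits are not also reported as numbers.
    remainder = DATE_PATTERN.sub(" ", text)
    numbers = NUMBER_PATTERN.findall(remainder)
    # Crude morphological cue: words that end in the suffix "ing".
    words = WORD_PATTERN.findall(text)
    ing_suffix_words = [w for w in words if len(w) > 4 and w.lower().endswith("ing")]
    return {
        "dates": dates,
        "numbers": numbers,
        "ing_suffix_words": ing_suffix_words,
    }


if __name__ == "__main__":
    sample = "Meeting on 2021-03-15: shipping 42 units at 19.99 each."
    print(extract_structure(sample))
```

The returned dictionary resembles the field-form data described above, in that each key names a field and each value lists the matches located in the text. Suffix matching of this kind is deliberately naive (it does not, for example, distinguish a noun from a verb form) and serves only to show how a surface pattern may be turned into a tag.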