1. Technical Field
The present inventive concept relates generally to computational linguistics, and embodiments thereof relate more specifically to information extraction from documents.
2. Discussion of the Related Art
Information extraction (IE) is a process or set of processes by which structured information is extracted from unstructured or semi-structured machine-readable documents. Modern IE systems and tools typically employ elements of natural language processing (NLP) to, in the case of text data, identify linguistic elements in a collection of documents, analyze the identified elements against a set of rules, and extract meaningful information from the analysis results. The extracted information is used in, for example, database/knowledgebase querying, language translation, business analytics, and numerous other applications. Construction of such an NLP mechanism typically involves lengthy development times as well as the services, and corresponding costs, of human NLP experts.
Many IE tools, such as text analytic annotators, are universal or generic and can be used across domains of discourse where little more than common, everyday language is expected. However, in domains where the grammar and/or vocabulary becomes more specialized, such as healthcare, law, finance, scientific research and others, constructing reusable or universally applicable annotators becomes more challenging. The context in which field-specific terminology is used, the manner in which an organization's internal nomenclature extends and/or deviates from standard terminology, the structuring of text in different documents, etc. can vary substantially across organizations operating within the same domain. Typically, these issues have been addressed by the aforementioned NLP experts through a process of studying a sample set of documents, constructing, based on that study, abstractions by which field- and organization-specific terminology usage can be resolved, and then manually generating and/or tuning processor code from which processor-executable text analytic annotators are produced. Not only does this process require expert personnel, it is also labor-intensive and results in annotators that have only limited use, if any, outside of the organization for which they were tuned.