The present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).
Information extraction (IE) is a known element of NLP. IE is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Term extraction is a sub-task of IE. The goal of term extraction is to automatically extract relevant terms from a given text (or “corpus”). Term extraction is used in many NLP tasks and applications, such as question answering, information retrieval, ontology engineering, the semantic web, text summarization, document classification, and clustering. Generally, in term extraction, statistical and machine learning methods may be used to help select relevant terms, for example by scoring candidate terms according to how frequently they occur in the corpus.
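By way of illustration only, one simple statistical approach to selecting relevant terms can be sketched as follows. The sketch ranks words by the log ratio of their smoothed relative frequency in a target text to their relative frequency in a general reference text. The function names (tokenize, extract_terms), the parameters (top_n, min_count), and the sample texts are all hypothetical and chosen for this example; this is one known scoring heuristic, not a description of any particular term extraction system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(target_text, reference_text, top_n=3, min_count=2):
    """Rank words by how much more frequent they are in the target text
    than in the reference text (hypothetical helper, add-one smoothing)."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    vocab_size = len(set(target) | set(reference))

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip words too rare to score reliably
        p_target = (count + 1) / (target_total + vocab_size)
        p_reference = (reference[word] + 1) / (reference_total + vocab_size)
        scores[word] = math.log(p_target / p_reference)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Assumed sample texts: a medical-technology snippet and a financial-news snippet.
medical_text = "the stent was placed in the artery and the stent held the artery open"
news_text = "the market was open and the bank held the rate in the market"

print(extract_terms(medical_text, news_text, top_n=2))  # ['artery', 'stent']
```

In this sketch, a common word such as "the" scores low because it is about equally frequent in both texts, while "stent" and "artery" score high because they occur only in the target text. A practical system would typically also handle multi-word terms and use a much larger reference corpus.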
Domain ontologies are known. A domain ontology represents concepts which belong to a particular “domain” such as an industry or a genre. In fact, multiple domain ontologies may exist within a single domain due to differences in language, intended use of the ontologies, and different perceptions of the domain. However, because domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible with one another. In the context of NLP, term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain for which the NLP software was built (for example, financial news).