The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for automatic evaluation and improvement of ontologies for natural language processing tasks.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction, especially with regard to natural language understanding, which enables computers to derive meaning from human or natural language input.
Many NLP systems make use of ontologies to assist in performing NLP tasks. An ontology is a representation of knowledge. A semantic ontology, in the case of NLP, is a representation of knowledge of the relationships between semantic concepts. Because they are created by humans, usually domain experts, ontologies are never a perfect representation of all available knowledge. They are often heavily biased toward a particular subarea of a given domain and often reflect the level of knowledge, or attention to detail, of the author. Ontologies are usually task-inspired, i.e., they have some utility in terms of managing information or managing physical entities, and their design reflects the task for which their terminology is required. Generally speaking, the tasks hitherto targeted have not been focused on the needs of applications for cognitive computing or natural language processing and understanding.
Ontologies are often represented or modeled as hierarchical structures in which portions of knowledge are represented as nodes in a graph and relationships between these portions of knowledge are represented as edges between the nodes. Structures such as taxonomies and trees are limited variations of this model, but generally speaking, ontology structures are highly conducive to being represented as a graph.
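By way of illustration only, the node-and-edge representation described above might be sketched as follows. The class name `OntologyGraph`, its methods, and the concept and relationship labels are hypothetical and are not drawn from any particular ontology; the sketch merely shows that a labeled graph can hold both hierarchical ("is_a") and non-hierarchical ("treats") relationships, which a strict tree or taxonomy cannot.

```python
from collections import defaultdict


class OntologyGraph:
    """Toy directed, labeled graph: concepts are nodes, relationships are edges.

    All names here are illustrative assumptions, not an established API.
    """

    def __init__(self):
        # Maps a source concept to a set of (relation, target concept) pairs.
        self._edges = defaultdict(set)
        self._nodes = set()

    def add_relation(self, source, relation, target):
        """Add a labeled edge, registering both endpoints as nodes."""
        self._nodes.update((source, target))
        self._edges[source].add((relation, target))

    def related(self, source, relation):
        """Return the concepts directly reachable from `source` via `relation`."""
        return sorted(t for r, t in self._edges.get(source, ()) if r == relation)

    def ancestors(self, concept, relation="is_a"):
        """Follow `relation` edges transitively; the `seen` set guards against cycles."""
        seen = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for r, target in self._edges.get(current, ()):
                if r == relation and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


# Hypothetical example data, for illustration only.
graph = OntologyGraph()
graph.add_relation("aspirin", "is_a", "analgesic")
graph.add_relation("analgesic", "is_a", "drug")
graph.add_relation("aspirin", "treats", "headache")

print(graph.ancestors("aspirin"))
print(graph.related("aspirin", "treats"))
```

In this sketch a taxonomy corresponds to the special case in which only "is_a" edges are present and each concept has at most one parent.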
Examples of such semantic ontologies include the Unified Medical Language System (UMLS) semantic network for the medical domain, RXNORM for the drug domain, Foundational Model of Anatomy (FMA) for the human anatomy domain, and the like. The UMLS data asset, for example, consists of a large lexicon (millions) of instance surface forms in conjunction with an ontology of concepts and inter-concept relationships in the medical domain.
Although semantic ontologies provide a mechanism for encoding human knowledge of relationships between semantic concepts, the degree to which these ontologies represent the entire scope of semantics in a target domain is questionable at best. Assessing how closely ontologies match the semantics of natural language text is extremely difficult to accomplish. This is especially true when one considers that many ontologies used in natural language processing (NLP) may not originally have been designed for such NLP tasks.
The problem of assessing ontologies is further exacerbated by the fact that few ontologies offer large enough scope to cater to an entire natural language domain, e.g., a medical domain, a financial domain, or the like. Thus, merging multiple ontologies has become commonplace. The collective semantics of multiple source ontologies can often overlap inconsistently, and negotiating the meaning in the ontologies, so that the associated set of concepts and relationships between concepts in the merged ontology remains balanced, is a critical task. However, the task of merging ontologies is usually reserved for groups of human domain experts who focus on those parts of the ontology in which they specialize. These human domain experts must painstakingly map each individual concept between component data sets while attempting to ensure that semantic integrity is preserved. Coordinating the process of collaborative editing and merging of ontologies is a significant problem.
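As a minimal illustration of what an inconsistent overlap between source ontologies can look like, the following sketch compares two toy ontologies expressed as (source, relation, target) triples and reports concept pairs that both ontologies relate, but with different relationship labels. The function name `find_relation_conflicts` and the example triples are hypothetical. The sketch also assumes that the concepts in the two sources already share identifiers, which is precisely the mapping step that in practice falls to human domain experts.

```python
def find_relation_conflicts(ontology_a, ontology_b):
    """Return concept pairs related in both ontologies under differing labels.

    Each ontology is an iterable of (source, relation, target) triples.
    Assumes concept identifiers are already aligned across the two sources.
    """

    def index_by_pair(triples):
        # Group relation labels by the (source, target) pair they connect.
        by_pair = {}
        for source, relation, target in triples:
            by_pair.setdefault((source, target), set()).add(relation)
        return by_pair

    index_a = index_by_pair(ontology_a)
    index_b = index_by_pair(ontology_b)

    conflicts = {}
    # Only pairs present in both ontologies can overlap inconsistently.
    for pair in index_a.keys() & index_b.keys():
        if index_a[pair] != index_b[pair]:
            conflicts[pair] = (index_a[pair], index_b[pair])
    return conflicts


# Hypothetical source ontologies, for illustration only.
source_a = {
    ("aspirin", "is_a", "analgesic"),
    ("aspirin", "treats", "headache"),
}
source_b = {
    ("aspirin", "is_a", "analgesic"),
    ("aspirin", "may_prevent", "headache"),
}

conflicts = find_relation_conflicts(source_a, source_b)
print(conflicts)
```

Detecting such a conflict is the easy part; deciding which relationship the merged ontology should keep, or whether both should be retained, is the negotiation of meaning described above.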
For example, UMLS is composed of approximately 100 different source ontologies or terminologies, each of which has its own labels, descriptions, and semantic perspective (e.g., FMA for the human body and RXNORM for drugs, which are both part of UMLS). The process of adapting this resource for use in NLP tasks, such as word-sense disambiguation, is problematic for the reasons noted above.