This invention generally relates to term recognition, and more specifically, to recognizing new terms of a specific type based on the structure of known terms for that type.
Recognizing occurrences of domain-specific terms and their types is important for many text processing applications. The problem is difficult, particularly in domains such as medicine, where domain experts generate rich new terminology on a daily basis.
In spite of the large interest in statistical term recognition in Natural Language Processing (NLP), state-of-the-art approaches for term recognition in the medical domain are still based on dictionary lookup with some heuristics for partial mapping. Indeed, very large terminological resources, such as the Unified Medical Language System (UMLS), have been developed in the medical domain. Dictionary lookup persists because medical terminology cannot be identified by looking at superficial features only, such as capitalization of words, prefixes and suffixes. Disease names, symptoms and most medical terms are not proper names, so they are not capitalized. In addition, they usually have a rather complex internal structure and are composed of many words. Furthermore, distributional similarity metrics, i.e., recognition approaches based on the analysis of the local context in which the term occurs, work well when applied to single words or very frequent words, which is not the case for most of the medical terms of interest.
In the context of research on adapting a question answering system to the medical domain, the term recognition problem arises in many places. For example, recognizing names of diseases, symptoms and treatments is necessary to answer most of the Doctor's Dilemma™ questions (American College of Physicians, 2012), an evaluation benchmark we used to measure the ability to answer medical questions. To assess the validity of the answer “HFE hereditary hemochromatosis” with respect to the question “Metabolic condition that can set off airport metal detector”, it is important to know that the answer is a type of metabolic condition. One way to address this problem is to use medical lexica in which terms are associated with semantic types, and then check whether the type of the candidate answer matches the type required by the question.
UMLS, however, is far from complete. Many disease names (especially multi-word names) are not recognized by dictionary lookup approaches. In many cases, specific terms are missing (e.g., “HFE hereditary hemochromatosis” in the question above).
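The lexicon-based type check and its coverage gap can be illustrated with a minimal Python sketch. It is not the implementation of any actual system: the lexicon entries, the type labels, and the function names (TOY_LEXICON, lookup_type, type_matches) are hypothetical and are not taken from UMLS.

```python
from typing import Dict, Optional

# Toy lexicon mapping terms to semantic types.
# Assumption: these entries and type labels are illustrative only;
# they are not drawn from UMLS or any other real resource.
TOY_LEXICON: Dict[str, str] = {
    "hereditary hemochromatosis": "metabolic condition",
    "chest pain": "symptom",
    "aspirin": "treatment",
}


def normalize(term: str) -> str:
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def lookup_type(term: str, lexicon: Dict[str, str] = TOY_LEXICON) -> Optional[str]:
    """Return the semantic type of an exactly matching term, or None."""
    return lexicon.get(normalize(term))


def type_matches(candidate: str, required_type: str,
                 lexicon: Dict[str, str] = TOY_LEXICON) -> bool:
    """Check whether the candidate answer has the type the question requires."""
    return lookup_type(candidate, lexicon) == required_type


if __name__ == "__main__":
    # A term present in the lexicon is typed correctly.
    print(type_matches("Hereditary hemochromatosis", "metabolic condition"))
    # A more specific variant is absent, so exact lookup cannot type it.
    print(type_matches("HFE hereditary hemochromatosis", "metabolic condition"))
```

In this sketch the first check succeeds and the second fails, because exact lookup has no entry for the more specific multi-word term even though a closely related term is present.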