The present disclosure relates generally to language processing, and more specifically, to computationally forming a domain-specific computational lexicon from an unstructured domain glossary.
Parsers are a fundamental stepping stone to many different types of natural-language processing (NLP) applications and tasks. One type of system that relies upon a parser is a question-answering computer system. Question-answering computer systems typically employ NLP to return the highest-scoring answer to a question. NLP techniques, which are also referred to as text analytics, infer the meaning of terms and phrases by using a parser to analyze syntax, context, and usage.
Human language is so complex, variable (there are many different ways to express the same meaning), and polysemous (the same word or phrase may mean many things in different contexts) that NLP presents an enormous technical challenge. Decades of research have led to many specialized techniques, each operating on language at a different level and on a different isolated aspect of the language-understanding task. These techniques include, for example, using a natural-language (NL) parser to parse data sources in order to build a knowledge base and to analyze questions, candidate answers, and supporting evidence in the context of answering questions. For an NL parser to operate, a lexicon must be developed that associates syntactic and semantic information with its entries. Thus, for an NL parser to be effective, a well-formed and well-populated (broad and rich) lexicon is needed.
A parser needs, minimally, a lexicon for its common, domain-independent vocabulary, such as the common words in the English language. The parser can then be adapted, by extending its lexicon, to include domain-specific terms, such as diseases, symptoms, and treatments (for a medical domain), or chemical compounds and processes (for a natural science domain), and so forth. A domain-specific lexicon can include a collection of terms from a glossary; however, such a lexicon without syntactic and/or semantic information may not provide a sufficient level of information to drive a parser. A computational lexicon is a lexicon that includes syntactic and/or semantic information. One approach to forming a domain-specific lexicon is to analyze documents to identify technical terminology, which can include single-word or multi-word sequences augmented with little (if any) syntactic and/or semantic information. This may be sufficient for compiling glossaries or mapping out new domains for NLP applications. However, such terms are, a priori, not informative enough to be useful to an NL parser. This is because external lexical resources (such as pre-compiled glossaries, or simply lists of expressions culled from domain-specific texts) typically have little or no lexical information for individual terms. Thus, a large amount of manual annotation may be needed to create a domain-specific computational lexicon suitable for use by an NL parser.
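The distinction between a bare glossary term and a computational lexicon entry can be illustrated with a minimal Python sketch. The sketch is offered only as an illustration under stated assumptions and is not a description of any particular parser's lexicon format: the class name `LexiconEntry`, its fields, the helper `entries_from_glossary`, the readiness rule, and the two example terms are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexiconEntry:
    """Hypothetical computational-lexicon entry: a term plus lexical information."""
    term: str
    part_of_speech: Optional[str] = None   # syntactic information, e.g. "noun"
    head_word: Optional[str] = None        # head of a multi-word term
    semantic_types: List[str] = field(default_factory=list)  # e.g. ["disease"]

    def is_parser_ready(self) -> bool:
        # Assumption for this sketch only: an entry is usable by a parser
        # once it carries at least a part of speech.
        return self.part_of_speech is not None


def entries_from_glossary(terms: List[str]) -> Dict[str, LexiconEntry]:
    """Wrap bare glossary terms as entries that carry no lexical information yet."""
    return {term: LexiconEntry(term=term) for term in terms}


# A glossary supplies terms only (illustrative medical-domain examples).
glossary = ["myocardial infarction", "aspirin"]
lexicon = entries_from_glossary(glossary)

# Manual annotation of a single term, standing in for the annotation effort
# that a full domain-specific computational lexicon would require.
lexicon["myocardial infarction"] = LexiconEntry(
    term="myocardial infarction",
    part_of_speech="noun",
    head_word="infarction",
    semantic_types=["disease"],
)

# Terms that still lack the information this sketch treats as required.
needs_annotation = [t for t, e in lexicon.items() if not e.is_parser_ready()]
```

In this sketch, every term taken directly from the glossary starts with empty syntactic and semantic fields, so each one must be annotated individually before it meets the readiness rule, which mirrors the manual annotation burden described above.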