The embodiments described herein relate generally to language processing systems and, more particularly, to techniques for extracting ontological information from a body of text.
Information Extraction (IE) is the science of automating the extraction of information from unstructured or semi-structured documents. Known information extraction systems rely on natural language processing (NLP), and are traditionally implemented as a pipeline of special-purpose processing modules, each targeting the extraction of a particular kind of information. A major drawback of such an approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be reapplied from scratch to the entire body of text, even though only a small part of the text might be affected.
Other known information extraction systems rely on keyword search, which uses a set of keywords and a search mechanism to locate information in text documents. However, the search mechanism identifies specific words that appear in the documents without taking into account the meaning of those words. Such word-based approaches also ignore the syntactic and grammatical information present in the sentence as a whole.
Information extraction systems also rely on named-entity recognition. In analyzing documents, information extraction systems need to recognize and classify individual elements. Some known approaches to named-entity recognition involve use of a dictionary, that is, a list of known individual elements and their pseudonyms. However, dictionaries are not always available for specific subject matter domains, such as for specific engine components or engine failure symptoms. Creating dictionaries that account for all possible syntactic variations of technical concepts in a given subject matter domain can be a labor-intensive task. Another known approach is supervised: models are generated from manually annotated data. However, manually annotating the data is also a labor-intensive task.
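By way of illustration only, the limitation of dictionary-based named-entity recognition noted above can be sketched as follows. The dictionary entries, the function name find_entities, and the sample report are hypothetical and are not drawn from any particular known system; the sketch assumes exact, case-insensitive matching of listed surface forms.

```python
import re

# Hypothetical dictionary: canonical element -> known surface forms.
# The entries are illustrative assumptions, not taken from a real system.
ENGINE_DICTIONARY = {
    "fuel pump": ["fuel pump", "fuel-pump"],
    "compressor stall": ["compressor stall", "compressor surge"],
}


def find_entities(text, dictionary):
    """Return sorted (canonical_name, matched_text) pairs found by
    exact, case-insensitive lookup of each listed surface form."""
    found = []
    for canonical, variants in dictionary.items():
        for variant in variants:
            # Whole-word match of the listed form only; no stemming or parsing.
            pattern = r"\b" + re.escape(variant) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((canonical, match.group(0)))
    return sorted(found)


report = "Fuel pump replaced after the compressor stalled; pump for fuel inspected."
print(find_entities(report, ENGINE_DICTIONARY))
```

In this sketch only the literal form "Fuel pump" is recognized. The variations "compressor stalled" and "pump for fuel" are missed because they are not listed, which is why a dictionary intended to cover such variations grows large and is laborious to build.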