Natural language text has been a fundamental means of representing human knowledge and understanding. In an increasingly digital world, readily accessible text is growing exponentially, and the web contains vast repositories of unstructured text. Often, finding relevant information is challenging. Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured text. The extracted information may be used in a variety of semantic web applications such as authoring ontologies via the web ontology language (OWL), modeling information using the resource description framework (RDF), question answering (QA), and so forth.
IE systems typically extract a set of subject-verb-object (SVO) triples for use in knowledge gathering and integration; the knowledge is thus represented in triple format. In most cases, extracting the set of triples involves processing natural language text by means of natural language processing (NLP). The processing includes extracting tokens (i.e., words or phrases), identifying a part of speech (PoS) for each token, and chunking the PoS tokens. Chunking is typically used in shallow parsing of text and groups PoS tokens into sequences of syntactically related words, such as noun phrases, verb phrases, adjective phrases, and so forth. As will be appreciated, the availability of a large set of NLP tools such as OpenNLP has made it possible to PoS tag and chunk the vast amounts of unstructured text available on the Internet. Additionally, projects like ClueWeb, OpenIE, and Wikipedia provide corpora of text data which may be used for ontological engineering. A knowledge graph representing an unstructured text source may provide additional logical and inference functionality. Further, it should be noted that PoS tag data provides better language inference and understanding as compared to a bag-of-words approach over web-scale unstructured data.
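The tag-then-chunk step described above can be sketched in a few lines. In this minimal illustration, a hand-tagged sentence (using Penn Treebank tags) stands in for the output of a real PoS tagger such as OpenNLP, and a simple grouping rule stands in for a trained chunker model; both the tag-to-chunk mapping and the example sentence are illustrative assumptions, not any particular tool's behavior.

```python
# A hand-tagged sentence stands in for the output of a real PoS tagger
# (e.g., OpenNLP); tags follow the Penn Treebank convention.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

def chunk_type(tag):
    """Map a PoS tag to a coarse chunk label (illustrative rule)."""
    if tag.startswith(("DT", "JJ", "NN", "PRP")):
        return "NP"   # determiners, adjectives, nouns -> noun phrase
    if tag.startswith("VB"):
        return "VP"   # verbs -> verb phrase
    if tag in ("IN", "TO"):
        return "PP"   # prepositions -> prepositional phrase
    return "O"        # outside any chunk

def chunk(tagged_tokens):
    """Shallow parse: group adjacent tokens whose tags map to the
    same chunk type into one phrase."""
    chunks, current_type, current_words = [], None, []
    for word, tag in tagged_tokens:
        ctype = chunk_type(tag)
        if ctype != current_type:
            if current_words:
                chunks.append((current_type, " ".join(current_words)))
            current_type, current_words = ctype, [word]
        else:
            current_words.append(word)
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks

print(chunk(tagged))
# -> [('NP', 'The quick fox'), ('VP', 'jumped'), ('PP', 'over'), ('NP', 'the lazy dog')]
```

Note that a real chunker is learned from labeled data and handles far more tag patterns; the point here is only the shape of the output, a sequence of labeled phrases, which the extraction stage below consumes.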
There are various IE techniques to extract SVO triples from unstructured data. For example, the DBpedia extractor generates a set of triples from Wikipedia using annotated field information in Wikipedia. Further, the ClausIE system uses a dependency parser to output a set of word triples. Further, OpenIE systems (e.g., REVERB, R2A2, etc.) use a PoS tagger and chunker model followed by a rule-based engine to output a set of word triples. The OpenNLP chunker model is used to chunk noun phrases (NP), verb phrases (VP), and prepositional phrases (PP) from the PoS-tagged text received from the PoS tagger. The chunker data (i.e., chunked text) is then fed to a rule-based relationship extractor and a rule-based argument extractor that employ a set of rules to extract a set of triples. An extracted triple consists of left and right argument phrases from the input sentence and a relation phrase (predicate) from the input sentence, and is in the format (argument 1; relation; argument 2). The relation phrase expresses a relation between the argument phrases. REVERB uses shallow syntactic processing to identify relation phrases that begin with a verb and occur between the argument phrases.
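The extraction step above can be sketched as a simple rule over the chunked text: take as the relation phrase a verb-led chunk (VP, optionally merged with an immediately following preposition) that sits between two noun-phrase arguments. This is a highly simplified assumption-laden sketch in the spirit of REVERB's verb-between-arguments rule, not REVERB's actual rule set; the chunk labels and example sentence are illustrative.

```python
# Chunked input as produced by a shallow parser (illustrative example).
chunks = [("NP", "The quick fox"), ("VP", "jumped"), ("PP", "over"),
          ("NP", "the lazy dog")]

def extract_triples(chunks):
    """Extract (argument 1; relation; argument 2) triples: for each VP,
    take the nearest NP on its left and right as arguments, and absorb
    any trailing PP into the relation phrase."""
    triples = []
    for i, (label, phrase) in enumerate(chunks):
        if label != "VP":
            continue
        # Left argument: nearest NP before the verb phrase.
        left = next((p for l, p in reversed(chunks[:i]) if l == "NP"), None)
        # Relation phrase: the VP plus any immediately following PPs.
        relation, j = phrase, i + 1
        while j < len(chunks) and chunks[j][0] == "PP":
            relation += " " + chunks[j][1]
            j += 1
        # Right argument: nearest NP after the relation phrase.
        right = next((p for l, p in chunks[j:] if l == "NP"), None)
        if left and right:
            triples.append((left, relation, right))
    return triples

print(extract_triples(chunks))
# -> [('The quick fox', 'jumped over', 'the lazy dog')]
```

A production rule-based extractor layers many such heuristics (lexical constraints, frequency filters, confidence scoring) on top of this basic pattern, which is precisely the manually maintained rule set whose drawbacks are noted below.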
Thus, the current techniques employ PoS data, chunker data, and parser data as input, along with a set of heuristics, to determine a set of triples, thereby incurring additional overhead in extraction efficiency. Additionally, a rule-based engine is typically employed when there is a lack of labeled data. Rule-based systems have the drawback that they are designed around a fixed set of heuristics; further, the set of rules has to be developed and updated manually.