Information retrieval, in the context of a corpus of text documents, is the process of searching the content of the documents to obtain information contained within or conveyed by them. A corpus often includes large, unstructured collections or sets of text documents, such as web pages, news documents, broadcast transcripts, electronic books, and other sources of textual information, that are stored within one or more document repositories. Entity relation detection is a form of information extraction in which semantic relations between entities are determined from the text of the corpus, often using natural language processing (NLP) and machine learning techniques. Examples of entities that may be contained within the corpus include persons, organizations, companies, locations, objects, and countries. Examples of relations that may exist between entities include person-affiliation and organization-location.
A number of techniques exist to perform relation extraction from a corpus of text documents, including supervised relation extraction, open information extraction, universal schema, and distant supervision. Supervised relation extraction often requires manual human labeling of entity relationships within existing training data. A significant disadvantage of supervised relation extraction is that it requires a large number of labeled relations within the training data, which are expensive to obtain, and it often does not generalize to relations other than those contained within the training data. Open information extraction identifies sequences of words in sentences that denote relations between two entities. However, open information extraction is computationally intensive and does not scale well to larger document sets.
Universal schema relation extraction combines information from an existing knowledge base with open information extraction techniques to perform relation extraction upon a collection of documents using matrix factorization methodologies. Distant supervision for relation extraction uses an existing semantic knowledge base, consisting of entities and the relations between them, to find sentences containing those entities in a large unlabeled corpus. Linguistic and syntactic features are then extracted from those sentences and used to train a classifier.
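The labeling step of distant supervision described above can be illustrated with a minimal Python sketch. The knowledge base entries, the sample sentences, and the function names `label_sentences` and `words_between` are hypothetical and chosen only for illustration; exact string matching stands in for the entity recognition a practical system would use, and the words between the two entities stand in for richer linguistic and syntactic features.

```python
import re

# Hypothetical toy knowledge base: (entity, entity) -> relation type.
KNOWLEDGE_BASE = {
    ("Jane Doe", "Acme Corp"): "person-affiliation",
    ("Acme Corp", "Berlin"): "organization-location",
}

# Hypothetical unlabeled corpus.
CORPUS = [
    "Jane Doe works as an engineer at Acme Corp.",
    "Acme Corp opened its headquarters in Berlin.",
    "Jane Doe visited Berlin last spring.",
]


def label_sentences(corpus, kb):
    """Label each sentence that mentions both entities of a known pair."""
    examples = []
    for sentence in corpus:
        for (e1, e2), relation in kb.items():
            # Assumption: plain substring matching instead of real entity recognition.
            if e1 in sentence and e2 in sentence:
                examples.append((sentence, e1, e2, relation))
    return examples


def words_between(sentence, e1, e2):
    """Return the lowercased words between the two entity mentions."""
    i1, i2 = sentence.index(e1), sentence.index(e2)
    if i1 < i2:
        span = sentence[i1 + len(e1):i2]
    else:
        span = sentence[i2 + len(e2):i1]
    return re.findall(r"\w+", span.lower())


examples = label_sentences(CORPUS, KNOWLEDGE_BASE)
for sentence, e1, e2, relation in examples:
    print(relation, words_between(sentence, e1, e2))
```

In this sketch the third sentence yields no training example because the pair it mentions does not appear in the knowledge base. The resulting labeled feature lists could then be supplied to any classifier; the choice of classifier is outside the scope of the sketch.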
The illustrative embodiments recognize that existing procedures for extracting relations from a corpus require a large amount of training data, and that those procedures include deep-learning-based implementations that are computationally intensive during the training phase.