A corpus (plural: corpora) is data, or a collection of data, used in linguistics and language processing. A corpus generally comprises a large volume of data, usually text, stored electronically.
Natural language processing (NLP) is a technique that facilitates the exchange of information between humans and data processing systems. For example, one branch of NLP pertains to answering questions about a subject matter based on information available about the subject matter domain.
Information about a domain can take many forms, including but not limited to knowledge repositories and ontologies. Such information can be sourced from any number of data sources. The presenter of the information generally selects the form and content of the information. Generally, before information can be used for NLP, the information has to be transformed into a form that is usable by an NLP engine.
Presently, systems and methods are available to parse unstructured data into a structured form. Presently available systems, such as information extraction systems, are adept at extracting and classifying named entities, such as people, cities, genes, proteins, etc., from a given corpus. Presently available methods can also establish simple semantic relationships between the extracted entities. For example, presently available methods can relate that an extracted person entity ‘lives in’ an extracted city entity, one extracted gene entity ‘inhibits’ another extracted gene entity, and so on.
Presently available systems and methods for information extraction construct “triples” of extracted information. A triple is an [ENTITY <VERB> ENTITY] construct, where one of the entities is a subject specified in the given corpus, and the subject entity performs an act (verb) specified in the corpus on an object entity specified in the given corpus. For example, given a suitable corpus, a presently available system or method can create a triple such as [Barack Obama <president of> US].
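As a non-limiting illustration, the [ENTITY <VERB> ENTITY] construct described above can be sketched as a small data structure. The names `Triple`, `subject`, `verb`, `obj`, and `render` are hypothetical and chosen only for this sketch; they do not correspond to any particular presently available system.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one [ENTITY <VERB> ENTITY] construct."""

    subject: str  # entity that performs the act
    verb: str     # act or relationship stated in the corpus
    obj: str      # entity that the act is performed on

    def render(self) -> str:
        # Format the triple in the bracketed notation used in the text.
        return f"[{self.subject} <{self.verb}> {self.obj}]"


# The example triple from the text, built by hand rather than extracted.
example = Triple(subject="Barack Obama", verb="president of", obj="US")
rendered = example.render()
```

In this sketch the triple is constructed manually; an information extraction system would instead populate the three fields from entities and relationships found in the corpus.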
Presently, the extracted triple artifacts can be stored, indexed, and made available for semantic processing of data and document retrieval. Existing frameworks such as Resource Description Framework (RDF) and Web Ontology Language (OWL) are some examples of presently available methods for representing such triples.
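By way of a hedged example only, storing and indexing triples for later retrieval can be sketched with an in-memory structure. The class `TripleStore` and its methods `add` and `about` are assumptions made for this sketch; the sketch does not use RDF or OWL tooling and is far simpler than a production triple store.

```python
from collections import defaultdict


class TripleStore:
    """Hypothetical in-memory store that indexes triples by entity."""

    def __init__(self):
        self.triples = []                    # all triples, in insertion order
        self.by_subject = defaultdict(list)  # subject entity -> triples
        self.by_object = defaultdict(list)   # object entity -> triples

    def add(self, subject, verb, obj):
        # Store the triple once and register it under both of its entities.
        triple = (subject, verb, obj)
        self.triples.append(triple)
        self.by_subject[subject].append(triple)
        self.by_object[obj].append(triple)

    def about(self, entity):
        # Return triples naming the entity as subject, then as object.
        # A triple whose subject and object are the same entity appears twice.
        return self.by_subject.get(entity, []) + self.by_object.get(entity, [])


store = TripleStore()
store.add("Barack Obama", "president of", "US")
store.add("Barack Obama", "lives in", "Washington")
obama_triples = store.about("Barack Obama")
us_triples = store.about("US")
```

Indexing by both subject and object lets a retrieval step find every stored relationship that mentions a given entity without scanning the full list of triples.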