Structured data is data that conforms to an organization defined by a specification. In a fragment of structured data, the content derives meaning or significance not only from its literal interpretation, but also from the form, location, and other organization-specific attributes of the fragment.
In contrast, unstructured data is data that does not conform to any particular organization, and the position or form of the content in a fragment of unstructured data generally does not contribute to the meaning or significance of the content. A newspaper article, a whitepaper, notes taken by a researcher, or, more generally, human-readable textual data in a variety of forms are some examples of unstructured data.
Natural language processing (NLP) is a technique that facilitates the exchange of information between humans and data processing systems. One branch of NLP pertains to transforming human-readable content, such as unstructured data, into machine-usable data. For example, NLP engines can presently accept input content such as a newspaper article or a whitepaper and produce structured data from it, such as an outline of the input content, its most significant and least significant parts, a subject, a reference, dependencies within the content, and the like.
Another branch of NLP pertains to answering questions about a subject matter based on information available about the subject matter domain. Information about a domain can take many forms, including but not limited to knowledge repositories and ontologies built from the machine-usable data that the first branch of NLP produces from unstructured data.
A corpus (plural: corpora) is data, or a collection of data, used in linguistics and language processing. A corpus generally comprises a large volume of data, usually text, stored electronically.
Presently, systems and methods are available to parse unstructured data into a structured form. Presently available systems, such as information extraction systems, are adept at extracting and classifying noun entities, such as people, cities, genes, and proteins, from a given corpus of unstructured data. Presently available methods can also establish simple semantic relationships between the extracted entities. For example, presently available methods can relate that an extracted person entity ‘lives in’ an extracted city entity, that one extracted gene entity ‘inhibits’ another extracted gene entity, and so on.
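As a minimal sketch of how such simple relationships might be established, the following Python example matches entity names from a small lookup table against fixed verb phrases. The entity names, the `ENTITY_TYPES` table, the `RELATION_PATTERNS` list, and the `extract_relations` function are all hypothetical and chosen only for illustration; actual information extraction systems generally rely on far more sophisticated entity recognition and parsing.

```python
import re

# Hypothetical lookup table: entity surface form -> entity class.
ENTITY_TYPES = {
    "Alice": "PERSON",
    "Paris": "CITY",
    "GeneA": "GENE",
    "GeneB": "GENE",
}

# Hypothetical relation patterns:
# (required subject class, verb phrase regex, required object class).
RELATION_PATTERNS = [
    ("PERSON", r"lives in", "CITY"),
    ("GENE", r"inhibits", "GENE"),
]


def extract_relations(text):
    """Return (subject, relation, object) tuples found in the text."""
    # Alternation over all known entity names, escaped for regex safety.
    names = "|".join(re.escape(name) for name in ENTITY_TYPES)
    found = []
    for subj_type, verb, obj_type in RELATION_PATTERNS:
        pattern = re.compile(rf"\b({names})\s+({verb})\s+({names})\b")
        for subj, rel, obj in pattern.findall(text):
            # Keep the match only if both entities have the expected class.
            if ENTITY_TYPES[subj] == subj_type and ENTITY_TYPES[obj] == obj_type:
                found.append((subj, rel, obj))
    return found


relations = extract_relations("Alice lives in Paris. GeneA inhibits GeneB.")
print(relations)
```

The class check is what prevents the sketch from relating, say, a gene entity to a city entity through ‘lives in’, even though both names would match the regular expression.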
Presently available systems and methods for information extraction construct “triples” of extracted information. A triple is an [ENTITY <VERB> ENTITY] construct, where one of the entities is a subject specified in the given corpus, and the subject entity performs, or is predicated upon, an act (verb) specified in the corpus on an object entity specified in the given corpus. For example, given a suitable corpus, a presently available system or method can create a triple such as [Obama <president of> US].
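The [ENTITY <VERB> ENTITY] construct can be illustrated with a small data type. The following Python sketch is only one possible representation; the `Triple` class and its field names are assumptions made for illustration and are not taken from any particular extraction system.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """An [ENTITY <VERB> ENTITY] construct (hypothetical representation)."""
    subject: str  # the entity that performs, or is predicated upon, the act
    verb: str     # the act relating the subject to the object
    obj: str      # the entity on which the act is performed

    def __str__(self) -> str:
        # Render in the bracketed notation used in the text.
        return f"[{self.subject} <{self.verb}> {self.obj}]"


example = Triple("Obama", "president of", "US")
print(example)
```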
Presently, the extracted triple artifacts can be stored, indexed, and made available for semantic processing of data and document retrieval. Existing frameworks such as Resource Description Framework (RDF) and Web Ontology Language (OWL) are some examples of presently available methods for representing such triples.
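To make the storing and indexing step concrete, the following Python sketch keeps triples in memory and indexes them by subject, verb, and object so that they can be retrieved by any combination of the three. The `TripleStore` class and its methods are hypothetical and serve only as an illustration; they do not use or reproduce the API of any RDF or OWL tool.

```python
from collections import defaultdict


class TripleStore:
    """A hypothetical in-memory store that indexes triples for retrieval."""

    def __init__(self):
        self._triples = set()
        # One index per position: value -> set of triples containing it.
        self._by_subject = defaultdict(set)
        self._by_verb = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, subject, verb, obj):
        triple = (subject, verb, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)
        self._by_verb[verb].add(triple)
        self._by_object[obj].add(triple)

    def query(self, subject=None, verb=None, obj=None):
        """Return triples matching every position that is not None."""
        candidates = set(self._triples)
        if subject is not None:
            candidates &= self._by_subject.get(subject, set())
        if verb is not None:
            candidates &= self._by_verb.get(verb, set())
        if obj is not None:
            candidates &= self._by_object.get(obj, set())
        return sorted(candidates)


store = TripleStore()
store.add("Obama", "president of", "US")
store.add("Alice", "lives in", "Paris")
print(store.query(verb="president of"))
```

Intersecting per-position indexes keeps lookups simple at the cost of holding each triple in several sets, which is acceptable for a sketch but not necessarily for a large corpus.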