Natural language processing (NLP) is a technique that facilitates the exchange of information between humans and data processing systems. For example, one branch of NLP pertains to transforming human-readable content, such as unstructured data, into machine-usable data. NLP engines are presently usable to accept input content, such as a newspaper article or a whitepaper, and produce structured data from that content, such as an outline of the input content, its most significant and least significant parts, a subject, a reference, dependencies within the content, and the like.
Another branch of NLP pertains to answering questions about a subject matter based on information available about the subject matter domain. Information about a domain can take many forms, including but not limited to knowledge repositories and ontologies built from the machine-usable data that the first branch of NLP produces from unstructured data.
A corpus (plural: corpora) is data, or a collection of data, used in linguistics and language processing. A corpus generally comprises a large volume of data, usually text, stored electronically. The corpus comprises unstructured data. Unstructured data is data that does not conform to any particular organization; the position or form of the content in a fragment of unstructured data generally does not contribute to the meaning or significance of the content. A newspaper article, a whitepaper document, notes taken by a researcher, or generally human-readable textual data in a variety of forms are some examples of unstructured data.
Presently, systems and methods are available to parse unstructured data. The parsing recognizes the words present in the unstructured data of a given corpus and extracts those words as tokens for use in the NLP of the corpus.
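As an illustration only, the parsing described above can be sketched in a few lines of Python. The function names `tokenize` and `tokenize_corpus`, the regular expression, and the two-document sample corpus are all hypothetical and chosen for this sketch; presently available parsers generally apply more elaborate, language-aware rules.

```python
import re
from collections import Counter

# Assumed token rule for this sketch: a run of letters or digits,
# optionally followed by an apostrophe suffix (e.g. "wasn't").
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Recognize the words in unstructured text and return them as lowercase tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


def tokenize_corpus(corpus):
    """Tokenize every document in a corpus given as {document_id: text}."""
    return {doc_id: tokenize(text) for doc_id, text in corpus.items()}


# Hypothetical corpus of unstructured data: an article excerpt and a researcher's note.
corpus = {
    "article": "The council approved the budget. The vote wasn't close.",
    "notes": "Budget vote: approved 7-2.",
}

tokens = tokenize_corpus(corpus)

# Token frequencies across the whole corpus, available to later NLP stages.
counts = Counter(token for doc_tokens in tokens.values() for token in doc_tokens)
```

In this sketch the tokens are kept per document so that later processing can still relate each token to its source, while `counts` aggregates them across the corpus.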