A corpus (plural: corpora) is data, or a collection of data, used in linguistics and language processing. A corpus generally comprises a large volume of data, usually text, stored electronically. As an example, one or more scientific articles published in a publication can form a corpus.
Natural language processing (NLP) is a technique that facilitates the exchange of information between humans and data processing systems. For example, one branch of NLP pertains to answering questions about a subject matter based on information available about the subject matter domain.
Domain-specific information comes in a variety of forms and sizes. Consider the example of the information presented in a scientific article in a publication. Such information can range from a few sentences to a few pages in length.
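As a minimal illustrative sketch of the size variation described above, the snippet below treats a corpus as a plain list of strings and measures each document's length in words. The `corpus` contents and the `word_counts` helper are hypothetical, introduced only for this illustration, and splitting on whitespace is an assumed simplification rather than a prescribed tokenization method.

```python
# Hypothetical corpus: each string stands in for the text of one article.
corpus = [
    "Water boils at a lower temperature at high altitude.",
    "Enzymes speed up reactions. They are not consumed. "
    "Each enzyme acts on specific substrates.",
]


def word_counts(documents):
    """Return the number of whitespace-separated words in each document."""
    return [len(doc.split()) for doc in documents]


counts = word_counts(corpus)
# Report the shortest and longest document lengths in the corpus.
print("shortest:", min(counts), "longest:", max(counts))
```

A real corpus would typically hold far longer documents, but the same measurement shows how widely entries in a single domain can differ in size.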
Additionally, different domain-specific information can follow different arrangements or organizations of the information presented therein. Using the example of the scientific article again, some articles follow an organization commonly followed by other articles in the same scientific subject matter domain, whereas other articles adopt an organization that departs from the one commonly used in that scientific domain.
Furthermore, the subject matter of the information can be conveyed in different ways. For example, the scientific article may be drafted such that the information in the article is understandable by persons not trained in the particular scientific domain. Alternatively, the information in the article may be presented in a way that makes it difficult for untrained persons to understand the contents of the article.