A corpus (plural: corpora) is data, or a collection of data, used in linguistics and language processing. A corpus generally comprises a large volume of data, usually text, stored electronically. Hereinafter, unless expressly distinguished where used, a document comprises any data that is available as text, can be converted to text, or is recognizable as text, in some natural language, for the purposes of Natural Language Processing (NLP).
A natural language (NL) is a written or spoken language having a form that is employed by humans primarily for communicating with other humans or with systems having a natural language interface. Thus, a document contemplated herein can be text, audio data that can be transcribed into text, video data from which a textual description or transcription is possible, or some combination thereof.
NLP is a technique that facilitates the exchange of information between humans and data processing systems. For example, one branch of NLP pertains to transforming human-readable or human-understandable content from a document into machine-usable data. For instance, NLP engines are presently usable to accept input content, such as a newspaper article or human speech, and to produce structured data from that content, such as an outline of the input content, its most significant and least significant parts, a subject, a reference, dependencies within the content, and the like.
NLP employs techniques such as shallow parsing and deep parsing. Shallow parsing is a term used to describe lexical parsing of given content using NLP. For example, given a sentence, the process by which an NLP engine determines what the sentence semantically means according to the grammar of the language of the sentence is lexical parsing, that is, shallow parsing. In contrast, deep parsing is a process of recognizing relationships, predicates, or dependencies, and thereby extracting new, hidden, indirect, or detailed structural information from distant content portions in a given document or corpus.
Generally, not all documents are equally important, relevant, or useful for a given purpose, nor do they contain equally useful information. Document ranking is a known process of arranging documents in some order of relevance according to a given condition. One known method of document ranking arranges the documents based on the frequency of occurrence of a given word or phrase therein. For example, a search query for “zebra” might return one hundred documents. These one hundred documents are ranked according to the number of times the word “zebra” appears in each of them. The highest-ranking document has the most occurrences of the word, and the lowest-ranking document has the fewest.
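The frequency-based ranking described above can be illustrated with a minimal Python sketch. The function name `rank_by_term_frequency`, the simple regular-expression tokenizer, and the three sample documents are assumptions made for illustration only; they are not drawn from any particular search system.

```python
import re
from collections import Counter


def rank_by_term_frequency(documents, term):
    """Return (doc_id, count) pairs, highest count of `term` first.

    `documents` maps a hypothetical document identifier to its text.
    Matching is case-insensitive and on whole tokens only.
    """
    term = term.lower()
    scored = []
    for doc_id, text in documents.items():
        # Naive tokenizer (an assumption): runs of letters, digits, apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        scored.append((doc_id, Counter(tokens)[term]))
    # sorted() is stable, so documents with equal counts keep input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents.
documents = {
    "doc_a": "The zebra is a striped animal. A zebra lives in herds.",
    "doc_b": "Horses and zebra share a common ancestor.",
    "doc_c": "Zebra, zebra, zebra: the children chanted at the zoo.",
}

ranking = rank_by_term_frequency(documents, "zebra")
print(ranking)
```

In this sketch, `doc_c` ranks first with three occurrences, followed by `doc_a` with two and `doc_b` with one. Raw counts favor longer documents, so a practical system would likely normalize by document length or use a weighting scheme; that refinement is outside what the passage describes.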
Another known method of document ranking orders the documents such that the order is indicative of a sentiment expressed in the documents. For example, a search query for “favorable impression of Florida vacation” might find ten documents that each discuss vacation experiences in Florida. These ten documents are ranked by the degree of positive sentiment expressed towards the experience of vacationing in Florida.
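Sentiment-based ranking can be sketched in the same way. The example below assumes a toy word-list approach: the `POSITIVE` and `NEGATIVE` sets, the scoring rule (positive words minus negative words), and the sample reviews are all hypothetical and stand in for whatever sentiment model an actual system would use.

```python
import re

# Tiny illustrative word lists (assumptions, not a real sentiment lexicon).
POSITIVE = {"wonderful", "relaxing", "great", "loved", "enjoyed"}
NEGATIVE = {"terrible", "crowded", "awful", "hated", "disappointing"}


def sentiment_score(text):
    """Count positive words minus negative words in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def rank_by_sentiment(documents):
    """Return (doc_id, score) pairs, most positive first."""
    scored = [(doc_id, sentiment_score(text)) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents that each discuss a Florida vacation.
reviews = {
    "review_1": "We loved our Florida vacation. The beaches were wonderful and relaxing.",
    "review_2": "Florida was crowded and the hotel was disappointing.",
    "review_3": "A great trip overall, although the theme park was crowded.",
}

sentiment_ranking = rank_by_sentiment(reviews)
print(sentiment_ranking)
```

Here `review_1` scores 3, `review_3` scores 0, and `review_2` scores -2, so the documents are ordered from most to least favorable. A word-list score ignores negation and context ("not great" still counts as positive), which is one reason it should be read only as an illustration of ordering by sentiment.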