The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Linguistic testers constantly need corpora (a corpus being a body of sentences in a particular human language) to debug, test and assess the linguistic systems that are being developed. These linguistic systems may include features such as word-breakers, spellers, grammar checkers, search engines, machine translation and many more. A typical approach is to license corpora from vendors, for example digital versions of encyclopedias, books, news articles and so forth. However, there are a number of disadvantages to this approach.
For example, it can become extremely expensive to acquire large quantities of corpora. Also, it can be difficult to find a vendor for languages that are spoken by a small number of people, such as Maori or Nepali. Further, the corpora that can be acquired from vendors are typically well edited. As a result, they are not useful for testing linguistic systems such as spellers and grammar checkers, because they are not representative of “real-world” writing scenarios in which an informal user would make many editing mistakes. Another disadvantage of licensed corpora is that they tend to be fixed and limited. Consequently, the linguistic system may become tuned to a specific corpus and may not work well for linguistic phenomena that are not present in that corpus.