The Internet has made it possible for people to connect and share information globally in ways previously undreamt of. Social media platforms, for example, enable people on opposite sides of the world to collaborate on ideas, discuss current events, or just share what they had for lunch. As communications become ever more digitized, computerized language processing, such as machine translation, part-of-speech (POS) tagging, and language correction, has become widespread.
Some methods of language processing use trained engines, such as POS tagging engines, machine translation engines, and correction engines. Trained engines can be created using training data. POS tagging engines can be trained using sentences, phrases, or n-grams (collectively herein “snippets”) with associated POS tags. Machine translation engines can be trained using natural language snippet pairs that include identical or similar content in two or more languages. Correction engines can be trained using natural language snippet pairs that include a first language snippet and a subsequent language snippet that is a correction of the first language snippet.
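By way of illustration only, the three kinds of training data described above can be sketched as simple record types. The class names, field names, tag labels, and example snippets below are hypothetical and are chosen for readability; they are not taken from any particular engine or tag set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaggedSnippet:
    """A snippet (sentence, phrase, or n-gram) with one POS tag per token."""
    tokens: List[str]
    tags: List[str]

    def __post_init__(self) -> None:
        # Each token must be paired with exactly one tag.
        if len(self.tokens) != len(self.tags):
            raise ValueError("each token needs exactly one POS tag")


@dataclass
class TranslationPair:
    """Two snippets with identical or similar content in two languages."""
    source_lang: str
    source: str
    target_lang: str
    target: str


@dataclass
class CorrectionPair:
    """A first snippet and a subsequent snippet that corrects it."""
    original: str
    corrected: str


# Hypothetical training items, one of each kind.
pos_example = TaggedSnippet(
    tokens=["I", "love", "lunch"],
    tags=["PRON", "VERB", "NOUN"],  # illustrative tag labels (assumption)
)
translation_example = TranslationPair("en", "good morning", "es", "buenos días")
correction_example = CorrectionPair("I has a cat", "I have a cat")
```

In this sketch, a POS tagging engine would consume a collection of `TaggedSnippet` items, a machine translation engine a collection of `TranslationPair` items, and a correction engine a collection of `CorrectionPair` items.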
Obtaining training data for an engine can be difficult and expensive. In some cases, training data is obtained by human or machine preparation of correct outputs for corresponding inputs. The engine can then be trained using the prepared training data to learn to produce similar results. However, the number of available items that can be used to create training data often far exceeds the amount for which creating correct outputs is feasible. In one case, for example, potential input to a POS tagging engine can be any post to a social media website. One popular social media site receives over 250 million posts per day; thus, manually tagging even 1% of the billions of possible inputs is not feasible.
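The scale of the problem can be made concrete with a short calculation. The sketch below uses the 250 million posts per day figure stated above; the annotator throughput is purely an assumed value for illustration, not a measured one, and the function names are hypothetical.

```python
POSTS_PER_DAY = 250_000_000  # lower bound stated in the text
SAMPLE_PERCENT = 1           # the 1% sample discussed in the text

# ASSUMPTION: a hypothetical throughput for one human annotator.
ASSUMED_POSTS_PER_ANNOTATOR_PER_DAY = 1_000


def posts_to_tag(posts_per_day: int, percent: int) -> int:
    """Number of posts in a given percentage of one day's volume."""
    return posts_per_day * percent // 100


def annotators_needed(posts: int, posts_per_annotator: int) -> int:
    """Ceiling division: annotators required to tag `posts` in one day."""
    return -(-posts // posts_per_annotator)


daily_sample = posts_to_tag(POSTS_PER_DAY, SAMPLE_PERCENT)
staff = annotators_needed(daily_sample, ASSUMED_POSTS_PER_ANNOTATOR_PER_DAY)
print(f"{daily_sample:,} posts per day, about {staff:,} annotators")
```

A 1% sample amounts to 2.5 million posts every day. Under the assumed throughput, keeping pace would require roughly 2,500 full-time annotators; a different throughput changes the staffing figure but not the daily sample size.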
The techniques introduced here may be better understood by referring to the following Detailed Description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements.