Machine-learning-based applications produce structured results for given inputs, such as phrases to be machine translated, search queries, text and rich media to be summarized, and other data to be processed. The machine learning algorithms used in these applications often rely on human-generated data, some of which may be drawn from public sources such as the web. For example, machine translation applications can rely on human-generated text on the web as a source of parser training data.
Many applications publish their machine-generated output online, thereby contaminating the web and making it a less reliable source of human-generated data. For example, the web contains large amounts of both machine-translated text and human-generated translations, with no convenient way of distinguishing between them. As a result, applications that mine the web in order to learn to simulate human behavior will learn from data contaminated with machine-generated content, and the resulting simulations will exhibit less fidelity to actual human behavior.