Language model training data for a machine translation system can be very large, for example trillions of words of web data and billions of words of news data. A language model can be associated with translation development and evaluation data, and that development and evaluation data can itself be contained in the training data. Any such overlap needs to be removed from the language model training data to avoid training poor feature weights and/or incorrectly estimating quality on truly unseen data.
Some machine translation systems cover a large number of language pairs, resulting in a large number of test data sets. Some of the test data can include frequent sentences taken from the web. Sometimes whole documents cannot be filtered and, instead, filtering needs to be performed at the level of single sentences. On the other hand, filtering single sentences might remove too much data, especially when the sentences are short. For example, removing the sentence “next page” from a language model can be a disadvantage if that sentence is expected to be part of the data handled by a live system. It can also be a disadvantage if updating or expanding the test data for one language pair affects other language pairs.
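The trade-off between sentence-level filtering and over-removal of short sentences can be illustrated with a minimal sketch. It is not a method stated in this description. The function name `filter_training_sentences`, the exact-match comparison after lowercasing, and the `min_tokens` threshold are all assumptions made for illustration only.

```python
def filter_training_sentences(training_sentences, test_sentences, min_tokens=3):
    """Drop training sentences that exactly match a test sentence.

    Assumption: sentences with fewer than `min_tokens` whitespace-separated
    tokens are always kept, so that frequent short sentences such as
    "next page" remain in the language model training data.
    """
    # Normalize test sentences once so membership checks are cheap.
    held_out = {s.strip().lower() for s in test_sentences}

    kept = []
    for sentence in training_sentences:
        normalized = sentence.strip().lower()
        is_short = len(normalized.split()) < min_tokens
        # Remove only sentences that overlap with test data and are not short.
        if normalized in held_out and not is_short:
            continue
        kept.append(sentence)
    return kept


# Hypothetical example data.
training = [
    "next page",
    "The committee approved the budget on Tuesday.",
    "Click here to read more about the offer.",
]
test = ["next page", "The committee approved the budget on Tuesday."]

filtered = filter_training_sentences(training, test)
```

With these inputs, the long sentence that overlaps with the test data is removed, while “next page” is kept because it falls below the assumed length threshold. Setting `min_tokens=0` would remove it as well, which shows the over-removal concern described above.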