Often there is too little relevant data on which to train translation systems for a particular task: only a small amount of in-domain data is available, alongside a large amount of general data (also referred to as out-domain data). Using a relevant subset of the out-domain data, or a combination of the in-domain data and that relevant subset, improves performance over using either corpus individually. However, a large portion of the out-domain data is at best irrelevant and at worst harmful, because it does not accurately represent the target domain.