The Internet has made it possible for people to connect and share information globally in ways previously undreamt of. Social media platforms, for example, enable people on opposite sides of the world to collaborate on ideas, discuss current events, or just share what they had for lunch. In the past, this resource has been largely limited to communication between users who share a common natural language (“language”). In addition, users have only been able to consume content that is in their language, or for which a content provider is able to determine an appropriate translation.
While communication across the many different natural languages used around the world is a particular challenge, several machine translation engines have attempted to address this concern. Machine translation engines enable a user to select or provide a content item (e.g., a message from an acquaintance) and quickly receive a translation of the content item. In some cases, machine translation engines can include one or more “translation models” and one or more “language models.” A translation model can be created from training data that includes identical or similar content in both a source language and an output language; this data is used to generate mappings of words or phrases in the source language to words or phrases in the output language. A language model can be created from training data that includes a corpus of data in the output language; this data is used to generate probability distributions of words or phrases that are likely to go together in the output language.
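The two kinds of models described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a bigram language model, and a translation model built from simple word co-occurrence counts; these simplifications, along with the function names `train_language_model` and `train_translation_model`, are chosen for illustration only and are not the method of any particular machine translation engine.

```python
from collections import Counter, defaultdict


def train_language_model(corpus):
    """Estimate bigram probabilities P(next word | previous word)
    from sentences in the output language (illustrative assumption)."""
    pair_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {word: n / sum(counts.values()) for word, n in counts.items()}
        for prev, counts in pair_counts.items()
    }


def train_translation_model(parallel_pairs):
    """Map each source word to output words by relative co-occurrence
    frequency across aligned sentence pairs (illustrative assumption)."""
    cooccurrence = defaultdict(Counter)
    for source_sentence, output_sentence in parallel_pairs:
        for s in source_sentence.lower().split():
            for o in output_sentence.lower().split():
                cooccurrence[s][o] += 1
    return {
        s: {o: n / sum(counts.values()) for o, n in counts.items()}
        for s, counts in cooccurrence.items()
    }


# Hypothetical toy data: Spanish as the source, English as the output language.
parallel = [("el gato", "the cat"), ("el perro", "the dog")]
monolingual = ["the cat sat", "the dog sat"]

translation_model = train_translation_model(parallel)
language_model = train_language_model(monolingual)
```

In this toy data, the source word "el" co-occurs with "the" in both sentence pairs, so its mapping favors "the" over "cat" or "dog", while the language model records which output-language words tend to follow one another.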
Machine translation engine training data is often obtained from news reports, parliament documents, educational “wiki” sources, etc. In many cases, the training data used to create a machine translation engine comes from a considerably different domain than the content that the engine is later used to translate. For example, content in the social media domain often includes slang terms, colloquial expressions, spelling errors, incorrect diacritical marks, and other features not common in carefully edited news sources, parliament documents, or educational wiki sources.
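One rough way to see this kind of domain mismatch is to measure how many words in the target content never appear in the training data. The sketch below is only an illustrative proxy under stated assumptions: whitespace tokenization, case folding, and a hypothetical helper named `out_of_vocabulary_rate`. It is not a measure prescribed by any translation engine.

```python
def out_of_vocabulary_rate(training_sentences, target_sentences):
    """Return the fraction of target tokens absent from the training
    vocabulary (whitespace tokenization is an illustrative assumption)."""
    vocabulary = {
        token
        for sentence in training_sentences
        for token in sentence.lower().split()
    }
    target_tokens = [
        token
        for sentence in target_sentences
        for token in sentence.lower().split()
    ]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for token in target_tokens if token not in vocabulary)
    return unseen / len(target_tokens)


# Hypothetical toy data: edited training text versus a social media post.
edited_text = ["the parliament adopted the resolution"]
social_post = ["omg the resolution lol"]
rate = out_of_vocabulary_rate(edited_text, social_post)
```

Here "omg" and "lol" are absent from the edited training text, so half of the tokens in the post are out of vocabulary.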
The techniques introduced here may be better understood by referring to the following Detailed Description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements.