The Internet has made it possible for people to connect and share information globally in ways previously undreamt of. Social media platforms, for example, enable people on opposite sides of the world to collaborate on ideas, discuss current events, or just share what they had for lunch. In the past, this spectacular resource has been somewhat limited to communication between users who share a common natural language (“language”). In addition, users have only been able to consume content that is in their language, or for which a content provider is able to determine an appropriate translation.
While communication across the many different languages used around the world is a particular challenge, several machine translation engines have attempted to address this concern. Machine translation engines enable a user to select or provide a content item (e.g., a message from an acquaintance) and quickly receive a translation of the content item. Machine translation engines can be created using training data that includes identical or similar content in two or more languages. Multilingual training data is generally obtained from news reports, parliamentary documents, educational “wiki” sources, etc. In many cases, the training data used to create a machine translation engine comes from a considerably different domain than the content that the engine is later used to translate. For example, content in the social media domain often includes slang terms, colloquial expressions, spelling errors, incorrect diacritical marks, and other features not common in carefully edited news sources, parliamentary documents, or educational wiki sources.