The Internet has made it possible for people to connect and share information globally in ways previously undreamt of. Social media platforms, for example, enable people on opposite sides of the world to collaborate on ideas, discuss current events, or share what they had for lunch. In the past, however, this spectacular resource has been somewhat limited to communication between users who share a common natural language (“language”). In addition, users have been able to consume only content that is in their own language, or content for which a content provider is able to determine an appropriate classification or translation.
While communication across the many different languages used around the world is a particular challenge, several types of language modules, such as language classifiers, language models, and machine translation engines, have been created to address it. These language modules enable “content items,” which can be any item containing language, including text, images, audio, video, or other multi-media, to be quickly classified, translated, sorted, read aloud, and otherwise used based on the semantics of the content item. Language modules can be created using “training data,” which is data that already carries a classification and that can be compared to other data in order to assign classifications to that other data. Training data is often obtained from sources where language classifications are assigned, such as news reports, parliament domains, and educational “wiki” sources. In many cases, sources of the training data do not account for differences among the dialects used within a particular language. For example, traditional speech recognition and machine translation systems for Arabic focus on Modern Standard Arabic (MSA) and do not account for other Arabic dialects, which can differ from MSA lexically, syntactically, morphologically, and phonologically. Such systems are not able to adequately recognize content items in non-MSA dialects or translate content items to or from those dialects.