In certain applications, it is helpful to understand the intent of content. For example, a post on a social networking site or bulletin board stating: “I have two chairs to sell, $20 each” could be classified as having an intent of selling something. In another example, if a user reads an article and comments “this is fake,” the intent may be to flag the content as inappropriate or misleading. Algorithmically classifying the content in this manner allows automated actions to be performed efficiently (e.g., tagging the post as a “sale,” searching for certain intents, moving the post to an appropriate page based on the topic, flagging the post as inappropriate, etc.).
Content may be classified according to its intent by a classifier. Typically, a classifier is trained, using machine learning, from labeled input data. For example, the input data may include text (“I have two chairs to sell”) and a corresponding intent label (“for sale”). By exposing the classifier to many labeled examples, the classifier learns to predict the intent of new, unlabeled content.
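By way of illustration only, training from labeled examples can be sketched as follows. The sketch assumes a simple word-count (naive Bayes) model with add-one smoothing; the function names, the four training examples, and the choice of model are hypothetical and are not taken from any particular system.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    return text.lower().split()

def train(examples):
    """Count labels and per-label word occurrences from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab

def classify(model, text):
    """Return the label with the highest naive Bayes log-score for the text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data (text, intent label).
EXAMPLES = [
    ("I have two chairs to sell", "for sale"),
    ("selling my old bike cheap", "for sale"),
    ("this is fake", "flag"),
    ("this article is misleading and fake", "flag"),
]

model = train(EXAMPLES)
print(classify(model, "two bikes to sell"))
print(classify(model, "this story is fake"))
```

With these four examples, the first call yields the “for sale” label and the second yields the “flag” label. A practical classifier would be trained on far more examples than are shown here.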
For many organizations interested in performing classification (businesses, academic institutions, governments, etc.), training data is predominantly available in a preferred language. For example, a company based in the United States might have access to a large number of labeled English-language training examples, but relatively few labeled Mandarin examples. Thus, to classify content in other languages, the content is typically subjected to machine translation to convert it into a language for which a classifier exists, and the automatically translated result is then classified using that classifier.
One problem with this approach is that machine translation systems are typically trained from a relatively small bilingual corpus (e.g., pairs of words or phrases in a source language and a target language). The bilingual training corpus often comes from a single domain (e.g., news stories translated from a first language to a second language may be readily available). Because the corpus comes from a single domain, or a limited number of domains, a number of problems can arise when attempting to apply a classifier to the translated content. For example, the translator's vocabulary outside of its training domain may be limited, so that the target words or phrases on which a classifier relies may be mistranslated or left untranslated.
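The translate-then-classify approach, and the way a limited translation vocabulary can defeat it, can be sketched as follows. This is a toy illustration under stated assumptions: the word-by-word lexicon stands in for a machine translation system trained only on news text, the keyword matcher stands in for an English-language classifier, and Spanish is used as the source language purely for readability. All names and data are hypothetical.

```python
# Hypothetical lexicon, as if learned from a news-only bilingual corpus.
NEWS_LEXICON = {
    "el": "the",
    "gobierno": "government",
    "anuncia": "announces",
    "dos": "two",
    "nuevas": "new",
    "leyes": "laws",
}

# Hypothetical English classifier: looks for target words signalling a sale.
SALE_KEYWORDS = {"sell", "selling", "sale"}

def translate(text, lexicon):
    # Word-by-word translation; out-of-vocabulary words pass through unchanged.
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

def classify_english(text):
    tokens = set(text.lower().split())
    return "for sale" if SALE_KEYWORDS & tokens else "other"

def classify_foreign(text, lexicon):
    # Translate into the classifier's language, then classify the result.
    return classify_english(translate(text, lexicon))

post = "vendo dos sillas"  # roughly, "I am selling two chairs"

# The news-trained lexicon lacks the marketplace words, so the sale is missed.
print(translate(post, NEWS_LEXICON))         # vendo two sillas
print(classify_foreign(post, NEWS_LEXICON))  # other

# With a broader (hypothetical) lexicon, the same classifier succeeds.
broader = dict(NEWS_LEXICON, vendo="sell", sillas="chairs")
print(classify_foreign(post, broader))       # for sale
```

The classifier itself is unchanged between the two runs; only the translator's vocabulary differs, which is the failure mode described above.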
This problem could be avoided by training a classifier in the original language of the content and applying the classifier directly, without intervening machine translation. However, it is often impractical to train classifiers in every possible language that could be encountered.