Efforts to extract meaning from source data (including documents and files containing text, audio, video, and other communications media) by classifying them into given categories have a long history. Increases in the amount of digital content, such as web pages, blogs, emails, digitized books and articles, electronic versions of formal government reports and legislative hearings and records, and especially social media such as Twitter, Facebook, and LinkedIn posts, give rise to computational challenges for those who desire to mine such voluminous information sources for useful meaning.
Particularly as the territorial reach of the Internet expands, one obstacle to obtaining value from digital content containing text is language classification. Categorization of text according to language is a prerequisite to any meaningful computational analysis of its content. Moreover, language classification can serve as a filter for information from a particular demographic.
Existing language classification techniques have several drawbacks. Some language classifiers adapted for use with digital content simply use information associated with an author profile, for example the author's location or primary language, as a proxy for the language in which that author has written. This approach is subject to error because the author may write in more than one language, and none of those languages need be related to the author's profile information.
More sophisticated language classification techniques use statistical association algorithms that categorize text based on the probability that certain features of the text, e.g., character combinations, will occur in a given language. However, such algorithms require a large amount of human-generated training data for each language supported by the algorithm, particularly where the data to be categorized includes only small amounts of text, e.g., posts on social media websites such as Twitter.
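
By way of illustration only, the statistical approach described above can be sketched in a few lines of Python. The sketch assumes character trigrams as the text features and a naive Bayes-style score with add-one smoothing; the function names and the toy training sentences are hypothetical and are not drawn from any particular product. It also makes the training-data burden concrete: with only two sentences per language, reliable results can be expected only for inputs that closely resemble the training text.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count n-grams per language; `samples` maps a language code to a list of texts."""
    return {
        lang: Counter(g for t in texts for g in char_ngrams(t, n))
        for lang, texts in samples.items()
    }


def classify(text, counts, n=3):
    """Return the language whose smoothed n-gram model gives the text the highest log-likelihood."""
    # Vocabulary size across all languages, used for add-one smoothing.
    vocab = len(set().union(*counts.values()))

    def score(lang):
        c = counts[lang]
        total = sum(c.values())
        # The extra +1 in the denominator reserves probability mass for unseen n-grams.
        return sum(
            math.log((c[g] + 1) / (total + vocab + 1))
            for g in char_ngrams(text, n)
        )

    return max(counts, key=score)


# Hypothetical toy training data; a practical classifier needs far more text per language.
SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short message about the weather",
    ],
    "es": [
        "el rápido zorro marrón salta sobre el perro perezoso",
        "este es un mensaje corto sobre el clima",
    ],
}

counts = train(SAMPLES)
print(classify("the weather is nice", counts))
print(classify("el perro es perezoso", counts))
```

Adding or removing a supported language in this sketch amounts to adding or removing a key in the training dictionary and retraining.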
Many commercially available products using the aforementioned statistical association algorithms suffer from the additional drawback that they are not customizable. The algorithms are often trained using a standardized set of training data, such as news articles or Wikipedia pages, which may not include features present in the data to be categorized. Social media posts, for example, often contain jargon unique to a given social media website that might not be encompassed in the training data. In addition, it may be difficult to add or remove languages from the training data based on the user's needs. Accordingly, there remains a need for improved language classifiers.