The exemplary embodiment relates to a system and method for language identification and finds particular application in the context of social media.
The textual content of social media posts can provide a wealth of information which can assist companies in understanding the views of customers about their products and services, allowing them to improve those products and services and to provide better customer care. Opinion mining techniques have been used to assign an opinion or emotion to a particular textual comment. Since there is generally no restriction on the language which can be used, the first stage in analyzing such documents is to identify the language of the document.
Methods for identifying the language of a written document are used in a number of applications, including translation, information retrieval, and the like. The accuracy of existing methods is generally quite high, and can be close to 100% in some cases. See Paul McNamee, “Language identification: A solved problem suitable for undergraduate instruction,” J. Comput. Sci. Coll., 20(3):94-101 (2005); and Thomas Gottron, et al., “A comparison of language identification approaches on short, query-style texts,” Adv. in Information Retrieval, pp. 611-614 (2010). However, in some contexts, such as for social media documents, the accuracy can be much lower. Social media texts are often written in a much less organized and less formal way than traditional structured and edited documents. They often contain slang, abbreviations, and code-switching (alternating between two or more languages, or language varieties, in the context of a single conversation), and can be extremely short. Language prediction accuracies of no more than about 70-80% are more typical for such texts, even when the list of possible languages is limited.
Traditional language identification methods often include comparing a document with a fingerprint of each language using, for example, a bag-of-n-grams (at the character or word level) or function words. Language identification on Twitter has been attempted using a baseline of character or word n-grams, which has been enhanced with additional sequential information by connecting character 3-grams in a graph (one graph per language) and finding a path of the tweet on this graph, as described in Erik Tromp, et al. “Graph-based n-gram language identification on short texts,” Proc. 20th Machine Learning Conf. of Belgium and The Netherlands, pp. 27-34 (2011). Some improvements can be achieved through better pre-processing, as described in John Vogel, et al., “Robust language identification in short, noisy texts: Improvements to LIGA,” 3rd Intl Workshop on Mining Ubiquitous and Social Environments, p. 43 (2012).
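By way of illustration only, the fingerprint comparison described above can be sketched as follows. This is a minimal example, not the method of any cited reference: it assumes character 3-grams, cosine similarity as the comparison measure, and a tiny hypothetical training sample per language (SAMPLES). The function names char_ngrams, cosine, build_fingerprints, and identify are likewise introduced here for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and
    normalizing whitespace; the text is padded with one space per side."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_fingerprints(samples, n=3):
    """Build one bag-of-n-grams fingerprint per language."""
    return {lang: char_ngrams(text, n) for lang, text in samples.items()}


def identify(text, fingerprints, n=3):
    """Return the best-matching language and the per-language scores."""
    doc = char_ngrams(text, n)
    scores = {lang: cosine(doc, fp) for lang, fp in fingerprints.items()}
    return max(scores, key=scores.get), scores


# Hypothetical, very small training samples (assumption for illustration).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
    "nl": "de snelle bruine vos springt over de luie hond en de kat slaapt in het huis",
}

fingerprints = build_fingerprints(SAMPLES)
best, scores = identify("the dog sleeps in the house", fingerprints)
print(best, scores)
```

With fingerprints this small, a short post containing slang or abbreviations shares few n-grams with any language, which is one way to see why accuracy drops on social media text.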
Social media content is generally associated with metadata. For example, Twitter allows users to identify the geo-location in which they are based, which can be included as an additional signal. See Moises Goldszmidt, et al., “Boot-strapping language identifiers for short colloquial postings,” Proc. European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (2013). However, others have found that self-reported geo-locations are a poor predictor of language and are provided by only a small proportion of Twitter users. See Mark Graham, et al., “Where in the world are you? Geolocation and language identification in Twitter,” The Professional Geographer (2014); Gregory Grefenstette, “Comparing two language identification schemes,” 3rd Intl Conf. on Statistical Analysis of Textual Data (JADT 1995), pp. 263-268 (1995), hereinafter, “Grefenstette 1995”; and Simon Carter, et al., “Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text,” Lang. Resour. Eval., 47(1):195-215 (March 2013), hereinafter, “Carter 2013.”
Other features that have been considered for improving language identification in social media posts include the user name and its prefixes; binary features regarding the script; a special tokenizer for URLs that extracts the hostname and top-level domain name; previously guessed languages of an author (the author's language histogram); the language histogram of users mentioned in the post; the context of the discussion (reply-to's are stored as metadata); maximal repeats at the character level; and the like. Weighting mechanisms have also been proposed to combine two or more existing tools. See Carter 2013; Shane Bergsma, et al., “Language identification for creating language-specific Twitter collections,” Proc. 2nd Workshop on Language in Social Media, LSM '12, pp. 65-74 (2012), hereinafter, “Bergsma 2012”; and Shumeet Baluja, et al., “Video Suggestion and Discovery for Youtube: Taking Random Walks Through the View Graph,” Proc. 17th Intl Conf. on World Wide Web (WWW '08), pp. 895-904 (2008), hereinafter, “Baluja 2008.” However, the research suggests that the language and country metadata fields that come with microblog posts tend to make poor signals for language identification: the language field greatly over- or underestimates the true underlying language distribution, and the geo-location field is generally too sparsely used to be relied upon.
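Two of the features listed above, the hostname and top-level domain of a URL and the author's language histogram, can be sketched as follows. This is a hedged illustration only: the function names (url_features, author_prior, combine), the additive smoothing, and the choice to multiply content-based scores by the histogram prior are assumptions made for this example and are not taken from the cited references.

```python
from collections import Counter
from urllib.parse import urlparse


def url_features(url):
    """Extract the hostname and top-level domain from a URL."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {"hostname": host, "tld": tld}


def author_prior(history, languages, smoothing=1.0):
    """Turn an author's previously guessed languages into a smoothed
    probability distribution (the author's language histogram)."""
    counts = Counter(history)
    total = len(history) + smoothing * len(languages)
    return {lang: (counts[lang] + smoothing) / total for lang in languages}


def combine(content_scores, prior):
    """Weight content-based scores by the author prior and renormalize.
    Multiplication is an assumed combination rule for this sketch."""
    weighted = {lang: content_scores[lang] * prior[lang] for lang in content_scores}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()} if total else weighted


# Hypothetical inputs (assumptions for illustration).
print(url_features("http://news.example.fr/article?id=1"))
prior = author_prior(["fr", "fr", "en"], ["en", "fr", "nl"])
posterior = combine({"en": 0.5, "fr": 0.4, "nl": 0.1}, prior)
print(max(posterior, key=posterior.get), posterior)
```

In this example the content-based scores slightly favor English, but the author's history shifts the combined estimate toward French, which shows how such a feature can help on very short posts.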
There remains a need for a system and method for improving the accuracy of language identification for social media text.