There are, in principle, two different techniques for automatically identifying the language of a text document: word-based language identification on the one hand and N-gram based identification on the other. Both methods work well on long texts, while N-grams are considered more robust for shorter texts.
The word-based language identification technique exploits the fact that every language has a set of commonly occurring words. Intuitively, a sentence containing the words and, the, and in would most probably be English, whereas a sentence with the word der would more likely be German. One obvious implementation of this technique is to keep a separate lexicon for each possible language and to look up every word in the sample text to see in which lexicon it falls. The lexicon that contains the most words from the sample indicates which language was used. If the words carry scores, a weighted sum can be used instead of a simple count.
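The lexicon-lookup implementation described above can be sketched as follows. The tiny lexicons here are hypothetical illustrations, not real language models:

```python
# Minimal sketch of word-based language identification.
# The lexicons below are toy examples of common function words,
# not realistic language resources.
LEXICONS = {
    "English": {"and", "the", "in", "of", "to", "is"},
    "German": {"der", "die", "das", "und", "ist", "ein"},
}

def identify_by_words(text):
    """Return the language whose lexicon contains the most sample words."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for tok in tokens if tok in lexicon)
        for lang, lexicon in LEXICONS.items()
    }
    # The lexicon with the highest hit count indicates the language.
    return max(scores, key=scores.get)
```

With per-word scores, the inner `sum` would add the score of each matching word rather than 1, giving the weighted-sum variant.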
An advantage of this method is that words, especially function words (pronouns, prepositions, articles, auxiliaries), tend to be quite distinctive for language identification.
A disadvantage of this method is that although common words occur often enough in larger texts, they might not occur in a shorter input text. Also, lexicons, especially for highly inflected languages, can be prohibitively large. The use of full-form lexicons is further hampered by misspellings and errors (such as those arising from an OCR process) and by the presence of out-of-vocabulary words in texts, especially in compounding languages like German.
The second language modeling technique is based on character N-grams (sequences of N consecutive characters), where N ranges typically from 2 to 5. Similarly to the common words technique, this technique assembles a language model from a corpus of documents in a particular language; the difference being that the model consists of character N-grams instead of complete words.
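A minimal sketch of assembling such a character N-gram model is shown below. The padding character, the trigram choice, and the rank-order ("out-of-place") comparison are common conventions, not details specified in the text above:

```python
from collections import Counter

def char_ngrams(word, n):
    """All character n-grams of a padded, lowercased word.
    '_' marks the word boundaries."""
    padded = f"_{word.lower()}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def profile(corpus, n=3, top=300):
    """Language model: the corpus's n-grams, ranked by frequency."""
    counts = Counter(g for word in corpus.split() for g in char_ngrams(word, n))
    return [g for g, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Rank-order distance between two profiles: lower is a closer match.
    N-grams missing from the language profile get the maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(rank.get(g, len(lang_profile)) for g in doc_profile)
```

Classification then amounts to building one profile per language from training corpora, profiling the input text the same way, and picking the language with the smallest distance.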
The absence of linguistic motivation imposes the following disadvantage on the N-gram method: N-grams are not as distinctive as function words. For example, the trigrams ‘_bo’, ‘bos’, and ‘ost’ are frequently used in English, so the word bost will receive a high score as an English word. However, bost is an archaic form not used in modern English, while it is a frequently used abbreviation in Sweden.
The rise of text data mining and knowledge management makes new demands on the implementation parameters for language identification. In multi-lingual environments, identifying the language of a piece of text is usually a prerequisite for subsequent processing. In domains with severe constraints on the size of the analyzed texts and on computational resources, language identification remains an important practical problem. A need therefore exists for an improved method of language identification.
U.S. Pat. No. 6,292,772, entitled “Method for identifying the language of individual words,” shows how decomposition of a word into a plurality of non-overlapping N-grams covering the entire word without gaps can be used to identify the language of that word. The described implementation demonstrates that all three restrictions imposed on the decomposition (non-overlapping, non-gapped, and covering the whole word) are essential.
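The decomposition constraint itself can be illustrated with a short sketch. This only enumerates valid decompositions against a hypothetical inventory of known N-grams; it does not reproduce the scoring of the cited patent:

```python
def decompositions(word, ngram_set, nmin=2, nmax=4):
    """All ways to split `word` into non-overlapping, gap-free N-grams
    (lengths nmin..nmax) drawn from `ngram_set`, covering the whole word.
    `ngram_set` is a hypothetical per-language inventory."""
    if not word:
        return [[]]  # empty word: one decomposition, the empty one
    results = []
    for n in range(nmin, min(nmax, len(word)) + 1):
        head = word[:n]
        if head in ngram_set:
            # The next N-gram must start exactly where this one ends:
            # no overlap, no gap, so the whole word ends up covered.
            for rest in decompositions(word[n:], ngram_set, nmin, nmax):
                results.append([head] + rest)
    return results
```

If no decomposition exists for a given language's inventory, the word cannot be attributed to that language by this method.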
Current information retrieval makes little use of linguistic tools: such tools are expensive to develop, slow, and unavailable for many languages. Search tools instead rely on robust approaches that combine language-dependent and language-independent processing. A search (for example, a Google search) will not completely fail if one Latin-based text is identified as another Latin-based text (for example, if Irish is identified as English).
However, there is a growing area of information extraction where language-dependent processing is vital. Whereas information retrieval finds relevant texts and presents them to the user, the typical information extraction application analyses texts and presents only the specific information from them that the user is interested in.
There is a need for a more “linguistic” approach to the problem of language identification. Computationally treatable features include at least: alphabet, phonetics, orthography, lexical roots, inflections/derivations/clitics, compounding, and function (and other) words. However, many of these features are brittle, making them difficult to use effectively. For example, “international” words like index or Schwarzenegger become purely Hungarian simply by the addition of a small suffix -nek (indexnek, Schwarzeneggernek); headlines are short, often contain “foreign” words, are capitalized, and are not full sentences; diacritics are sometimes not used properly; texts contain typographical errors; and emails and chat rooms use informal styles of writing.