With the expansion of global computer networks, users can search for and gather documents from many sources, many of which may be unknown and are often unmanaged. The language of a retrieved document is not necessarily known a priori.
Traditional language detection techniques estimate the likelihood that a document is in a given language. This estimate is computed in isolation, without regard to the other languages the document might be in. Such prior methods typically test whether the document exhibits the N-gram or short-word distribution characteristic of documents in the given language.
If the languages in a candidate list are significantly different from each other, the traditional techniques work reasonably well. However, when the languages are very similar to each other, a document can score highly for several of them at once, and many false positives are obtained.
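The limitation described above can be illustrated with a minimal sketch of isolated scoring. It is not taken from any particular prior system: the function names, the use of character trigrams, the overlap-based score, and the 0.5 threshold are all assumptions chosen for illustration. Each candidate language is tested on its own, so nothing prevents two similar languages from both being accepted.

```python
from collections import Counter


def trigrams(text):
    """Count character trigrams in whitespace-normalized, lowercased text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def isolated_score(doc, profile):
    """Fraction of the document's trigram occurrences that also appear
    in one language profile. No other language is consulted."""
    grams = trigrams(doc)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    hits = sum(n for g, n in grams.items() if g in profile)
    return hits / total


def detect_in_isolation(doc, profiles, threshold=0.5):
    """Accept every language whose own score clears the threshold.

    The threshold value is an arbitrary assumption. Because each test is
    independent, overlapping profiles can all be accepted together.
    """
    return [
        lang
        for lang, profile in profiles.items()
        if isolated_score(doc, profile) >= threshold
    ]
```

Here `profiles` is a hypothetical mapping from a language name to the trigram counts of sample text in that language, for example `{"A": trigrams(sample_a), "B": trigrams(sample_b)}`. When the two samples share most of their trigrams, a document drawn from either language is reported as both.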
The present invention provides a method and system for detecting the language of an unknown text that cures the above problems and others.