Although large data networks now span the globe and make the online world a truly multinational community, there is still no single human language in which to communicate. Electronic messages and documents remain written in a particular human language, such as German, Spanish, Portuguese, Greek, English, Chinese, Japanese, Arabic, Hebrew, or Hindi.
In many situations there is a need to quickly identify the human language of a particular document for further natural language processing. For example, identifying the document's human or natural language is helpful for indexing or classifying the document. In other situations, a word processor may need to identify a document's language in order to check spelling and grammar, to select language translation tools or libraries, or to enable appropriate printer fonts.
Previous methods of language identification include n-gram methods, especially tri-gram methods. In some tri-gram methods, language-specific training data or documents have been used to create tables or profiles for the respective languages, called tri-gram language profiles. In some implementations, a three-letter window is slid over training text of a particular language. As the window is slid over the text, the method counts the occurrences of each three-letter sequence appearing in the window to generate a tri-gram language profile for that language. This process is repeated for text of various languages to provide a set of tri-gram language profiles, one per language, which are used later to identify the language of documents of unknown language.
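The profile-building step described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular prior system: the function name `trigram_profile`, the lowercasing step, and the one-sentence training texts are assumptions made for the example.

```python
from collections import Counter

def trigram_profile(text: str) -> Counter:
    """Slide a three-character window over text and count each sequence seen."""
    text = text.lower()  # assumption: case is folded before counting
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Hypothetical training text; a real profile would be built from far more data.
profiles = {
    "english": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "german": trigram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
```

Each profile maps a three-letter sequence to the number of times it occurred in that language's training text, so repeating the call per language yields the set of profiles used later for identification.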
During language identification, a similar three-letter window is slid over the unknown document. For each three-letter sequence within the unknown document, the method seeks matching three-letter sequences in each of the tri-gram profiles. If a match is found for a particular language, the frequency information within that language's profile for the matched three-letter sequence can be added to a cumulative score for that language. In this manner, the cumulative score for each language is incremented as the window is slid over the whole unknown document. Other scoring schemes are also used, such as storing n-gram frequency information as probability values. During matching, these probability values can be multiplied to generate cumulative language scores. The language having the highest cumulative score is deemed to be the language of the unknown document. Unfortunately, tri-gram methods are typically computationally expensive, since every three-letter sequence in the document must be looked up in the profile of every candidate language.
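The two scoring schemes described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of any specific system: all function names, the `floor` probability for unseen sequences, and the tiny training texts are hypothetical. The probability variant sums logarithms instead of multiplying raw probabilities, which is equivalent for ranking and avoids numeric underflow.

```python
import math
from collections import Counter

def trigrams(text):
    """Return the three-character sequences seen by a sliding window."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def additive_score(document, profile):
    """Add the profile frequency of every matched sequence in the document."""
    return sum(profile.get(t, 0) for t in trigrams(document))

def log_probability_score(document, profile, floor=1e-6):
    """Multiply per-sequence probabilities, computed as a sum of logs.

    floor is an assumed stand-in probability for sequences absent from the profile.
    """
    total = sum(profile.values()) or 1
    return sum(math.log(max(profile.get(t, 0) / total, floor))
               for t in trigrams(document))

def identify(document, profiles, score=additive_score):
    """Return the language whose profile gives the highest cumulative score."""
    return max(profiles, key=lambda lang: score(document, profiles[lang]))

# Hypothetical one-sentence training texts, for illustration only.
profiles = {
    "english": Counter(trigrams("the quick brown fox jumps over the lazy dog")),
    "german": Counter(trigrams("der schnelle braune fuchs springt über den faulen hund")),
}
```

Calling `identify("the dog", profiles)` scores the document against every profile, so the work grows with both the document length and the number of candidate languages.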
Another method of language identification varies the length of the n-gram sequences. In such language identification systems, an n-gram profile, more generally referred to as a “language profile,” includes frequency information for n-grams of various lengths (e.g., bi-grams, tri-grams, or 4-grams). However, as with tri-gram methods, other n-gram methods are computationally expensive and thus relatively slow. This lack of speed generally becomes more problematic as the number of languages being considered increases. Further, lack of speed can be especially problematic when language identification is coupled with other applications, such as document indexing. Advantageously, however, tri-gram and other n-gram language identification methods are considered relatively accurate even when the document or text sample is rather brief, such as an individual sentence.
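A language profile that mixes n-gram lengths might be built as in the sketch below. The function name `ngram_profile` and the default `sizes` of 2, 3, and 4 are assumptions chosen to mirror the bi-gram, tri-gram, and 4-gram example in the text.

```python
from collections import Counter

def ngram_profile(text, sizes=(2, 3, 4)):
    """Count n-grams of every requested length in a single language profile."""
    text = text.lower()
    profile = Counter()
    for n in sizes:
        # One sliding-window pass per n-gram length.
        profile.update(text[i:i + n] for i in range(len(text) - n + 1))
    return profile
```

Each additional length adds another pass over the text and more profile entries to match, which is consistent with the speed concern noted above.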
A faster and/or improved method of language identification in view of issues associated with prior art language identification methods and systems would have significant utility.