1. Field of the Invention
The present invention relates to identifying the language of a text which can be short and made up of only a few words or even a single word.
2. Description of the Prior Art
The invention applies in particular to the automatic processing of natural language to recognize the language of a written text, for example before the text is translated into other languages or synthesized into a spoken message. Tools for automatically processing natural language, such as syntactical analyzers and/or semantic analyzers, use data sets that characterize only one language at a time, such as a lexicon of basic lexical forms constituting dictionary entries, morphological rules and grammatical rules. Even if the tool is capable of processing any language, the data is often prepared in order to analyze one language at a time.
Identification of the language of a text is therefore essential before analyzing the text linguistically.
To cite another example, identification of the language is even more necessary if a text is written in more than one language, for example to translate a multilingual text into a single language.
U.S. Pat. No. 5,062,143 proposes a statistical approach to analyzing the language used in a text using trigrams, i.e. strings of three consecutive characters. Initially, for each language, trigrams that appear the most frequently in a text of that language of a reasonable size, for example approximately 500 characters, are detected to constitute a key set of trigrams. Trigrams whose frequency of occurrence is at least equal to a prescribed frequency are used as the key set for that language. For a 26 letter alphabet and trigrams made up of characters including at least one space position, for example, the key set comprises approximately 80 trigrams that occur at a frequency representative of a fairly high probability.
The text whose language is to be identified is then broken into trigrams in order to recognize and count the trigrams that belong to the key set for a given language. The trigrams of the key sets for the other languages are also detected and counted. The language for which the percentage of matches of trigrams with the respective key set is the greatest and exceeds a prescribed value is deemed to be the language in which the text is written.
The foregoing identification of a language by means of a statistical approach depends considerably on the length of the text whose language is to be identified. If the text, such as a sentence, is relatively long, the trigram-based approach of U.S. Pat. No. 5,062,143 yields a reliable result even if the text contains words of another language. On the other hand, the identification of a language in a short sentence by means of trigrams alone is significantly less precise, especially when the number of languages to be distinguished is large. For example, the language of the English sentence “I want to go to Birmingham” may be identified as Polish, because of the trigrams “t-o-space” and “space-t-o”, which are more frequent in Polish than in English.
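The trigram key-set approach described above can be illustrated by a minimal sketch. It is not taken from U.S. Pat. No. 5,062,143; the function names `trigrams`, `build_key_set` and `identify`, the parameters `min_frequency` and `min_match`, and the normalization of white space are assumptions made for illustration only.

```python
from collections import Counter


def trigrams(text):
    """Return every overlapping three-character string of the text.

    Assumption: the text is lower-cased, runs of white space are collapsed,
    and one space is added at each end so that word boundaries yield trigrams.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_key_set(sample, min_frequency):
    """Keep the trigrams whose relative frequency in the learning sample
    is at least equal to min_frequency (a hypothetical threshold)."""
    counts = Counter(trigrams(sample))
    total = sum(counts.values())
    return {t for t, c in counts.items() if c / total >= min_frequency}


def identify(text, key_sets, min_match):
    """Return the language whose key set matches the greatest share of the
    trigrams of the text, or None if that share does not exceed min_match."""
    grams = trigrams(text)
    if not grams:
        return None
    best_lang, best_score = None, 0.0
    for lang, keys in key_sets.items():
        score = sum(1 for g in grams if g in keys) / len(grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang if best_score > min_match else None
```

With a short input, `trigrams` returns only a handful of strings, so a few trigrams shared with another language can dominate the match percentage, which is the weakness noted above for short sentences.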
Instead of identifying the language of an entire text document, the method of U.S. Pat. No. 6,292,772 B1 accurately identifies the language of individual words. The identifying method of this patent utilizes character n-grams of any length, e.g. unigrams, bigrams, trigrams, and so on, and not just trigrams. Each word is broken down into one or more consecutive n-grams to determine a first n-gram at the start of the word, one or more subsequent n-grams and an end n-gram that do not overlap and characterize the word to be analyzed. All these n-grams are compared to prestored n-grams of a language defined statistically in texts from which the language is learned.
This method therefore determines the language to which an isolated word belongs and is repeated for each of the words of a text to identify the language of that text.
If a word, i.e. an n-gram model, is contained in several languages, respective weights are assigned to those languages to distinguish them. For example, if the word is “de”, the statistical approach without weighting indicates exactly the same probability for French, Dutch and Spanish, since all three languages include the word “de”. Weighting makes it possible to designate one of these three languages even though it is not certain that in the context of a sentence the word really belongs to that language.
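The effect of weighting on a word shared by several languages can be shown with a short sketch. The function `rank_languages`, the toy lexicons and the weight values below are hypothetical and are not taken from U.S. Pat. No. 6,292,772 B1; they only illustrate how a tie is broken.

```python
def rank_languages(word, lexicons, weights=None):
    """Rank the languages whose lexicon contains the word.

    Each language scores 1.0 unless a weight is supplied for it.
    Ties are ordered alphabetically by language code.
    """
    weights = weights or {}
    scores = {
        lang: weights.get(lang, 1.0)
        for lang, words in lexicons.items()
        if word in words
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy lexicons (assumed data): "de" belongs to all three languages.
lexicons = {
    "fr": {"de", "le"},
    "nl": {"de", "het"},
    "es": {"de", "el"},
}

# Without weights, the three languages are indistinguishable.
unweighted = rank_languages("de", lexicons)

# With hypothetical weights, one language is designated.
weighted = rank_languages("de", lexicons, {"fr": 0.5, "nl": 0.3, "es": 0.2})
```

The designated language is simply the one with the greatest weight, which may still be wrong for a given sentence, as noted above.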
U.S. Pat. No. 6,415,250 relates to an automatic language identification system based on a probabilistic analysis of predetermined portions of words extracted from an input text whose language is to be identified. A word portion is a prefix or a word ending having a predetermined number of characters, generally a suffix at the end of a word. A corpus analyzer associates with each word portion of a predetermined corpus in a language a normalized frequency representative of the number of times that the word portion was found in the corpus and a relative likelihood or probability derived from the frequency relative to the size of the corpus. In particular, if the word portion rarely appears in the language, the probability is close to zero. A language identification engine in the analyzer sums for each language the relative probabilities for the extracted word portions recognized in the corpus of the language and retains only the greatest sum of the accumulated relative probabilities to identify the language of the input text.
The language identification system of the previously cited U.S. patent is inaccurate since it is limited to a single category of first character strings, such as suffixes (or prefixes), in a word and therefore does not analyze each word to extract therefrom all possible character strings, regardless of their positions in the word and their lengths. The analyzer analyzes only one character string per extracted word relative to the corpus of a language.
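The word-portion scoring described above for U.S. Pat. No. 6,415,250 can be sketched as follows. This is a simplified illustration under stated assumptions, not the system of that patent: the word portion is taken to be a fixed-length suffix, the probability is the suffix count divided by the number of corpus words, and the names `train` and `identify` and the parameter `n` are hypothetical.

```python
from collections import Counter


def train(corpus_words, n):
    """Map each n-character suffix to its frequency relative to corpus size."""
    total = len(corpus_words)
    if total == 0:
        return {}
    counts = Counter(w[-n:] for w in corpus_words if len(w) >= n)
    return {portion: count / total for portion, count in counts.items()}


def identify(text, models, n):
    """Sum, for each language, the probabilities of the suffixes found in
    the text and return the language with the greatest sum and all sums."""
    portions = [w[-n:] for w in text.lower().split() if len(w) >= n]
    sums = {
        lang: sum(model.get(p, 0.0) for p in portions)
        for lang, model in models.items()
    }
    return max(sums, key=sums.get), sums


# Toy corpora (assumed data, far smaller than a real learning corpus).
models = {
    "en": train(["running", "walking", "talking", "cat"], 3),
    "fr": train(["parler", "manger", "chat", "aller"], 3),
}
```

Only one character string per word enters the sum in this sketch, which reflects the limitation noted above.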