The increased collection and indexing of publications often requires that the language in which the publications are written be known. For the purposes of this specification, the term “language” shall mean a natural language (i.e., human language) used for personal communication, such as English, French, Spanish, Portuguese, German, etc., though the method presented here is not limited to natural languages, and may also be applied to artificial languages such as programming languages. For example, when indexing a database of documents, it may be helpful to classify the documents according to their corresponding languages. Language identification for some texts may be simple, for example, a publication that always appears in only one language. However, for a significant number of texts, particularly texts from a mixed database such as the World Wide Web, language identification is not so easy.
In order to assist document classifiers with identifying the language of a document's text, an XML (Extensible Markup Language) marking may be manually placed in the text of the document. For example, one can place the tags <p xml:lang="de"> and </p> on either side of a paragraph to show that the language of the paragraph is German, since “de” is the ISO 639 two-letter language code for German. (See http://www.ietf.org/rfc/rfc1766.txt for a description of the language tags used in XML language markup, and www.ics.uci.edu/pub/ietf/http/related/iso639.txt for a description of ISO 639 codes.) However, a majority of documents do not contain such an XML marking. Thus, it is desirable to use an automated language identification tool, such as a computer program, to determine the language of the document. A number of language identification programs are known in the art.
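As an illustration of how a classifier might read such a marking when it is present, the following minimal sketch uses Python's standard XML parser. The function name declared_language is hypothetical and is not part of any cited reference; the sketch assumes the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The "xml:" prefix is predefined by the XML specification and is bound to
# this namespace, so the parser reports the attribute under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def declared_language(fragment):
    """Return the xml:lang code of the fragment's root element, or None.

    Hypothetical helper for illustration only.
    """
    element = ET.fromstring(fragment)
    return element.get(XML_LANG)

if __name__ == "__main__":
    print(declared_language('<p xml:lang="de">Guten Tag</p>'))
    print(declared_language('<p>No marking here</p>'))
```

When no marking is present the helper returns None, which is the case an automated identification tool must handle.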
One such program compares short or frequent words (e.g., the, in, of, that for English; el, la, los, las, en, de, que for Spanish, etc.) in the document with common short words from a plurality of different languages. The common short words from each available language are stored in corresponding databases. After comparing the document's short words with the language databases, the program identifies the language associated with the database containing the greatest number of short words from the document text. That is, the database that yields the highest frequency of short words from the textual passage determines the text's language. See descriptions of these methods in both Beesley, Kenneth R., “Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-Line Text,” in the Proceedings of the 29th Annual Conference of the American Translators Association, 1988, and in Grefenstette, Gregory, “Comparing Two Language Identification Schemes,” in Proceedings of 3rd International Conference on Statistical Analysis of Textual Data (JADT 1995), Rome, Italy; December, 1995, vol. II, pp. 263-268.
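The short-word comparison described above can be sketched as follows. This is a minimal illustration and not the implementation of any cited program: the word lists are only the examples given above, the name identify_by_short_words is hypothetical, and splitting on whitespace is a simplifying assumption.

```python
# Tiny illustrative "databases" built from the example words in the text.
# A real system would hold far larger lists for many more languages.
SHORT_WORDS = {
    "english": {"the", "in", "of", "that"},
    "spanish": {"el", "la", "los", "las", "en", "de", "que"},
}

def identify_by_short_words(text):
    """Return (best_language, counts) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in SHORT_WORDS.items()
    }
    # The language whose database matches the most tokens is selected.
    best = max(counts, key=counts.get)
    return best, counts

if __name__ == "__main__":
    print(identify_by_short_words("the cat sat in the corner of that room"))
    print(identify_by_short_words("el perro de la casa en que vivo"))
```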
A similar approach to language identification involves the use of n-gram analysis. An n-gram is a sequence of “n” consecutive characters extracted from a word. Typical values for n are 2, 3, or 4. Assuming such values for n, the respective names for such n-grams are “bi-grams”, “tri-grams”, and “quad-grams”. The frequency approach used for analyzing short words can also be applied to n-grams because the main idea is that similar words will have a high proportion of n-grams in common. Thus, upon calculating the frequency profiles for each n-gram according to each language, the language yielding the highest frequency is determined to be the language in which the text is written. See descriptions of these methods in both Cavnar, William B., et al., “N-Gram-Based Text Categorization,” in Symposium on Document Analysis and Information Retrieval, 1994, and Dunning, Ted, “Statistical Identification of Language,” CLR Tech Report (MCCS-94-273), 1994.
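For illustration, n-gram extraction and a simple frequency profile may be sketched as below. The function names word_ngrams and ngram_profile are hypothetical, and the sketch assumes whitespace tokenization; it does not reproduce the specific profile comparison of either cited reference.

```python
from collections import Counter

def word_ngrams(word, n):
    """Return every run of n consecutive characters in the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_profile(text, n=3):
    """Count how often each n-gram occurs across the words of the text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word_ngrams(word, n))
    return counts

if __name__ == "__main__":
    print(word_ngrams("text", 2))   # bi-grams of one word
    print(word_ngrams("text", 3))   # tri-grams of one word
    print(ngram_profile("the then there").most_common(3))
```

Words shorter than n contribute no n-grams in this sketch; a fuller treatment might pad words with boundary markers, which is a design choice left open here.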
Another known method of language identification is described in U.S. Pat. No. 5,062,143. In this method a text is divided into tri-grams. The tri-grams are compared with key sets of common tri-grams of various languages. The number of tri-grams found for each language is divided by the total number of tri-grams found in the original text. The language possessing the highest ratio of identified tri-grams is retained as the identity of the original text. The approach of the present invention differs from this particular prior art in providing a significantly different method of comparing n-grams (the present invention is not limited to tri-grams) and of weighting and using the n-grams retained in language key sets.
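The ratio computation summarized above can be pictured with the following minimal sketch. It is not taken from the cited patent: the key sets are invented for illustration, the names trigrams and trigram_ratios are hypothetical, and forming tri-grams with a sliding window over the whole text (spaces included) is an assumption.

```python
# Hypothetical key sets of common tri-grams; real key sets would be
# derived from large samples of each language.
KEY_SETS = {
    "english": {"the", "he ", "ing", "and", " th"},
    "spanish": {"que", " de", "de ", "los", "ent"},
}

def trigrams(text):
    """Slide a three-character window over the lower-cased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def trigram_ratios(text):
    """For each language: matched tri-grams / total tri-grams in the text."""
    grams = trigrams(text)
    if not grams:
        return {lang: 0.0 for lang in KEY_SETS}
    return {
        lang: sum(1 for gram in grams if gram in keys) / len(grams)
        for lang, keys in KEY_SETS.items()
    }

if __name__ == "__main__":
    ratios = trigram_ratios("the thing")
    print(ratios)
    print(max(ratios, key=ratios.get))  # language with the highest ratio
```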
In another known method of language identification (U.S. Pat. No. 6,216,102), the most common words in each language are truncated to a predetermined length and stored in a key table for that language. When the language identifier is presented with a new text to identify, the words in the text are truncated to this predetermined length and each truncated word is compared to each language key table. The language key table that contains the maximum number of truncated words in common with the presented text is chosen as the language of the text. U.S. Pat. Nos. 5,548,507, 6,009,382, and 6,023,670 are variants of this same method, but the variants do not truncate words before comparison. The method of the present invention differs significantly from all these variants in extracting a plurality of information-bearing n-grams from each word in the input text, including word endings, which are good indicators of a language and which the truncation method ignores. The scoring method disclosed in the present invention is also more sophisticated than this simple counting technique.
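The truncated-word key tables described above may be sketched as follows. The word lists, the truncation length of four characters, and the name identify_by_truncated_words are all assumptions made for illustration and are not drawn from the cited patents.

```python
TRUNCATE_LEN = 4  # assumed "predetermined length"

# Hypothetical lists of common words for two languages.
COMMON_WORDS = {
    "english": ["the", "which", "would", "there", "about"],
    "french": ["les", "dans", "pour", "avec", "cette"],
}

# Each key table stores the common words cut to the predetermined length.
KEY_TABLES = {
    lang: {word[:TRUNCATE_LEN] for word in words}
    for lang, words in COMMON_WORDS.items()
}

def identify_by_truncated_words(text):
    """Truncate each word of the text and count matches per key table."""
    truncated = [word[:TRUNCATE_LEN] for word in text.lower().split()]
    counts = {
        lang: sum(1 for item in truncated if item in table)
        for lang, table in KEY_TABLES.items()
    }
    return max(counts, key=counts.get), counts

if __name__ == "__main__":
    print(identify_by_truncated_words("there would be more about which"))
    print(identify_by_truncated_words("cette maison avec les amis"))
```

Because only the first characters of each word are kept, everything after the cut, including the word ending, plays no part in the comparison.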
The problem with using a frequency approach with either short words or n-grams is that some languages have similar short words and similar n-grams. For example, the word “que” is present in the French, Spanish, and Portuguese languages. The presence of the same word or n-gram, such as “que”, in multiple languages can distort the frequency analysis. Some current language identification methods, including those cited above, ignore this frequency distortion problem, and others (e.g., U.S. Pat. No. 6,167,369) simply remove similar words from the frequency analysis. Thus, there is a need to appropriately address the problem of the same word(s) or n-gram(s) appearing in multiple languages, so as to improve the accuracy of language identification.
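The distortion can be seen in a small sketch. The word lists below are invented for illustration, and the helper names shared_words, match_counts, and without_shared are hypothetical; the removal step only pictures the general idea of discarding shared words and is not the procedure of any cited patent.

```python
# Hypothetical short-word lists; "que" appears in all three.
SHORT_WORDS = {
    "french": {"que", "le", "les", "et"},
    "spanish": {"que", "el", "los", "y"},
    "portuguese": {"que", "o", "os", "e"},
}

def shared_words(word_lists):
    """Return the words that occur in more than one language's list."""
    seen, shared = set(), set()
    for words in word_lists.values():
        shared |= seen & words
        seen |= words
    return shared

def match_counts(text, word_lists):
    """Count, per language, the tokens of the text found in its list."""
    tokens = text.lower().split()
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in word_lists.items()
    }

def without_shared(word_lists):
    """Drop every shared word from every list."""
    shared = shared_words(word_lists)
    return {lang: words - shared for lang, words in word_lists.items()}

if __name__ == "__main__":
    text = "que que que"
    print(shared_words(SHORT_WORDS))
    print(match_counts(text, SHORT_WORDS))                  # three-way tie
    print(match_counts(text, without_shared(SHORT_WORDS)))  # evidence discarded
```

Counting the shared word credits all three languages equally, while removing it discards the evidence altogether; neither outcome helps distinguish the languages.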