A number of techniques have been proposed for automatically identifying the language of a text. Grefenstette, G., "Comparing Two Language Identification Schemes," JADT 1995, 3rd International Conference on Statistical Analysis of Textual Data, Rome, 11-13 December 1995, pp. 263-268, compares two techniques, one using letter trigrams, the other based on common short words.
The trigram technique described by Grefenstette tokenizes large samples of text from each of a number of different languages using the space as sole separator and adding an underscore before and after each token to mark initial and terminal bigrams. The frequency of each sequence of three characters in each language is then counted. Trigrams with more than a minimum frequency are retained, and the probability of a retained trigram is approximated by dividing the trigram's frequency by the sum of the frequencies of all retained trigrams for the language. The probabilities are then used to guess the language of a sentence by dividing the sentence into trigrams and calculating the probability of the sequence of trigrams for each language, assigning a minimal probability to trigrams without assigned probabilities. The language with the highest probability for the sequence of trigrams is chosen.
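The trigram procedure described above can be illustrated with a minimal Python sketch. It is not Grefenstette's implementation: the function names, the min_count threshold, and the floor value used for unseen trigrams are all assumptions chosen for illustration. The sketch also sums logarithms instead of multiplying probabilities, which selects the same language while avoiding numeric underflow.

```python
import math
from collections import Counter

def trigrams(text):
    """Split on spaces, pad each token with underscores, yield its 3-character sequences."""
    for token in text.split(" "):
        if not token:
            continue
        padded = "_" + token + "_"
        for i in range(len(padded) - 2):
            yield padded[i:i + 3]

def train_trigram_model(sample, min_count=0):
    """Keep trigrams occurring more than min_count times (assumed threshold) and
    approximate each probability as its count over the sum of retained counts."""
    counts = Counter(trigrams(sample))
    kept = {t: c for t, c in counts.items() if c > min_count}
    total = sum(kept.values())
    return {t: c / total for t, c in kept.items()}

def log_score(sentence, model, floor=1e-6):
    """Log probability of the sentence's trigram sequence; floor is an assumed
    minimal probability for trigrams without an assigned probability."""
    return sum(math.log(model.get(t, floor)) for t in trigrams(sentence))

def guess_language(sentence, models, floor=1e-6):
    """Return the language whose model gives the highest score."""
    return max(models, key=lambda lang: log_score(sentence, models[lang], floor))

# Toy training samples, far smaller than the large samples the technique assumes.
models = {
    "en": train_trigram_model("the cat and the dog are in the house with the other animals"),
    "fr": train_trigram_model("le chat et le chien sont dans la maison avec les autres animaux"),
}
print(guess_language("the dog", models))
```

With realistic training corpora, min_count would be set above zero so that rare, unreliable trigrams are dropped before the probabilities are computed.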
The short word technique described by Grefenstette similarly tokenizes large samples of text from each of a number of different languages and calculates the frequency of all tokens, generally words, of five characters or fewer. Tokens with more than a minimum frequency are retained, and the probability of a retained token is approximated as in the trigram technique. The probabilities are then used to guess the language of a sentence by tokenizing the sentence and calculating the probability of the sequence of tokens for each language, assigning a minimal probability to tokens without assigned probabilities. The probability that a sentence belongs to a given language is taken as the product of the probabilities of the tokens.
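A corresponding sketch of the short word procedure follows, again in Python and again only as an illustration under stated assumptions: the function names, the whitespace tokenization, the min_count threshold, and the floor probability are hypothetical choices, not details taken from Grefenstette. Here the sentence probability is computed literally as a product of token probabilities, which is adequate for short sentences but would underflow on long texts.

```python
import math
from collections import Counter

MAX_LEN = 5  # tokens of five characters or fewer are treated as short words

def short_words(text, max_len=MAX_LEN):
    """Whitespace-tokenize (an assumed tokenization) and keep only short tokens."""
    return [w for w in text.split() if len(w) <= max_len]

def train_word_model(sample, min_count=0):
    """Keep short tokens occurring more than min_count times (assumed threshold);
    probability = count / sum of retained counts, as in the trigram technique."""
    counts = Counter(short_words(sample))
    kept = {w: c for w, c in counts.items() if c > min_count}
    total = sum(kept.values())
    return {w: c / total for w, c in kept.items()}

def sentence_probability(sentence, model, floor=1e-6):
    """Product of token probabilities; tokens without an assigned probability
    receive the assumed minimal probability floor."""
    return math.prod(model.get(w, floor) for w in sentence.split())

def guess_language(sentence, models, floor=1e-6):
    """Return the language with the highest sentence probability."""
    return max(models, key=lambda lang: sentence_probability(sentence, models[lang], floor))

# Toy training samples for illustration only.
models = {
    "en": train_word_model("the cat and the dog are in the house"),
    "fr": train_word_model("le chat et le chien sont dans la maison"),
}
print(guess_language("le chat", models))
```

Because a long word is not in any language's table, it receives the same floor probability everywhere and does not change which language ranks highest.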
Grefenstette compared the techniques by feeding each sentence to each technique to obtain two language guesses. Either technique works well on long sentences, and trigrams are more robust for shorter sentences. This can be expected because shorter sentences may be titles or section headings that contain characteristic trigrams but may not contain short words. Using short words is slightly more rapid in execution because there are fewer words than trigrams in a given sentence, and each word or trigram contributes a multiplication to the probability calculation.
Martino et al., U.S. Pat. No. 5,548,507, disclose a language identification process using coded language words. In discussing prior art, Martino et al. distinguish trigram and n-gram character based systems. Martino et al. instead disclose a technique that reads word codes from a document and compares the word codes to predetermined lists of words selected from languages or genres of interest. The language or genre of the document is identified by a process that determines which language's word list most closely matches the words in the document. Closeness of match is weighted by frequency of occurrence of listed words in each language or genre.
Dunning, T., "Statistical Identification of Language," Computing Research Laboratory Technical Report MCCS-94-273, New Mexico State University, 1994, pp. 1-29, discloses a statistically based program that learns to distinguish between languages. In relation to previous work, Dunning discusses unique letter combination techniques, common word techniques, N-gram counting with ad hoc weighting, and N-gram counting with rank order statistics. Dunning then discloses an N-gram technique that develops a set of character level language models from training data and then uses the language models to estimate the likelihood that a particular test string might have been generated by each of the language models.
The invention addresses basic problems that arise in automatic language identification using short or common word techniques and N-gram techniques. One problem relates to sample size, another to the different contexts in which each technique works better.
As noted by Grefenstette, both short word and N-gram techniques work well on a large sample such as a long sentence, while N-gram techniques are more robust for smaller samples such as short sentences. Even N-gram techniques, however, work less well as the size of the sample decreases. As a result, even N-gram techniques become unsatisfactory for the very small samples that typically occur in some applications, such as user input queries to Internet search engines.
As noted by Dunning, common word techniques are difficult or impossible to apply to languages in which tokenization into words is difficult (as in Chinese) or in which it is difficult to define a set of common words. Martino et al., on the other hand, argue that trigrams, N-grams, and other artificial divisions in a computerized text are not considered reliable and have limited success in identifying the correct language. A more general statement of this problem is that, in some contexts, N-gram techniques produce better results than word techniques, while in others, word techniques produce better results.
The invention is based on the discovery of a new technique for automatic language identification that alleviates these problems. The new technique automatically identifies a natural language that is likely to be the predominant language of a sample text. To do so, the new technique uses text data defining the sample text and probability data for each of a set of languages to automatically obtain, for each language in the set, sample probability data indicating a probability that the sample text would occur in the language. The new technique then uses the sample probability data to automatically obtain language identifying data. The language identifying data identify the language in the set whose sample probability data indicate the highest probability.
In the new technique, the probability data for at least one language include N-gram probability data and the probability data for at least one language include word probability data. The N-gram probability data for a language indicate, for each of a set of N-grams, a probability that the N-gram occurs in a text if the language is the predominant language of the text. The word probability data for a language indicate, for each of a set of words, a probability that the word occurs in a text if the language is the predominant language of the text.
The new technique automatically obtains sample probability data for each of a subset of the languages that includes at least one language with N-gram probability data and at least one language with word probability data. The sample probability data of a language with N-gram probability data include information from the language's N-gram probability data, and the sample probability data of a language with word probability data include information from the language's word probability data.
The new technique can be implemented with probability data for trigrams and with probability data for words of five characters or less. Sample probability data can be obtained for every language that has N-gram or word probability data. At least one language can have both N-gram and word probability data, and sample probability data for each such language can include information both from its N-gram probability data and its word probability data. At least one language can have only N-gram probability data.
In the case where a language has both N-gram and word probability data, the N-gram probability data and the word probability data can include a probability value for each N-gram or word, and the probability values can be used to obtain the language's sample probability data. The probability values can be logarithmic, so that probability values for each N-gram that occurs in the text sample can be added to obtain a total N-gram probability value and probability values for each word that occurs in the text sample can be added to obtain a total word probability value. A constant probability value indicating a low probability can be used for each N-gram or word for which a probability value is not included. The total probability values can then be combined, such as by adding and then dividing by two, to obtain a sample probability value for the language.
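The combination just described, for a language with both N-gram and word probability data, might be sketched in Python as follows. The function names, the UNSEEN constant, and the toy probability tables are hypothetical and chosen only for illustration; the values are not measured from any corpus. The sketch assumes the sample has already been split into N-grams and words.

```python
import math

# Assumed constant log probability for N-grams or words with no stored value.
UNSEEN = math.log(1e-6)

def total_log_prob(items, log_probs, unseen=UNSEEN):
    """Add the logarithmic probability values of the items found in the sample."""
    return sum(log_probs.get(item, unseen) for item in items)

def sample_score(ngrams, words, ngram_log_probs, word_log_probs):
    """Combine the two totals by adding them and dividing by two."""
    ngram_total = total_log_prob(ngrams, ngram_log_probs)
    word_total = total_log_prob(words, word_log_probs)
    return (ngram_total + word_total) / 2

def identify(ngrams, words, models):
    """Return the language whose sample score indicates the highest probability.
    models maps language -> (ngram_log_probs, word_log_probs)."""
    return max(models, key=lambda lang: sample_score(ngrams, words, *models[lang]))

# Made-up toy tables: both languages know the trigrams, only one knows the word.
models = {
    "es": ({"_el": math.log(0.2), "el_": math.log(0.2)}, {"el": math.log(0.5)}),
    "pt": ({"_el": math.log(0.1), "el_": math.log(0.2)}, {"o": math.log(0.5)}),
}
print(identify(["_el", "el_"], ["el"], models))
```

Because the values are logarithms, adding them corresponds to multiplying the underlying probabilities, and dividing the sum of the two totals by two keeps the combined value on a scale comparable to either total alone.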
The new technique can further be implemented in a system that includes text data defining a sample text, probability data as described above, and a processor that uses the text data and probability data to automatically obtain sample probability data as described above. The processor then uses the sample probability data to automatically obtain language identifying data identifying the language whose sample probability data indicate the highest probability.
The new technique can also be implemented in an article of manufacture for use in a system that includes text data defining a sample text and a storage medium access device. The article can include a storage medium and probability data and instruction data stored by the storage medium. The system's processor, in executing the instructions indicated by the instruction data, uses the text data and the probability data to automatically obtain sample probability data as described above. The processor then uses the sample probability data to automatically obtain language identifying data.
The new technique can also be implemented in a method of operating a first machine to transfer data to a second over a network, with the transferred data including probability data and instruction data as described above.
In comparison with conventional techniques for automatically identifying language using only N-grams or using only words, the new technique is advantageous because it combines both approaches in a way that increases overall recognition accuracy without sacrificing the accuracy obtained by each approach separately for large samples. In particular, the new technique enjoys markedly increased accuracy for small samples, and can be successfully used for very small samples, such as user input queries to Internet search engines. The new technique also achieves increased accuracy even where smaller texts in each language are used to obtain N-gram and word probabilities.
The new technique is also advantageous where the set of languages being distinguished includes a pair of closely related languages that share most of their trigrams but have different function words, such as Spanish and Portuguese. In this context, the new technique provides the advantages of N-gram techniques, yet can distinguish the closely related languages based on word probability information.
The new technique is also advantageous because it can be readily extended to additional languages. Specifically, in comparison with conventional language identification techniques that use solely trigrams or other N-grams, the new technique can be more easily extended to additional languages because it does not produce larger confusion matrices as the conventional techniques do.
The new technique has proven to work well with a set of more than 30 languages, and adding more languages to the set does not reduce recognition accuracy for the languages that were already in the set. As a result, the new technique can be readily applied to newly available linguistic data, such as non-English text retrieved from the World Wide Web. The new technique has also been successfully applied to a set that includes languages without distinguishable word boundaries or with multi-byte character sets, such as Chinese and Korean.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.