It is often desirable to programmatically determine the language (English, French, German, etc.) of a given sample text. One way to accomplish this is by comparing the sample text to reference texts of different languages.
In practice, such comparisons may be performed by first identifying n-grams of the sample text and of the reference texts, and by statistically comparing the n-grams. In general, an n-gram is an ordered sequence of data elements found in a larger sequence of data elements. With respect to text, an n-gram may be a sequence of n words or n characters, where n may be any integer larger than zero. In the context of language comparison, an n-gram is usually a sequence of characters. Thus, the n-grams of a particular text may include all possible substrings of size n that can be extracted from the text, including overlapping substrings. In some cases, the n-grams may be limited to characters that occur adjacently. In other cases, n-grams may include sequences in which the characters are found in a given sequence, but not necessarily adjacent to each other. Text is often normalized before identifying n-grams, such as by removing white space and punctuation, and by converting to a single case (uppercase or lowercase).
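The normalization and extraction of adjacent character n-grams described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names `normalize` and `char_ngrams` are hypothetical, and the choice to strip every non-alphanumeric character and to lowercase the text is one assumed normalization among several possible.

```python
import re


def normalize(text):
    # Assumed normalization: lowercase, then remove white space,
    # punctuation, and underscores.
    return re.sub(r"[\W_]+", "", text.lower())


def char_ngrams(text, n):
    # Return every overlapping substring of length n, in order of occurrence.
    if n < 1:
        raise ValueError("n must be an integer larger than zero")
    s = normalize(text)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("To be!", 2))
```

A text shorter than n characters after normalization yields an empty list, since no substring of size n can be extracted from it.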
In order to determine the likelihood that a sample text corresponds to the language of a reference text, n-gram statistics for the sample text and the reference text can be calculated and compared. This can be done with respect to reference texts of multiple languages, in an attempt to determine which of the reference texts produces the best correspondence with the sample text.
Some methods of performing this analysis involve probability analysis. Specifically, when determining the likelihood that a sample text corresponds to the language of a reference text, each n-gram of the sample text is analyzed with respect to the reference text: for each n-gram, the analysis calculates the Bayesian probability that the n-gram might belong to the reference text. The calculated probabilities for multiple n-grams are then combined in some manner to indicate an overall probability of the sample text corresponding to the language of the reference text.
The Bayesian probability for an individual n-gram with respect to a particular language reference can be calculated in accordance with the conventional Bayesian formulation. In Bayesian terminology, the probability of a particular n-gram corresponding to a particular language reference is indicated symbolically as P(A|B), where B represents the occurrence of the n-gram, A represents the result that the n-gram is of the given language, and P(A|B) indicates the probability of A given B. P(A|B) can be calculated by the following equation:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In this equation, P(B|A) is the probability of B given A, which in this scenario is the probability or frequency with which the given n-gram occurs within the language reference, relative to other n-grams. For example, a particular n-gram may occur once in every 1000 n-grams of the reference, which may be represented as 0.001 or 0.1%.
P(B) represents the probability or frequency with which the individual n-gram occurs within all of the available language references, relative to other n-grams. For example, a particular n-gram may occur once in every 10,000 n-grams when evaluated with respect to the n-grams of all available language references, which may be represented as 0.0001 or 0.01%.
P(A) represents the probability, apart from any other factors, of any unknown n-gram being of a particular language. For many implementations, it may be assumed that every language has the same probability of occurrence, and this factor may therefore be removed or ignored for purposes of comparing between different languages. In other implementations, this factor may be a constant that is set for each individual language.
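The three terms can be combined in a short sketch. It assumes that n-gram counts for each language reference are already available as dictionaries; the function name `ngram_posterior`, the `counts_by_language` structure, and the optional `priors` argument are hypothetical names introduced for illustration. When no priors are supplied, every language is assumed to be equally probable.

```python
def ngram_posterior(ngram, language, counts_by_language, priors=None):
    # P(B|A): relative frequency of the n-gram within one language reference.
    lang_counts = counts_by_language[language]
    p_b_given_a = lang_counts.get(ngram, 0) / sum(lang_counts.values())

    # P(B): relative frequency of the n-gram across all available references.
    total = sum(sum(c.values()) for c in counts_by_language.values())
    occurrences = sum(c.get(ngram, 0) for c in counts_by_language.values())
    p_b = occurrences / total
    if p_b == 0:
        return 0.0  # n-gram never seen in any reference

    # P(A): assumed uniform across languages unless a prior is supplied.
    p_a = priors[language] if priors else 1.0 / len(counts_by_language)

    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b


if __name__ == "__main__":
    counts = {
        "en": {"th": 3, "he": 1},
        "fr": {"le": 3, "th": 1},
    }
    print(ngram_posterior("th", "en", counts))
    print(ngram_posterior("th", "fr", counts))
```

With a uniform prior, the results sum to one across languages only when the references contain the same number of n-grams. For references of unequal size, the values are better read as relative scores for comparing languages.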
The process above results in a probability value for every n-gram of the sample text with respect to a reference language text. These calculated n-gram probabilities may be analyzed statistically to determine an overall likelihood that the sample text corresponds to the language of the reference language text. The overall likelihoods corresponding to different languages can then be compared to determine which language the sample text is most likely to represent.
Analyzing or combining the individual n-gram probabilities to create an overall evaluation of the sample text with respect to a particular reference text is typically accomplished by creating an ordered vector corresponding to the sample text, in which the vector contains all n-grams of the sample text in their order of probability. Similar vectors are created for the reference texts. A difference measurement is then calculated between the sample text vector and each of the reference text vectors, and the reference text having the smallest difference measurement is considered to represent the most likely language of the sample text. The difference measurements may be calculated in some embodiments as the edit distances between the sample text vector and the reference text vectors.
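The vector comparison can be illustrated with a short sketch under stated assumptions. Here the n-grams are ordered by their raw frequency within each text, which stands in for the order of probability. Ties are broken alphabetically so that the ordering is deterministic. The difference measurement is a Levenshtein edit distance computed over whole n-grams. The names `ranked_ngrams`, `edit_distance`, and `closest_language`, and the optional `top` cutoff, are hypothetical.

```python
from collections import Counter


def ranked_ngrams(text, n=2, top=None):
    # Normalize, count character n-grams, and order them by descending
    # frequency (alphabetical order breaks ties).
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return ordered[:top]


def edit_distance(a, b):
    # Levenshtein distance between two sequences, treating each
    # n-gram as a single symbol.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def closest_language(sample, references, n=2, top=None):
    # Compare the sample vector with each reference vector and return
    # the language with the smallest difference measurement.
    sample_vec = ranked_ngrams(sample, n, top)
    distances = {
        lang: edit_distance(sample_vec, ranked_ngrams(ref, n, top))
        for lang, ref in references.items()
    }
    return min(distances, key=distances.get), distances
```

In such a sketch, limiting both vectors to the same number of highest-ranked n-grams through `top` keeps a long reference text from dominating the distance purely through its larger number of distinct n-grams.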