As large data networks span the globe to make the online world truly a multinational community, there is still no single human language in which to communicate. Electronic messages and documents remain written in a particular human language, such as German, Spanish, Portuguese, Greek, or English. In many situations, there is a need to quickly identify the human language of a particular document in order to further process the document. For example, identification of the document's human language may help when a user or system attempts to index or classify the document. In another situation, a word processor may need to determine the language of the document in order to use the appropriate spell checking, grammar checking, and language translation tools and libraries.
There are a variety of known methods for identifying the human language of text within an electronic document. In one method, a table of frequent function words is maintained for a variety of human languages. Examples of such frequent function words in the English language include the words "the," "a," "which," and "you." For a particular document, a count is performed to determine how many of the frequent function words of each language are found. The language whose function words appear most often is identified as the language of the document. Unfortunately, this method typically requires reading a very long section of the document, because a large amount of input is needed before an accurate determination of the document's language can be made. Furthermore, this method becomes problematic as the number of possible languages increases, since it becomes more difficult to distinguish between the languages.
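The function-word counting just described can be sketched as follows. The word lists, language set, and function name here are illustrative assumptions for the sketch, not part of any actual implementation of the method:

```python
# Illustrative sketch of the function-word counting method.
# The word lists below are tiny samples chosen for demonstration only.
FUNCTION_WORDS = {
    "English": {"the", "a", "which", "you", "and", "of"},
    "German": {"der", "die", "das", "und", "ein", "ich"},
    "Spanish": {"el", "la", "que", "de", "y", "un"},
}

def identify_by_function_words(text):
    """Count how many tokens of the text are known function words of
    each language, and return the language with the highest count."""
    tokens = text.lower().split()
    scores = {lang: sum(1 for t in tokens if t in words)
              for lang, words in FUNCTION_WORDS.items()}
    return max(scores, key=scores.get)
```

As the passage notes, such a counter only becomes reliable when the input contains many function words, i.e., when the sampled text is long.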
Another method for identifying a document's language uses a set of predetermined rules regarding the occurrence of particular letters, or sequences of letters, that are unique to a specific human language. For example, the letter "å" is treated as unique to the Swedish language. Thus, any document containing the letter "å" is determined to be in the Swedish language. In another example, words ending in the letter sequence "ção" are treated as unique to the Portuguese language as written in Brazil. However, as with the previous method, the use of a set of predetermined rules can become problematic with a large number of potential languages. Additionally, this method does not perform well with only a limited or small amount of input text.
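The rule-based approach above can be sketched as a list of (language, predicate) pairs. The two rules encode exactly the examples given in the passage; the rule table and function name are assumptions made for this sketch:

```python
# Illustrative sketch of the rule-based method: each rule pairs a
# language with a letter or word ending treated as unique to it.
RULES = [
    ("Swedish", lambda w: "å" in w),              # the letter "å"
    ("Portuguese", lambda w: w.endswith("ção")),  # words ending in "ção"
]

def identify_by_rules(text):
    """Return the first language whose rule matches a word of the text,
    or None if no rule matches (language undetermined)."""
    for word in text.lower().split():
        for language, rule in RULES:
            if rule(word):
                return language
    return None
```

Note that when the input is short, there may simply be no word that triggers any rule, which mirrors the passage's observation that the method performs poorly on limited input text.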
A third and popular method for identifying a document's language is known as the "tri-gram" method. In the tri-gram method, a training document representing each language is used to create a table, or profile, for that language. More particularly stated, a three-letter window is slid over a training document in a particular language, and the method counts the occurrences of each three-letter sequence appearing in the window. This yields a language profile, called a tri-gram language profile, that characterizes how often specific three-letter sequences appear in the particular language. The process is repeated for all of the languages to provide a set of tri-gram profiles, one per language. When attempting to determine the language of an unknown document, a similar three-letter window is slid over the unknown document. For each three-letter sequence within the unknown document, the method seeks a matching three-letter sequence in each of the tri-gram profiles. If a match is found for a particular language, the frequency information within that language's tri-gram profile for the matched three-letter sequence is added to a cumulative score for that language. In this manner, the cumulative score for each language is incremented as the window is slid over the whole unknown document. The language having the highest cumulative score is then deemed the language of the unknown document.
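The two phases of the tri-gram method, building per-language profiles from training text and scoring an unknown text against them, can be sketched as follows. The function names and the use of raw counts as frequency information are assumptions made for this sketch:

```python
from collections import Counter

def trigram_profile(training_text):
    """Slide a three-letter window over the training text and count
    each three-letter sequence, yielding a tri-gram profile."""
    return Counter(training_text[i:i + 3]
                   for i in range(len(training_text) - 2))

def identify_by_trigrams(unknown_text, profiles):
    """For each language, add that language's stored frequency of every
    tri-gram of the unknown text to a cumulative score, then return the
    language with the highest cumulative score."""
    scores = {}
    for language, profile in profiles.items():
        scores[language] = sum(profile[unknown_text[i:i + 3]]
                               for i in range(len(unknown_text) - 2))
    return max(scores, key=scores.get)
```

A profile would be built once per language, e.g. `profiles = {"English": trigram_profile(english_training_text), ...}`, and reused for every unknown document. The nested scan over every tri-gram of the input against every language profile illustrates why the method is comparatively computationally intensive, as noted below.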
Unfortunately, the tri-gram method is typically computationally intensive when compared to the other two illustrative methods described above. Furthermore, the tri-gram method may have problems accurately identifying the language of a document based on only a small amount of input text from the document.
Some commentators have suggested variations to improve the tri-gram method, such as using longer training documents. Another known variation uses a similar method but looks for two letters (bi-gram), four letters (quad-gram), or some other predefined number of letters within the window. However, all of the above-described methods for identifying a document's language typically require relatively large amounts of input text from the unknown document in order to accurately determine the correct language.
Accurately identifying the language of input text can be problematic when the input text is merely a sentence or some other short length of text. For example, it may be desirable to recognize the language of a search query to a World Wide Web search engine on the Internet. The ability to quickly identify the language of the search query allows such a search engine to limit the search to documents that match the language of the search query. Therefore, there is a need for a system for identifying the language of a sample input of text that (1) is quick, (2) remains accurate, and (3) can accurately identify the language when the sample input is very short.