1. Field of the Invention
The present invention is directed to human language recognition technology and, more particularly, to a system and method that automatically identifies the language in which a text is written.
2. Related Art
Knowledge of the language of a document, referred to as the source language, can enable text-oriented applications such as word processors, presentation managers, search engines and other applications which process, store or manipulate text, to automatically select appropriate linguistic tools (hyphenation, spell checking, grammar and style checking, thesaurus, etc.). Knowledge of the source language also provides the ability to decide whether a text may need translation prior to display and enables documents to be classified and stored automatically and efficiently according to their source language. With respect to electronic communications, language identification enables communications applications to categorize, filter and prioritize messages, query hits and electronically-mailed messages and documents according to a preferred source language.
Generally, text-oriented applications are incapable of identifying the source language in which a given text is written. Instead, these applications typically assume that the text is written in a default source language, usually English, unless the language has been explicitly specified. In the event that the text was not written in the default language, any type of linguistic analysis or classification will fail or lead to indeterminate results.
There are several conventional methods that have been used to identify the source language in which an unknown text is written. One such conventional language identification method, which is based on what is commonly referred to as trigram analysis, is disclosed in U.S. Pat. No. 5,062,143 to Schmitt. A trigram is a sequence of three characters occurring anywhere in a body of text, and may contain blanks (spaces). Schmitt appears to disclose a system that determines a "key set" of trigrams for each language by parsing a sample of text (approximately 5,000 characters) written in each language. The "key set" for a language includes trigrams for which the frequency of occurrence within the unknown text accounts for approximately one third of the total number of trigrams present in the text.
To determine the language of an unknown text, Schmitt parses the text into successive trigrams and then iteratively compares each parsed trigram to the "key set" associated with each language. The number of times each parsed trigram matches a trigram in a key set is counted. Schmitt then calculates a ratio ("hit percent") of the number of matches to the number of trigrams in the unknown text, and compares this hit percent to a predetermined threshold. If the hit percent for the particular language key set is greater than the predetermined threshold, Schmitt records the hit percent. After all language key sets have been processed, the language corresponding to the key set yielding the highest recorded hit percent is identified to be the language of the text. If there is no hit percent that exceeds the predetermined threshold, then no language identification is made.
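By way of illustration only, the hit-percent comparison described above may be sketched as follows. This is a minimal sketch and not the Schmitt implementation: the function names, the data layout of the key sets (a mapping from a language label to a set of trigrams), and the threshold value passed by the caller are all assumptions introduced here for illustration.

```python
def trigrams(text):
    """Return every successive three-character window, blanks included."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def identify_by_trigrams(text, key_sets, threshold):
    """Return the language whose key set yields the highest hit percent
    above `threshold`, or None when no key set exceeds the threshold.

    `key_sets` is a hypothetical mapping: language label -> set of trigrams.
    """
    grams = trigrams(text)
    if not grams:
        return None  # text too short to contain a single trigram
    best_language, best_hit = None, threshold
    for language, key_set in key_sets.items():
        # Count parsed trigrams that match a trigram in this key set.
        hits = sum(1 for gram in grams if gram in key_set)
        hit_percent = hits / len(grams)
        # Only hit percents above the threshold are recorded.
        if hit_percent > best_hit:
            best_language, best_hit = language, hit_percent
    return best_language
```

Note that every trigram of the unknown text is tested against every key set, so the work grows with both the length of the text and the number of represented languages.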
There are several disadvantages to such trigram-based language identification methods. First, a significant number of samples of the unknown text need to be obtained. Take, for example, an unknown text containing the word "monitor" (the average word length in the English language is approximately seven characters) which is separated from other words by spaces (denoted "_"). A trigram-based sampling of this word alone requires seven samples: _mo, mon, oni, nit, ito, tor, or_. The iterative comparison of each sampled trigram to the key set of each language is computationally expensive in both time and resources. In addition, because the identification of the unknown text requires a hit percent which exceeds a predetermined threshold, this approach is inherently based upon an assumption that a sufficient amount of sampled data from the unknown text is available to generate enough hits to exceed the threshold. As a result, this method is often found to be ineffective when used on small samples of unknown text, such as a title or header, that contain so few trigrams as to be incapable of exceeding this predetermined threshold.
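The sample count in the foregoing example can be checked with a short sketch. The helper name `padded_trigrams` and the use of an underscore to stand for the surrounding blanks are illustrative assumptions only.

```python
def padded_trigrams(word, pad="_"):
    """List the trigrams of a single word bounded by one blank on each
    side, with the blank written as `pad` for readability."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


samples = padded_trigrams("monitor")
print(samples)       # ['_mo', 'mon', 'oni', 'nit', 'ito', 'tor', 'or_']
print(len(samples))  # 7
```

A word of n characters bounded by blanks thus yields n trigram samples, each of which must be compared against every language key set.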
Another conventional method used to identify the source language in which an unknown text is written is described in U.S. Pat. No. 5,548,507 to Martino et al. Martino appears to teach a language identification process using coded language words. Martino generates Word Frequency Tables (WFTs), each of which is associated with a language of interest. The WFT for a particular language contains relatively few words that are statistically determined to be the most frequent in the given language, based on a very large number of sample documents in that language. The sample documents for the represented language form a training corpus. The WFT also contains a Normalized Frequency Occurrence (NFO) value representing the frequency of occurrence of each word in the language. Associated with each WFT is an accumulator that stores and accumulates NFO values.
To determine the identity of an unknown text, a word from the text is compared to all the words in all of the WFTs. Whenever a word is found in any WFT, that word's associated NFO value is added to a current total in the associated accumulator. In this manner, the totals in the associated accumulators increase as additional words are successively sampled from the unknown text. Processing stops either at the end of the unknown text file, or after a predetermined number of sampled words have been processed. The language corresponding to the accumulator with the highest total NFO value is identified as the language of the text.
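The accumulator scheme described above may be sketched as follows. This is an illustrative sketch rather than the Martino implementation: the function name, the dictionary layout of the word frequency tables, whitespace tokenization, and the tie-breaking behavior (the first language listed wins when totals are equal) are assumptions made here, and at least one table is assumed to be supplied.

```python
def identify_by_word_frequency(text, word_frequency_tables, max_words=None):
    """Accumulate NFO values per language and return the language with the
    highest total, together with all accumulator totals.

    `word_frequency_tables` is a hypothetical mapping:
    language label -> {word: NFO value}.
    """
    accumulators = {language: 0.0 for language in word_frequency_tables}
    words = text.lower().split()
    if max_words is not None:
        # Stop after a predetermined number of sampled words.
        words = words[:max_words]
    for word in words:
        # Each sampled word is compared against every table.
        for language, table in word_frequency_tables.items():
            accumulators[language] += table.get(word, 0.0)
    best = max(accumulators, key=accumulators.get)
    return best, accumulators
```

A text containing none of the tabulated words leaves every accumulator at zero, which illustrates why short or highly technical texts are difficult for this approach.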
The method described in Martino appears to provide advantages over Schmitt's trigram-based language identification. However, according to Martino, the method requires approximately 100 words to be read from the unknown text to identify the language in which it is written, and several hundred words are preferred. In addition to the large number of samples which must be taken, the success of the Martino device is dependent upon the type of unknown text. Because the most frequently occurring words in most languages are predominantly function words such as pronouns, articles and prepositions, this method has limited success when the unknown text does not contain such words. For example, the Martino device appears to have limited success when applied to highly technical documents and small texts such as a title, header or query. Furthermore, significant time and expense is required to generate the word frequency tables.
What is needed, therefore, is a system that efficiently and accurately identifies the language in which a text is written when provided with a relatively small number of samples of the text. System performance should not be adversely affected by a limited training corpus and should not depend on the content or length of the unknown text.
The above and other drawbacks of conventional language identification systems are overcome by the language identification system of the present invention. One aspect of the invention is a language identification system for automatically identifying a language in which an unknown input text is written based upon a probabilistic analysis of predetermined portions of words sampled from the input text which reflect morphological characteristics of natural languages.
In another aspect of the invention an automatic language identification system is disclosed. The automatic language identification system determines in which language of a plurality of represented languages a given text is written. This determination is based upon a value representing the relative likelihood that the text is written in a particular one of the represented languages due to the presence in the text of a predetermined character string that contains morphological features of the represented languages. The relative likelihood is derived from a relative frequency of occurrence of the character strings in each of a plurality of language corpuses, with each language corpus corresponding to one of the plurality of represented languages. In one embodiment, the predetermined character string is a three-character word ending.
In another aspect of the invention an automatic language identification system for determining the source language of a text is disclosed. The automatic language identification system comprises a language corpus analyzer that generates, for each of a plurality of fixed-length suffixes extracted from at least one of a plurality of language corpuses, a plurality of probabilities associated with the suffix and one of the plurality of represented languages. Each of the language corpuses represents a natural language and each of the probabilities represents a relative likelihood that the text is the associated language due to the presence of the associated suffix in the text. The relative likelihood is derived from a relative frequency that the associated suffix occurs in each of the plurality of language corpuses. The automatic language identification system also comprises a language identification engine. The language identification engine determines, for each of the represented languages, an arithmetic sum of the relative probabilities for all the suffixes which appear in the text. The source language is determined to be the represented language having a greatest arithmetic sum of relative probabilities.
In one embodiment of this aspect of the invention the language corpus analyzer includes a means for generating, for each of the plurality of language corpuses, a frequency list that includes a normalized frequency value indicating a number of times the extracted suffix appears in a corresponding language corpus. The language corpus analyzer also includes a means for generating a probability table that contains the above plurality of probabilities based upon the normalized frequency values in the frequency lists.
In another embodiment of this aspect of the invention, the language corpus analyzer also includes a parser for parsing the language corpuses to generate parsed words, and a suffix extractor for extracting the suffixes from each of the parsed words. In another embodiment of this aspect of the invention the language corpus analyzer also includes a format filter for formatting the language corpuses for the language corpus analyzer.
In still another embodiment of this aspect of the invention the language identification engine includes a language determinator that accumulates the relative likelihood values which are associated with one of the languages and which are associated with the suffixes which appear in the given input text. In another embodiment of this aspect of the invention the probabilities are set to a predetermined negative value when the associated suffix does not appear in the language corpus corresponding to the associated language. In this embodiment, the arithmetic sum of relative probabilities is to exceed zero for a represented language to be considered to be the source language of the text. Specifically, the suffix is preferably a predetermined number of characters at the end of a word. The word ending is the right-most predetermined number of characters in a left-to-right alphabetic language and the left-most predetermined number of characters in a right-to-left alphabetic language.
Preferably, each of the plurality of language corpuses is derived from a plurality of documents in each of a variety of sources. Also, the language identification engine is preferably implemented in software and the probability table is a finite state machine embodied in software and compiled with the language identification engine.
In another aspect of the invention an automatic language identification system for determining a source language of a text is disclosed. The automatic language identification system includes a first means for generating, for each of a plurality of three character word endings extracted from at least one of a plurality of language corpuses, a plurality of probabilities associated with the word ending and one of the represented languages. Each of the language corpuses represents a natural language and each of the probabilities represents a relative likelihood that the text is the associated language due to the presence of the associated word ending in the text. The relative likelihood is derived from a relative frequency with which the associated word ending occurs in each of the language corpuses. The automatic language identification system also includes a second means for determining, for each of the represented languages, an arithmetic sum of the relative probabilities for all of the word endings which appear in the text. The source language is determined to be the represented language having the greatest arithmetic sum of relative probabilities.
In one embodiment of this aspect of the invention, the first means includes a third means for generating, for each of the language corpuses, a normalized frequency list. The frequency list includes a normalized frequency value indicating a number of times the extracted word ending appears in a corresponding language corpus. The first means also includes a fourth means for generating a probability table containing said plurality of probabilities using the normalized frequency values in the normalized frequency lists.
In another embodiment of this aspect of the invention, the language identification engine includes a language determinator means for accumulating the relative likelihood values associated with one of the languages which is associated with the word endings which appear in the text.
In another embodiment of this aspect of the invention, the probabilities are set to a predetermined negative value when the associated word ending does not appear in the language corpus corresponding to the associated language.
In another aspect of the present invention, a method for identifying the language of a text is disclosed. The language is determined to be one of a plurality of languages, each of which is represented by a language corpus. The method comprises the steps of: a) parsing a word from a language corpus; b) extracting all suffixes from the parsed words, the suffixes defined as being the last three characters of a word; c) updating a suffix frequency list corresponding to the language corpus with the extracted suffix and a normalized frequency of occurrence of the extracted suffix in the corresponding language corpus; d) repeating steps a) through c) for each of the plurality of language corpuses to result in a plurality of frequency lists, each associated with one language corpus; and e) creating a probability table containing probabilities derived from the normalized frequency of occurrences and representing a relative likelihood that the language of the text is one of the represented languages due to the presence of the suffix.
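Steps a) through e) may be sketched as follows, by way of example only. The text above does not give the formula by which the probabilities are derived from the normalized frequencies; this sketch assumes that the relative likelihood of a language for a given suffix is that language's normalized frequency divided by the sum of the normalized frequencies over all represented languages. The function names, whitespace tokenization, lower-casing, and the skipping of words shorter than three characters are likewise assumptions.

```python
from collections import Counter


def frequency_list(corpus_text, length=3):
    """Steps a) to c): parse words, extract the last `length` characters of
    each, and return {suffix: normalized frequency} for one corpus."""
    # Assumption: words shorter than the suffix length are skipped.
    words = [w for w in corpus_text.lower().split() if len(w) >= length]
    counts = Counter(word[-length:] for word in words)
    total = sum(counts.values())
    return {suffix: n / total for suffix, n in counts.items()} if total else {}


def probability_table(corpora, length=3):
    """Steps d) and e): build one frequency list per corpus, then a table
    {suffix: {language: relative likelihood}}.

    `corpora` is a hypothetical mapping: language label -> corpus text.
    """
    lists = {lang: frequency_list(text, length) for lang, text in corpora.items()}
    suffixes = set().union(*lists.values())
    table = {}
    for suffix in suffixes:
        # Assumed normalization: share of this suffix's frequency per language.
        total = sum(freqs.get(suffix, 0.0) for freqs in lists.values())
        table[suffix] = {
            lang: freqs.get(suffix, 0.0) / total for lang, freqs in lists.items()
        }
    return table
```

Under this assumed normalization, a suffix that occurs in only one corpus receives a likelihood of 1.0 for that language and 0.0 for every other language.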
In one embodiment of this aspect of the invention, the method also includes the step of: f) substituting a negative value in the table for all probabilities having a value of zero. In another embodiment, the method includes the additional steps of: g) parsing the text to generate a sample word; h) extracting a suffix from the sample word; i) retrieving the associated probabilities for the extracted sample suffix in the probability table; j) summing, for each of the represented languages, the associated probabilities retrieved from the probability table, resulting in an accumulated relative likelihood value that the language of the text is the corresponding language due to the appearance of the sampled suffix in the text; k) repeating steps g) through j) for a predetermined number of sampled words; and l) selecting one of the represented languages as the language of the text, provided the selected language has a highest accumulated probability value greater than zero.
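The substitution, summation and selection steps may be sketched as follows, for illustration only. The probability table layout ({suffix: {language: probability}}), the penalty value of -1.0, whitespace tokenization, the skipping of words shorter than three characters, and the treatment of suffixes absent from the table (they contribute nothing) are assumptions of this sketch and are not specified above.

```python
def penalize_zeros(table, penalty=-1.0):
    """Substitute an assumed negative value for every zero probability."""
    return {
        suffix: {lang: (p if p > 0 else penalty) for lang, p in row.items()}
        for suffix, row in table.items()
    }


def identify_language(text, table, languages, max_words=None, length=3):
    """Sum the retrieved probabilities per language over the sampled words
    and select the language with the highest total, provided that total
    exceeds zero; otherwise return None. Also returns all totals."""
    totals = {lang: 0.0 for lang in languages}
    words = [w for w in text.lower().split() if len(w) >= length]
    if max_words is not None:
        # Stop after a predetermined number of sampled words.
        words = words[:max_words]
    for word in words:
        row = table.get(word[-length:])
        if row is None:
            continue  # assumption: suffix seen in no corpus adds nothing
        for lang in languages:
            totals[lang] += row[lang]
    best = max(totals, key=totals.get)
    return (best if totals[best] > 0 else None), totals


# Toy table with hypothetical values, for demonstration only.
toy_table = penalize_zeros({
    "ing": {"en": 1.0, "fr": 0.0},
    "ion": {"en": 0.4, "fr": 0.6},
    "son": {"en": 0.0, "fr": 1.0},
})
language, totals = identify_language("la maison nation", toy_table, ["en", "fr"])
```

Only one table lookup is made per sampled word, and the negative entries allow a suffix that never occurs in a language to count against that language rather than merely failing to count for it.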
Embodiments of the present invention can be used for all text-oriented software, ranging from the traditional office applications (word processors, presentation managers, spelling checkers, etc.) and document management systems, to emerging markets for Intranet/internet applications (browsers, search engines) and information retrieval systems. In particular, the present invention can be used to enable any text-oriented application to more accurately and efficiently identify the source language of a given text without any user intervention. Furthermore, applications handling incoming documents of unknown and unpredictable origin can dynamically classify these documents according to their source language. Similarly, information retrieval systems for Intranet/internet search engines can provide enhanced functionality by filtering documents that are not in a user's native language according to the user's language profile.
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings. In the drawings, like reference numerals indicate like or functionally similar elements. Additionally, the left-most one or two digits of a reference numeral identify the drawing in which the reference numeral first appears.