The subject invention relates generally to human language recognition technology. More particularly, the invention relates to a technique for identifying the language used in a computerized document.
Computers and computer networks have intensified the transmission of coded documents between people who speak and write in different natural languages. The internet has recently accelerated this process. This results in several problems. In the prior art, for example, when an electronic document was sent across national boundaries, computer system operations were interrupted so that a human being could determine the natural language of the received document before performing a given operation, such as selecting, displaying, or printing, that may depend upon the peculiarities of a given natural language. In the context of an internet search, unless the user is multilingual, he is likely to be interested only in the retrieved documents that are in his native language.
The invention described herein eliminates the need for such human intervention by automatically determining the correct natural language of the computer recorded document.
Prior to the applicants' own contributions to the art, the general problem was recognized in the prior art. In the area of automated language identification of coded text, the prior art used n-gram character based systems, which handle each character multiple times, a process which consumes a great deal of system resource when compared to the applicants' word-based technique described below. In speech recognition systems, language recognition uses language and speech characteristics, e.g., trigrams or emphasis, which require large amounts of text to be parsed and measured, and large amounts of time for processing. These techniques are based on some form of matching algorithm based on language statistics that are not meaningful in a linguistic context.
Prior systems using trigrams, n-grams, and other artificial divisions in a computerized text are not considered reliable. They are also very slow and consume considerable computer time, as they handle each character of a document multiple times, e.g., each document character appears in three different trigrams. Characteristics that are measured in, or derived from, written languages but are not actual components of them, such as trigrams or letter sequences, have limited success in identifying the correct language, and require large amounts of text to be parsed and measured. Similarly, prior systems which depend on the attributes of individual characters and their local contexts are also limited when applied to the problem of identifying a language.
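The cost noted above, that each document character is handled in three different trigrams, can be illustrated with a brief sketch. The function names `trigrams` and `trigrams_covering` are hypothetical and chosen only for this illustration; they are not part of any prior art system.

```python
def trigrams(text):
    """Return every overlapping three-character window in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def trigrams_covering(text, index):
    """Return the trigrams that include the character at position index."""
    first_start = max(0, index - 2)
    last_start = min(index, len(text) - 3)
    return [text[i:i + 3] for i in range(first_start, last_start + 1)]


sample = "language"
windows = trigrams(sample)
# An interior character is read once per window that contains it.
interior = trigrams_covering(sample, 3)
print(windows)
print(interior)
```

For the eight-character sample, six trigrams are formed, so eighteen character reads are performed, and every character other than the first two and last two is read three times.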
In the parent application and the invention described herein, none of the prior art techniques, e.g., classifying language by signal waveform characteristics, trigrams, n-grams, or artificial divisions of written language, were used. In both inventions, words are read from a computer document and compared to predetermined lists of words selected from a plurality of languages of interest. The word lists comprise relatively few of the most commonly used words in each language; statistically, a significant percentage of all words in any document will be the most common words used in its language. The language or genre of the document is identified by a process that determines which language's word-list most closely matches the words in the document. In the parent application, the closeness of match is determined by the weight of the normalized frequency of occurrence of listed words in each language or genre of interest. Each language's word-list and the associated frequency of occurrence for each word in the list is kept in a Word Frequency Table (WFT). The WFT is linked with a respective accumulator whose value is increased each time a word from an inputted document matches one of the common words in one of the tables. In the parent application, the process adds the word's Normalized Frequency of Occurrence (NFO), as found in the WFT, to the current sum in the accumulator associated with the respective language. When processing stops, the identified language is the language associated with the highest-valued accumulator. Processing may stop either by reaching the end of the document or by achieving a predetermined confidence in the accumulated discrimination.
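The weighted accumulation of the parent application may be sketched as follows. This is a minimal illustration only: the table contents and NFO values are invented placeholders rather than measured statistics, the name `identify_weighted` is hypothetical, and the regular-expression tokenizer is an assumption, since the text above does not specify how words are delimited. The confidence-based early stop is omitted; the sketch simply reads to the end of the document.

```python
import re

# Hypothetical Word Frequency Tables: word -> normalized frequency of occurrence.
# Values are placeholders for illustration, not measured language statistics.
WFT = {
    "english": {"the": 1.0, "of": 0.5, "and": 0.45, "in": 0.3},
    "german": {"der": 1.0, "die": 0.9, "und": 0.8, "in": 0.6},
}


def identify_weighted(text, tables):
    """Add each matched word's NFO to its language's accumulator."""
    accumulators = {lang: 0.0 for lang in tables}
    # Assumed tokenization: runs of letters, lower-cased.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, table in tables.items():
            nfo = table.get(word)
            if nfo is not None:
                accumulators[lang] += nfo
    # The identified language has the highest-valued accumulator.
    best = max(accumulators, key=accumulators.get)
    return best, accumulators


best, sums = identify_weighted("The cat and the dog in the house", WFT)
print(best, sums)
```

Note that a word appearing in more than one table, such as "in" here, contributes to every accumulator whose table lists it.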
In the invention which is the subject of this application and which is more fully described below, it has been determined that weighting in the accumulation process described in the parent application can be eliminated if the actual frequency of occurrence of words in each of the candidate natural languages can be established and word tables having a substantially equivalent coverage of the respective candidate languages assembled.
It is therefore an object of the invention to identify the natural language in which a computer stored document is written from a plurality of candidate languages in a most efficient manner.
This object and others are accomplished by a technique for identifying a language in which a computer document is written. Words from the document are compared to words in a plurality of word tables. Each of the word tables is associated with a respective candidate language and contains a selection of the most frequently used words in the language. The words in each word table are selected based on the frequency of occurrence in a candidate language so that each word table covers a substantially equivalent percentage of the associated candidate language. A count is accumulated for each candidate language each time one of the plurality of words from the document is present in the associated word table. In the simple counting embodiment of the invention, the count is incremented by one. The language of the document is identified as the language associated with the count having the highest value.
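The simple counting embodiment summarized above may be sketched as follows. The word tables shown are tiny hypothetical examples, not the tables of the invention, and are not balanced for coverage; the name `identify_language` and the tokenizer are likewise assumptions made for illustration.

```python
import re

# Hypothetical word tables; real tables would hold the most frequently
# used words of each candidate language, selected for equivalent coverage.
WORD_TABLES = {
    "english": {"the", "of", "and", "to", "in"},
    "french": {"le", "la", "de", "et", "les"},
    "spanish": {"el", "la", "de", "y", "que"},
}


def identify_language(text, tables):
    """Increment a language's count by one for each document word in its table."""
    counts = {lang: 0 for lang in tables}
    # Assumed tokenization: runs of letters, lower-cased.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, table in tables.items():
            if word in table:
                counts[lang] += 1
    # The identified language is the one with the highest count.
    return max(counts, key=counts.get), counts


language, counts = identify_language("Le chat et la souris de la maison", WORD_TABLES)
print(language, counts)
```

In this sketch a tie is resolved in favor of the first language listed, which is a choice of the illustration and not a rule stated in the text.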
Language determination by this invention is very fast, because only a relatively small number of words need to be read from any document to reliably determine its language or genre.
Further, an advantage of the present invention is that only a few words, e.g., 25-200, need be contained in the Word Frequency Table for each candidate language of interest, so that in practice each word is compared with only a relatively small number of words for reliable language recognition. As discussed below, it is important that the words selected for the word frequency tables for each language cover a commensurate percentage of the frequency of occurrences in their respective languages.
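One way to assemble tables of commensurate coverage is sketched below, under stated assumptions: the most frequent words of a language are added until their combined share of all word occurrences reaches a target. The name `build_word_table`, the target value, and the corpus counts are hypothetical and serve only to illustrate the idea; the source does not prescribe this procedure.

```python
def build_word_table(word_counts, target_coverage):
    """Select the most frequent words until target_coverage is reached.

    word_counts maps each word to its number of occurrences in a corpus of
    one language; target_coverage is a fraction between 0 and 1.
    """
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("word_counts must contain at least one occurrence")
    table = []
    covered = 0
    ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    for word, count in ranked:
        if covered / total >= target_coverage:
            break
        table.append(word)
        covered += count
    return table, covered / total


# Invented counts: one language concentrates usage in few words, one does not.
skewed = {"the": 40, "of": 20, "and": 15, "to": 10, "cat": 5, "dog": 5, "house": 5}
flat = {"a": 20, "b": 20, "c": 20, "d": 20, "e": 20}
print(build_word_table(skewed, 0.5))
print(build_word_table(flat, 0.5))
```

With these invented counts the two tables differ in length, two words against three, yet each covers about the same share of its language, which is the property that allows unweighted counts to be compared across languages.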