The subject invention relates generally to human language recognition technology. More particularly, the invention relates to a technique for identifying the language used in a computerized document.
Computers and computer networks have intensified the transmission of coded documents between people who speak and write in different natural languages, and the internet has recently accelerated this process. This results in several problems. In the prior art, for example, when an electronic document was sent across national boundaries, computer system operations were interrupted so that a human being could determine the natural language of the received document before performing a given operation, such as selecting, displaying, or printing, that may depend upon the peculiarities of a given natural language. In the context of an internet search, unless the user is multilingual, he is likely to be interested only in the retrieved documents in his native language or, at any rate, only in those languages he reads.
The invention described herein eliminates the need for such human intervention by automatically determining the correct natural language of the computer-recorded document.
The general problem was recognized in the art prior to the applicants' own contributions. In the area of automated language identification of coded text, the prior art used character-based n-gram systems, which handle each character multiple times, a process which consumes a great deal of system resources when compared to the applicants' word-based technique described below. In speech recognition systems, language recognition uses language and speech characteristics, e.g., trigrams or emphasis, which require large amounts of text to be parsed and measured and large amounts of time for processing. These techniques rely on some form of matching algorithm that uses language statistics that are not meaningful in a linguistic context.
Prior systems using trigrams, n-grams, and other artificial divisions of a computerized text are not considered reliable. They are also very slow and consume considerable computer time, because they handle each character of a document multiple times; e.g., each document character appears in three different trigrams. Characteristics that are measured or derived from written languages but are not actual components of them, such as trigrams or letter sequences, have had limited success in identifying the correct language and require large amounts of text to be parsed and measured. Similarly, prior systems which depend on the attributes of individual characters and their local contexts are also limited when applied to the problem of identifying a language.
In the invention described herein, none of the prior art techniques, e.g., classifying language by signal waveform characteristics, trigrams, n-grams, or artificial divisions of written language, are used. In both the parent application and the present invention, words are read from a computer document and compared to predetermined lists of words selected from a plurality of languages of interest. The word lists comprise relatively few of the most commonly used words in each language; statistically, a significant percentage of all words in any document will be the most common words used in its language. The language or genre of the document is identified by a process that determines which language's word list most closely matches the words in the document.
In the parent application, the applicants have taught that the closeness of match can be determined by the sum of the normalized frequency of occurrence of listed words in each language or genre of interest. Each language's word list and the associated frequency of occurrence for each word in the list are kept in a word table. Each word table is linked with a respective accumulator whose value is increased each time a word from an input document matches one of the common words in that table. The process adds the word's normalized frequency of occurrence, as found in the word table, to the current sum in the accumulator associated with the respective language. When processing stops, the identified language is the language associated with the highest-valued accumulator. Processing may stop either by reaching the end of the document or by achieving a predetermined confidence in the accumulated discrimination.
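The weighted accumulation described above can be illustrated with a minimal sketch in Python. It is offered only as an illustration under stated assumptions, not as the applicants' implementation: the name `identify_language`, the three candidate languages, and the frequency values in `WORD_TABLES` are hypothetical placeholders rather than measured data, and the sketch stops only at the end of the document, omitting the confidence-based stopping condition.

```python
import re

# Hypothetical word tables: each maps a common word to an assumed
# normalized frequency of occurrence. The values are illustrative only.
WORD_TABLES = {
    "english": {"the": 0.068, "of": 0.035, "and": 0.028, "to": 0.026, "in": 0.021},
    "german": {"der": 0.035, "die": 0.034, "und": 0.029, "in": 0.018, "den": 0.012},
    "french": {"de": 0.050, "la": 0.032, "le": 0.028, "et": 0.025, "les": 0.021},
}


def identify_language(text, word_tables=WORD_TABLES):
    """Return (best_language, accumulators) for the given text.

    best_language is None when no word of the text appears in any table.
    """
    # One accumulator per candidate language, all starting at zero.
    accumulators = {lang: 0.0 for lang in word_tables}

    # Read the document word by word (runs of letters, case-folded).
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, table in word_tables.items():
            weight = table.get(word)
            if weight is not None:
                # A listed word matched: add its normalized frequency
                # to the accumulator of the respective language.
                accumulators[lang] += weight

    # The identified language has the highest-valued accumulator.
    best = max(accumulators, key=accumulators.get)
    if accumulators[best] == 0.0:
        return None, accumulators
    return best, accumulators


if __name__ == "__main__":
    language, sums = identify_language(
        "The cat sat in the shade of the tree and went to sleep"
    )
    print(language, sums)
```

Note that a word such as "in" appears in more than one table in this sketch, so it contributes to more than one accumulator; the discrimination comes from the accumulated sums rather than from any single word.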
However, the applicants have also taught that weighting in the accumulation process is less preferred, and that it can be eliminated if the actual frequency of occurrence of words in each of the candidate natural languages can be established and word tables having substantially equivalent coverage of the respective candidate languages can be assembled.
The present application improves upon the basic invention of word counting for natural language determination, so that language identification can be performed in the most efficient and expeditious manner.
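As a hedged illustration of the unweighted alternative, the following Python sketch simply counts matches, adding one to a language's accumulator for each listed word found. The function name `count_hits` and the contents of `WORD_LISTS` are hypothetical, and the short lists shown are not claimed to give the substantially equivalent coverage that this approach assumes.

```python
import re

# Hypothetical word lists of common words. No frequencies are stored,
# because every match counts equally in the unweighted scheme.
WORD_LISTS = {
    "english": {"the", "of", "and", "to", "in"},
    "german": {"der", "die", "und", "in", "den"},
    "french": {"de", "la", "le", "et", "les"},
}


def count_hits(text, word_lists=WORD_LISTS):
    """Return a dict mapping each language to its count of matched words."""
    counts = dict.fromkeys(word_lists, 0)
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, words in word_lists.items():
            if word in words:
                # Simple counting: each match adds one, with no weighting.
                counts[lang] += 1
    return counts


if __name__ == "__main__":
    counts = count_hits("Der Hund und die Katze schlafen in den Betten")
    # The language with the largest count is taken as the identified one.
    print(max(counts, key=counts.get), counts)
```

Under the assumptions above, the only per-word work is a set membership test and an integer increment, which is the kind of simplification that removing the weights is meant to allow.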