The subject invention relates generally to human language recognition technology. More particularly, the invention relates to a technique for identifying the language used in a computerized document.
Computers and computer networks have intensified the transmission of coded documents between people who speak and write in different natural languages, and the internet has recently accelerated this process. This results in several problems. In the prior art, for example, when an electronic document was sent across national boundaries, computer system operations were interrupted so that a human being could determine the natural language of the received document before performing a given operation, such as selecting, displaying, or printing, which may depend upon the peculiarities of a given natural language. In the context of an internet search, unless the user is multilingual, he is likely to be interested only in the retrieved documents in his native language or, at any rate, only in those languages he reads. Furthermore, there is an increasing use of visual and audio segments in advertising materials, educational products, and other items available on the internet. It is extremely useful, before the time-consuming download of a visual and audio segment, to ensure that it is understandable or, alternatively, to provide for translation or substitution to a desired language.
The invention described herein eliminates the need for such human intervention by automatically determining the correct natural language of the computer-recorded document.
The general problem was recognized in the prior art before the applicants' own contributions. In the area of automated language identification of coded text, the prior art used n-gram character-based systems, which handle each character multiple times, a process which consumes a great deal of system resources when compared to the applicants' word-based technique described below. In speech recognition systems, language recognition uses language and speech characteristics, e.g., trigrams or emphasis, which require large amounts of text to be parsed and measured and large amounts of time for processing. These techniques rely on some form of matching algorithm based on language statistics that are not meaningful in a linguistic context.
Prior systems using trigrams, n-grams, and other artificial divisions of a computerized text are not considered reliable. They are also very slow and consume considerable computer time, because they handle each character of a document multiple times; e.g., each document character appears in three different trigrams. Characteristics that are measured in, or derived from, written languages but are not actual components of them, such as trigrams or letter sequences, have limited success in identifying the correct language and require large amounts of text to be parsed and measured. Similarly, prior systems which depend on the attributes of individual characters and their local contexts are also limited when applied to the problem of identifying a language.
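The observation that each document character appears in three different trigrams can be illustrated with a short sketch. The function names `trigrams` and `char_touches` are hypothetical and are introduced here only for illustration; they do not describe any particular prior art system.

```python
def trigrams(text):
    """Return every overlapping three-character sequence in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def char_touches(text):
    """Count how many trigrams each character position takes part in."""
    touches = [0] * len(text)
    for start in range(len(text) - 2):
        # The trigram starting at `start` covers three positions.
        for pos in range(start, start + 3):
            touches[pos] += 1
    return touches


if __name__ == "__main__":
    sample = "language"
    print(trigrams(sample))
    print(char_touches(sample))
```

Every character other than the first two and the last two of the text is handled three times, whereas a word-based comparison examines each word of the text once.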
In the invention described herein, none of the prior art techniques, e.g., classifying language by signal waveform characteristics, trigrams, n-grams, or artificial divisions of written language, are used. Instead, as in the applicants' related inventions, words are read from a computer document and compared to predetermined lists of words selected from a plurality of languages of interest. The word lists comprise relatively few of the most commonly used words in each language; statistically, a significant percentage of all words in any document will be the most common words used in its language. The language or genre of the document is identified by a process that determines which language's word list most closely matches the words in the document.
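The statistical premise above, that a short list of common words accounts for a significant share of the words in a document, can be sketched as follows. The word list `COMMON_ENGLISH`, the tokenizing rule, and the function names are assumptions made for illustration only; the actual word lists and their sizes are not given in this section.

```python
import re

# Hypothetical, heavily abbreviated list of common English words.
# It is illustrative only and is not taken from the applicants' word lists.
COMMON_ENGLISH = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}


def tokenize(text):
    """Split text into lowercase words made of letters only (an assumed rule)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def coverage(text, word_list):
    """Return the fraction of words in text that appear in word_list."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in word_list)
    return hits / len(words)


if __name__ == "__main__":
    print(coverage("The cat is in the house", COMMON_ENGLISH))
```

A document written in the language of the word list would be expected to show a markedly higher fraction than a document written in another language, which is the property the word-list comparison relies upon.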
In related applications, the applicants have taught that the closeness of match can be determined by a weighted or nonweighted sum of the occurrences of the words in the word lists for each language or genre of interest. The nonweighted sum is called the simple counting embodiment. Each language's word list and the associated frequency of occurrence for each word in the list are kept in a word table. Each word table is linked with a respective accumulator whose value is increased each time a word from an inputted document matches one of the common words in that table.
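The word tables and accumulators described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the table contents and frequency values in `WORD_TABLES` are invented for the example, and the names `identify_language` and `weighted` are hypothetical.

```python
import re

# Hypothetical word tables: common word -> frequency of occurrence.
# Both the words chosen and the frequency values are made up for illustration.
WORD_TABLES = {
    "english": {"the": 0.06, "of": 0.03, "and": 0.03, "in": 0.02, "is": 0.01},
    "french": {"de": 0.05, "la": 0.03, "le": 0.03, "et": 0.02, "est": 0.01},
    "german": {"der": 0.04, "die": 0.04, "und": 0.03, "in": 0.02, "ist": 0.01},
}


def identify_language(text, tables=WORD_TABLES, weighted=False):
    """Return (best_language, accumulators) for the given text.

    With weighted=False each match adds 1 to the table's accumulator
    (simple counting); with weighted=True it adds the word's frequency.
    """
    accumulators = {language: 0.0 for language in tables}
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for language, table in tables.items():
            if word in table:
                accumulators[language] += table[word] if weighted else 1.0
    # If no word matches or totals tie, max() keeps the first table listed.
    best = max(accumulators, key=accumulators.get)
    return best, accumulators


if __name__ == "__main__":
    print(identify_language("Le chat est dans la maison et le jardin"))
    print(identify_language("Der Hund ist in der Stadt", weighted=True))
```

Note that a word such as "in" appears in more than one table in this sketch, so a single input word can increase several accumulators; the language is chosen from the accumulator with the largest total.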
The present application is an improvement of the basic inventions of word counting for natural language determination. It should provide a relatively greater degree of discrimination in language identification than the weighted or simple counting methods proposed by the applicants in prior applications.