Character recognition is language-dependent. When a character recognition device is not suited to the language of a document image, a high recognition rate is not obtained. To address this language dependency, a large number of prior art techniques have focused on certain aspects of language recognition. One exemplary prior art technique reduces the size of a document image for language recognition. Runs of black pixels are extracted from the reduced document image so as to generate minimal circumscribing rectangles. In the English language, the minimal circumscribing rectangles are fused with each other for every word to form a connected rectangle part. Since the number of characters that form a word in English is limited, the width-to-height ratio of the connected minimal circumscribing rectangles often ranges from 2 to 6 or 7. On the other hand, the minimal circumscribing rectangles vary significantly in the Japanese language. In some instances, the minimal circumscribing rectangles in Japanese are so long that rectangles of that length rarely appear in English. In other instances, a minimal circumscribing rectangle corresponds to a single character in Japanese.
Based upon the above described length of the minimal circumscribing rectangles, one prior art technique determines the language of a document. The connected minimal circumscribing rectangles are grouped into three categories, namely short, middle and long, for each character line or character area. That is, if a line is oriented in a horizontal direction of a page, the ratio is determined based upon the width and the height of the minimal circumscribing rectangles. For example, if the width/height ratio is equal to or smaller than two, the corresponding minimal circumscribing rectangle is categorized as a short one. Similarly, if the width/height ratio ranges from two to six, the corresponding minimal circumscribing rectangle is categorized as a middle one. Lastly, if the width/height ratio is equal to or above six, the corresponding minimal circumscribing rectangle is categorized as a long one. The frequency of occurrence of each category is compared to a predetermined threshold value in order to determine the language, such as English or Japanese.
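The categorization described above can be illustrated with a minimal Python sketch. The function names, the (width, height) tuple representation, the 0.5 threshold and the decision rule (a high share of middle rectangles suggests English) are assumptions made for illustration only; the prior art technique does not specify them here. A ratio of exactly six is assigned to the long category, following the "equal to or above six" wording.

```python
from collections import Counter


def categorize(width, height):
    """Classify one connected rectangle by its width/height ratio."""
    if height <= 0:
        raise ValueError("height must be positive")
    ratio = width / height
    if ratio <= 2:
        return "short"
    if ratio >= 6:
        return "long"
    return "middle"


def category_frequencies(rectangles):
    """Return the relative frequency of each category for one line or area."""
    counts = Counter(categorize(w, h) for w, h in rectangles)
    total = sum(counts.values())
    if total == 0:
        return {"short": 0.0, "middle": 0.0, "long": 0.0}
    return {name: counts[name] / total for name in ("short", "middle", "long")}


def guess_language(rectangles, middle_threshold=0.5):
    """Hypothetical decision rule, not taken from the prior art.

    Assumption: English word rectangles mostly fall in the middle category,
    so a middle frequency at or above the threshold is read as English.
    """
    freqs = category_frequencies(rectangles)
    return "English" if freqs["middle"] >= middle_threshold else "Japanese"
```

For instance, a line whose rectangles are mostly three to five times wider than tall would be labeled English under this assumed rule, while a line dominated by square or very long rectangles would be labeled Japanese.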
Other characteristics are also considered in prior art techniques to determine a language. For example, Japanese Patent Publication Hei 11-191135 discloses a technique to measure a distance between adjacent minimal circumscribing rectangles. The relevant portions of the disclosure include Paragraphs 40, 45, 50, 56, FIG. 3 and FIG. 20. In Japanese, the shortest distance between two adjacent minimal circumscribing rectangles on the same line often occurs between two certain components of a single Chinese character. These two components are called “hen” and “tsukuri.” In English, however, the shortest distance tends to occur between two adjacent minimal circumscribing rectangles of characters in the same word when the words are proportionally laid out.
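The distance measurement mentioned above can be sketched as follows. This is only an illustration of measuring gaps between adjacent rectangles on one horizontal line, not the method of the cited publication; the (left, right) box representation and the function names are hypothetical.

```python
def adjacent_gaps(boxes):
    """Return horizontal gaps between neighbouring boxes on one text line.

    Each box is a hypothetical (left, right) pair of x coordinates.
    Overlapping neighbours are given a gap of zero.
    """
    ordered = sorted(boxes)
    return [max(0, nxt[0] - cur[1]) for cur, nxt in zip(ordered, ordered[1:])]


def shortest_gap_index(boxes):
    """Return the index of the smallest gap in left-to-right order, or None."""
    gaps = adjacent_gaps(boxes)
    if not gaps:
        return None
    return min(range(len(gaps)), key=gaps.__getitem__)
```

The index returned identifies which pair of neighbouring rectangles is closest, which is the pair the passage above associates with components of one character in Japanese or with letters of one word in English.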
The above described language recognition techniques are based upon certain unique characteristics of the English and Japanese languages. In the above prior art techniques, the unique characteristics must be learned and analyzed in order to make a determination. Furthermore, when another language such as French or German is used in lieu of English, the above prior art techniques are not necessarily accurate in distinguishing that language from the Japanese language. It becomes almost impossible for the prior art techniques to determine whether a particular language is an Asian language or a European/US language when the candidates include multiple languages. For example, if the candidates include Japanese, Chinese, Korean, English, French, German, Italian and Spanish, the Asian language group includes Japanese, Chinese and Korean while the European/US language group includes English, French, German, Italian and Spanish. In addition, the speed of the language recognition is critical for practical applications, and the prior art techniques do not address high-speed processing.
Language recognition is useful for various reasons. One exemplary application of language recognition is a pre-processing step for an optical character recognition (OCR) system. In general, characters are recognized by applying a set of language-specific criteria including a dictionary. In other words, without correct recognition of the language, character recognition is not accurately accomplished. For this reason, it is highly desired to determine the language of text or document images. Another exemplary application is an automated delivery system that delivers incoming correspondence to an appropriate destination within the same organization. For example, when a letter in English arrives from a customer, the letter is delivered to a customer service representative who reads English. Similarly, when a letter in Spanish arrives from a customer, the letter is delivered to a customer service representative who reads Spanish. The automated delivery system thus increases efficiency in the organization.
In language recognition systems, certain aspects remain to be improved. As described above, it remains desired to recognize a particular language among a plurality of languages that include both Asian languages and European/US languages. It also remains desired that the language recognition process be accomplished at a high speed so that there is no substantial delay for the subsequent processes.