Generally speaking, optical character recognition involves the parsing of an image of a document into individual symbols and groups of symbols, and comparing the images of the symbols to pattern information representative of various characters related to the letters of an alphabet, numbers, and the like. To increase the accuracy of the recognition process, OCR engines employ techniques that are based upon the characteristics of a particular language. For instance, information about a particular language can be used to select appropriate classifiers, dictionaries, bigram and trigram symbol probabilities, as well as to recognize language-specific formats for dates, numbers, etc.
In the past, if an OCR system was capable of recognizing text in different languages, the user was required to manually specify the language of the text in a scanned image, to enable the OCR system to accurately recognize the symbols and words in the document image. For a single-language document, this task was relatively simple. However, for optimal OCR processing of multi-lingual pages, different zones containing text in different respective languages needed to be demarcated, and each zone identified with the correct language label. Such manual intervention can be labor intensive, which results in greater expense and significantly slows down the overall image-to-text conversion process.
As businesses continue to grow globally, multi-lingual documents are becoming more common. Examples of such documents include user manuals that are targeted for multiple countries, and hence might have multiple languages on one page, and travel brochures which provide concise amounts of information in a variety of multi-lingual layouts. In these types of documents, the same type of information might be described in different languages in different paragraphs, columns or pages. Other examples of multi-lingual documents include airline magazines that provide information for international tourists and business travelers, international business correspondence which may have a heading in the language of the originating country and a main body in the language of the destination country, and immigration documents which contain instructions in multiple languages. Thus, there is a growing need for the ability to automatically discriminate between, and identify, different languages in a single document.
In the past, efforts at automatic language identification have employed one of two general approaches. In one approach, the language identification relies on features that are extracted from images of word tokens. Examples of this approach are described in T. Nakayama and A. L. Spitz, "European Language Determination From Image", Proc. of Int. Conf. on Document Analysis and Recognition, Tsukuba, Japan, 1993; P. Sibun and A. L. Spitz, "Language Determination: Natural Language Processing From Scanned Document Images", Proc. of Conf. on App. Natural Language Processing, Stuttgart, Germany, 1994; and A. L. Spitz, "Script And Language Determination From Document Images", Proc. of Symp. on Document Analysis and Information Retrieval, Las Vegas, Nev., 1994. Techniques of the type described in these references require a significant amount of text in the subject language, in order for the identification to be reliable. If the text language changes on a relatively frequent basis, e.g., from line to line, it is not possible to obtain sufficient statistical feature-based evidence to distinguish one language from the other.
A second approach to language identification utilizes word frequency and bigram probabilities. Examples of this approach are described in H. S. Baird, D. Gilbert, and D. J. Ittner, "A Family Of European Page Readers", Proc. of Int. Conf. on Pattern Recognition, Jerusalem, Israel, 1994; and D. Lee, C. Nohl, and H. Baird, "Language Identification In Complex, Unoriented, And Degraded Document Images", Proc. of IAPR Workshop on Document Analysis Systems, Malvern, Pa., 1996. This approach is only applicable to documents of the type in which each page contains text in a single language. It does not provide the capability to distinguish between two different languages on the same page, absent prior manual segmentation. Furthermore, it requires document images having relatively high fidelity, in order to provide reliable transition probabilities for the language models.
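By way of illustration only, the general idea of scoring text against per-language bigram probabilities can be sketched as follows. This is a minimal sketch and not the method of the cited references: it operates on already-recognized characters rather than on document images, the function names (`bigrams`, `train`, `score`, `identify`) are hypothetical, and the three short training sentences are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter

def bigrams(text):
    """Return character bigrams of the lowercased text; spaces mark word boundaries."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]

def train(samples):
    """Build one bigram count table per language from {language: sample text}."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def score(text, counts, vocab_size):
    """Log-likelihood of the text under one language's counts (add-one smoothing)."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1) / (total + vocab_size))
        for bg in bigrams(text)
    )

def identify(text, models):
    """Return the language whose bigram model gives the text the highest score."""
    # Shared vocabulary: every bigram seen in any training sample.
    vocab_size = len(set().union(*models.values()))
    return max(models, key=lambda lang: score(text, models[lang], vocab_size))

# Illustrative training text only (assumed); a usable model needs far more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat est assis sur le tapis",
}

models = train(SAMPLES)
print(identify("the dog and the cat", models))
print(identify("der hund und die katze", models))
```

Because each score is a sum over the bigrams of the whole input, a sketch of this kind assigns a single label to whatever text it is given, which reflects the limitation noted above: a page containing several languages would first have to be segmented by some other means.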
It is desirable, therefore, to provide a system for automatically distinguishing between and identifying multiple languages which does not require prior manual input and which is capable of partitioning an image into homogeneous language regions, to reliably identify a plurality of different languages on a single page, and thereby enable optical character recognition to be effected with greater speed and accuracy.