1. Field of the Invention
This invention pertains in general to language identification for optical character recognition (OCR).
2. Description of the Related Art
Optical character recognition (OCR) is a technology widely used to transform document images into computer-readable text. In order to perform OCR, a dictionary or “language pack” that indicates the writing system for the OCR engine to use must be specified. A writing system consists of a set of character symbols and grammatical rules. Writing systems are generally associated with languages, and some languages have more than one writing system. For instance, the Japanese language can be written using the Kanji, Hiragana and Katakana writing systems. Related writing systems such as English, German and French generally use the same character symbols but differ in lexicon and language structure. Other writing systems, such as Hiragana and Katakana, use the same grammatical structure with two different sets of symbols.
The quality of output from OCR is highly dependent on the selection of the correct dictionary. Therefore, automating the application of OCR technology requires the ability to identify, from a document image, the writing system a document is written in, so that a dictionary can be selected for performing OCR. The simplest method of writing system identification based on a document image entails applying OCR to the document image using all possible dictionaries and determining the cross entropy of the document text obtained through OCR with a language model. Because this method requires OCR and validation to be performed using a large number of writing system dictionaries, it is very computationally expensive. Accordingly, two techniques are often used to reduce the computational expense of methods that use OCR as a first step: performing OCR selectively with dictionaries specifying writing systems with a high likelihood of occurrence (e.g., English), and sampling the document images. However, both of these techniques compromise the accuracy of the method.
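By way of illustration only, the cross-entropy validation step described above can be sketched as follows. The sketch assumes that OCR has already been run once per dictionary and that its text output is available; the character-bigram language model with add-one smoothing, the function names, and the sample strings are all hypothetical choices made for this example and are not taken from any particular OCR engine.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Build a character-bigram model (illustrative stand-in for a language model)."""
    bigrams = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    contexts = Counter(text[:-1])            # counts of each leading character
    vocab = set(text)
    return bigrams, contexts, vocab

def cross_entropy(text, model):
    """Average bits per character bigram of `text` under `model` (add-one smoothing)."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1                       # one extra slot for unseen characters
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("inf")                  # too little text to score
    total = 0.0
    for a, b in pairs:
        p = (bigrams[(a, b)] + 1) / (contexts[a] + v)
        total -= math.log2(p)
    return total / len(pairs)

def pick_dictionary(ocr_outputs, models):
    """Score each dictionary's OCR output against its own model; lowest entropy wins."""
    scores = {name: cross_entropy(ocr_outputs[name], models[name])
              for name in ocr_outputs}
    return min(scores, key=scores.get), scores

# Hypothetical training text for each writing system (German is ASCII-transliterated).
models = {
    "english": train_bigram_model(
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "german": train_bigram_model(
        "der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}

# Hypothetical OCR results for one English page, one per dictionary tried.
ocr_outputs = {
    "english": "the cat sat on the mat",
    "german": "tbc eat sal an lhe rnat",
}

best, scores = pick_dictionary(ocr_outputs, models)
print(best, scores)
```

The cost noted above is visible in the structure of the sketch: every candidate dictionary requires a full OCR pass to produce its entry in `ocr_outputs` before any score can be computed.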
Alternative techniques for writing system identification use image analysis to determine the language of a document before using OCR. These techniques use two-dimensional image features to determine characters in a document image. While two-dimensional image features can be used to accurately identify characters, these features are independent of the local context that reflects grammatical rules, and they are computationally expensive to generate and compare.