Automatic conversion of scanned documents into editable and searchable text requires accurate and robust Optical Character Recognition (OCR) systems. An OCR system recognizes text in a scanned image by segmenting the input image into characters. To do so, the OCR system is first trained with sample images of characters and their corresponding ground truths. Through repeated training on text in a given script, the OCR system learns to identify the different characters of that script.
OCR systems for non-cursive scripts, such as those for English text, have reached a high level of accuracy. One of the main reasons for this accuracy is the ability to automatically preprocess non-cursive text down to isolated characters, which are then provided as input to the OCR system. Each character in a non-cursive script can be isolated because characters in such scripts inherently do not touch one another. Once each character is isolated, a corresponding character-level ground truth may be provided in order to train the OCR system.
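One common way to isolate non-touching characters is a vertical projection profile, which cuts the image wherever a column contains no ink. The following is a minimal sketch of that general technique, not a description of any particular OCR system; the function name `segment_columns` and the tiny binary images are hypothetical and chosen only for illustration.

```python
def segment_columns(image):
    """Split a binary image (rows of 0/1, where 1 is ink) into
    (start, end) column spans separated by fully blank columns."""
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # a glyph begins
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column ends the glyph
            start = None
    if start is not None:
        spans.append((start, width))       # glyph runs to the right edge
    return spans


# Hypothetical image of two glyphs separated by blank columns.
separate = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1],
]

# The same glyphs joined by a connecting stroke along the bottom row.
touching = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]

print(segment_columns(separate))  # [(0, 2), (4, 6)]
print(segment_columns(touching))  # [(0, 6)]
```

In this sketch the separated glyphs yield two spans, while the joined glyphs collapse into a single span, because the connecting stroke leaves no blank column at which to cut.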
However, with cursive scripts such as Arabic script, isolating individual characters in order to train an OCR engine is complex. This is due to the touching nature of characters written in Arabic script. Additionally, Arabic text may include diacritics, such as dots and accent marks placed above or below a letter to indicate the pronunciation of the letter. As a result, known preprocessing techniques used by OCR systems designed for recognizing non-cursive text cannot accurately process Arabic text. Further, many Arabic letters have three or four shapes, depending on whether the letter is placed at the beginning of a word, in the middle of the word, at the end of the word, or as a standalone letter. These characteristics make it difficult to automatically segment Arabic text into individual characters.
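The position-dependent shapes can be illustrated with the Unicode Arabic Presentation Forms-B block, which encodes the contextual forms of Arabic letters as separate code points. The sketch below uses Python's standard `unicodedata` module to list the forms of the letter beh (U+0628); the helper name `positional_forms` is hypothetical, and the example only illustrates the shape variation, not how an OCR system would handle it.

```python
import unicodedata


def positional_forms(letter):
    """Return a dict mapping a form name (e.g. 'initial') to the
    presentation-form character that decomposes to `letter`."""
    target = f"{ord(letter):04X}"
    forms = {}
    # Arabic Presentation Forms-B occupies U+FE70..U+FEFF.
    for cp in range(0xFE70, 0xFF00):
        # Decomposition looks like '<initial> 0628'; it is empty for
        # unassigned code points and longer for ligatures.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


beh = "\u0628"  # ARABIC LETTER BEH
for name, char in sorted(positional_forms(beh).items()):
    print(f"{name:9s} U+{ord(char):04X} {unicodedata.name(char)}")
```

Each listed form is a visually distinct glyph for the same underlying letter, which is one reason a recognizer for Arabic text must cope with more shapes than there are letters.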
Currently, to train an OCR engine to recognize Arabic text, the individual characters of each word in the Arabic text need to be manually demarcated and the corresponding ground truth entered for each demarcated character. When a large set of documents is used to train an OCR engine, manually demarcating the characters in each word and then entering the ground truth for each character is tedious and error-prone.
Therefore, there is a need for a method and apparatus for automatically identifying character segments for character recognition based on one or more of a word-level ground truth and a line-level ground truth.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the invention.