Automatic conversion of scanned documents into editable and searchable text requires the use of accurate and robust Optical Character Recognition (OCR) systems. OCR systems for English text have reached a high level of accuracy for several reasons. One of the main reasons is that English text can be preprocessed down to isolated characters, which are then provided as input to the OCR system. Each character of printed English text can be isolated because the characters do not touch one another. However, touching scanned characters present a challenge to OCR systems and reduce their accuracy when the pitch is variable.
Scanned Arabic text includes series of touching characters and is therefore harder to segment into characters. Another difficulty is that Arabic text may include many dots and accent marks placed above or below the letters to indicate the pronunciation of the letter and the vowel that follows it. These properties prevent known preprocessing techniques designed for English from accurately processing Arabic text.
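The segmentation difficulty described above can be illustrated with a minimal sketch of column-projection segmentation, a common way to isolate non-touching characters. It is not the method of the present disclosure. The function name `segment_columns` and the toy bitmaps are hypothetical and chosen only for illustration; the sketch assumes a binarized image represented as rows of 0 (background) and 1 (ink).

```python
def segment_columns(bitmap):
    """Split a binary bitmap (list of rows, 1 = ink) at blank pixel columns.

    Returns a list of (start, end) column ranges, with end exclusive.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # Vertical projection: the amount of ink in each pixel column.
    profile = [sum(row[x] for row in bitmap) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # a run of inked columns begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends the run
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right edge
    return segments


# Three glyphs separated by blank columns, as in printed English text.
separated = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
]

# The same glyphs joined by a stroke along the bottom row,
# loosely imitating connected script.
connected = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

print(segment_columns(separated))
print(segment_columns(connected))
```

With the separated bitmap the sketch returns one range per glyph, whereas the connected bitmap yields a single range covering all three glyphs, because no pixel column is blank.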
A further characteristic of Arabic text is that it can be written with or without the accent marks that indicate the vowels. Additionally, while an English letter can have either an uppercase representation or a lowercase representation, many Arabic letters have three or four shapes depending on whether the letter is placed at the beginning of a word, in the middle of the word, at the end of the word, or as a standalone letter. Therefore, the various combinations possible with Arabic text due to the accent marks and the location of a letter within a word make preprocessing Arabic text with present OCR preprocessing systems inaccurate.
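The positional shapes and optional accent marks described above are also visible at the text-encoding level, which may help convey how many variants a single letter can take. The following sketch uses Python's standard `unicodedata` module and the Unicode Arabic Presentation Forms-B code points for the letter beh. It concerns encoded text rather than scanned images and is offered only as an illustration; the helper names `base_letter` and `strip_marks` are hypothetical.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# The four positional shapes of beh in Arabic Presentation Forms-B.
FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(ch):
    """Map a positional presentation form back to its abstract letter."""
    return unicodedata.normalize("NFKC", ch)


def strip_marks(text):
    """Remove combining marks such as the Arabic short-vowel signs."""
    return "".join(c for c in text if not unicodedata.combining(c))


for position, ch in FORMS.items():
    # Every positional shape folds back to the same letter.
    print(position, unicodedata.name(ch), base_letter(ch) == BEH)

# The same word written with and without short-vowel marks (fatha, U+064E).
with_marks = "\u0643\u064E\u062A\u064E\u0628\u064E"
without_marks = "\u0643\u062A\u0628"
print(strip_marks(with_marks) == without_marks)
```

Note that the dots of a letter such as beh are part of the base letter in Unicode and are not removed by `strip_marks`; only the optional vowel marks are.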
In addition, for images having more than one column of Arabic text and non-text items, the Arabic text associated with each column may vary in font size, font style, font color, etc. Due to the varying font size, neighboring columns may not line up and cannot be accurately segmented.
Therefore, there is a need for a method and system to preprocess an image having a plurality of columns, wherein the plurality of columns includes one or more of Arabic text and non-text items.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the invention.