Automatic conversion of scanned documents into editable and searchable text requires accurate and robust Optical Character Recognition (OCR) systems. OCR systems for English text have reached a high level of accuracy for several reasons. One of the main reasons is that English text can be preprocessed down to isolated characters, which are then provided as input to the OCR system. Each character of printed English text can be isolated because printed English characters generally do not touch one another. Touching characters in a scan, however, present a challenge to OCR systems and reduce their accuracy, particularly when the pitch is variable.
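By way of illustration only, the isolation of non-touching characters can be sketched with a vertical projection profile, a common segmentation technique that is assumed here for the example and is not necessarily the technique used by any particular OCR system. The function names and the tiny binary images below are hypothetical. The sketch shows that blank columns separate non-touching glyphs, and that a single connecting stroke removes the blank column and merges the glyphs into one segment.

```python
def column_profile(image):
    """Count ink pixels in each column of a binary image (1 = ink, 0 = background)."""
    return [sum(column) for column in zip(*image)]


def split_on_blank_columns(image):
    """Return (start, end) column spans of ink separated by fully blank columns.

    Each span is half-open: columns start .. end - 1 contain ink.
    """
    spans = []
    start = None
    for x, ink in enumerate(column_profile(image)):
        if ink and start is None:
            start = x                      # a run of inked columns begins
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column ends the run
            start = None
    if start is not None:
        spans.append((start, len(image[0])))  # run reaches the right edge
    return spans


# Two glyphs separated by one blank column (hypothetical toy data).
separated = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]

# The same glyphs joined by a connecting stroke along the bottom row.
touching = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

print(split_on_blank_columns(separated))  # two spans, one per glyph
print(split_on_blank_columns(touching))   # one span covering both glyphs
```

In the second image no column is entirely blank, so a blank-column test alone cannot find the boundary between the two glyphs.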
Scanned Arabic text includes series of touching characters and is therefore harder to segment into individual characters. Another difficulty is that Arabic text may include many dots and accent marks placed above or below the letters to indicate the pronunciation of a letter and the vowel that follows it. These features prevent known preprocessing techniques designed for English from accurately processing Arabic text.
A further characteristic of Arabic text is that it can be written with or without the accent marks that indicate the vowels. Additionally, while an English letter has only an uppercase representation and a lowercase representation, many Arabic letters have three or four shapes depending on whether the letter is placed at the beginning of a word, in the middle of a word, at the end of a word, or in isolation. Therefore, the many combinations that arise in Arabic text from the accent marks and from the location of a letter within a word make preprocessing of Arabic text with present OCR preprocessing systems inaccurate.
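By way of illustration only, the positional shapes and optional vowel marks described above can be observed at the character-encoding level, which is distinct from the image-level problem addressed here. The sketch below assumes Python's standard `unicodedata` module and the Unicode Arabic Presentation Forms-B code points for the letter beh; the helper names `base_letter` and `strip_marks` are hypothetical.

```python
import unicodedata

# Unicode presentation forms of the Arabic letter beh (U+0628).
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(presentation_form):
    """Map a positional presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", presentation_form)


def strip_marks(text):
    """Remove combining marks, such as Arabic vowel marks, from text."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# Four distinct shapes, one underlying letter.
for position, form in BEH_FORMS.items():
    print(position, unicodedata.name(form), "->", unicodedata.name(base_letter(form)))

# The same letter written with and without a vowel mark (fatha, U+064E).
with_vowel = "\u0628\u064E"
print(len(with_vowel), len(strip_marks(with_vowel)))
```

A recognizer working on images has no such normalization step available, so each shape, with or without marks, appears as a visually different pattern.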
Therefore, there is a need for a method and system that take the above characteristics of Arabic text into account when preprocessing an image comprising Arabic text and non-text items for OCR.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the invention.